NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. They are looking for highly motivated Senior Software Engineers to join their Fabric Networking team with a focus on NVLink Rack-Scale Systems Stability & Reliability, contributing directly to the software foundation powering next-generation datacenter deployments.
Responsibilities:
- Drive platform bringup, feature enablement, end-to-end software validation, and debug for next-generation NVLink-based GPU and rack-scale systems
- Develop tools, diagnostics, automation, and infrastructure for system validation, regression testing, and fleet support
- Lead reliability and MTBI validation through stress testing, telemetry analysis, failure injection, and issue resolution
- Triage complex software, firmware, networking, and platform issues across validation, deployment, and production environments
- Collaborate with architecture, hardware, firmware, software, and customer engagement teams to improve system quality and reliability
- Build and maintain SRE-style validation infrastructure, including provisioning, monitoring, and operational readiness
- Create automation, dashboards, runbooks, and debug workflows that improve root-cause analysis and operational efficiency
Requirements:
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or related field, or equivalent experience
- 5+ years of experience in system software, firmware, networking, platform enablement, data center infrastructure, or distributed systems
- Strong programming skills in C/C++ and Python; Bash/Shell scripting experience is a plus
- Strong system-level debugging skills across software, firmware, hardware, and networking layers
- Solid networking fundamentals, including TCP/IP, Ethernet and/or InfiniBand, RDMA/RoCE, routing, switching, and fabric performance analysis
- Experience with large-scale AI systems, including platform bringup, validation, reliability engineering, stress testing, telemetry analysis, and root-cause debugging
- Ability to triage complex multi-domain issues using logs, telemetry, experiments, and structured debugging methods
- Strong communication and collaboration skills across engineering, customer, and operations teams
- Passion for building reliable next-generation AI infrastructure and solving complex system-level challenges at scale
- Experience with NVIDIA GPU systems, NVLink, NVSwitch, CUDA, and large-scale AI/HPC clusters such as NVIDIA GB200 NVL72
- Strong understanding of large-scale AI system architecture, including PCIe, memory hierarchy, DMA, high-speed interconnects, and distributed training/inference systems
- Experience with server management technologies, data center operations, cluster provisioning, scaling, and fleet monitoring
- Proven experience building diagnostics, automation, CI/CD pipelines, dashboards, and reliability tooling