NVIDIA has been transforming computer graphics and accelerated computing for over 25 years. As a Senior Software Engineer - Datacenter Systems, you will design, build, and improve software systems for datacenter provisioning and management, working with large-scale GPU clusters to support high-performance and AI workloads.
Responsibilities:
- Develop and manage software for hands-off datacenter provisioning and lifecycle management, including rack installation, bare-metal networking configuration, and cluster scaling
- Build and implement scalable release train architectures that modularize systems and enable independent, reliable release cycles
- Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for core infrastructure services to ensure high availability and reliability
- Develop intuitive user interfaces (UIs) and APIs for internal provisioning and management tools, making cluster operations and visibility more straightforward
- Lead the technical requirement definition process, clearly articulating requirements, inputs, outputs, and quantifiable outcomes for new infrastructure features and system improvements
- Build and maintain CI/CD pipelines that support fast, reliable integration and deployment across complex systems
- Build tools and automation workflows that simplify software releases, manage dependencies, and increase reliability
- Automate software updates and monitor system health to improve reliability and availability
- Resolve operational issues across distributed infrastructure, and manage firmware and software rollouts to minimize downtime and ensure consistency
- Work with global engineering teams to align infrastructure tooling and support project goals
Requirements:
- BS or MS in Computer Science, Computer Engineering, or a related field, or equivalent experience
- 8+ years of experience managing infrastructure or systems in high-performance or distributed environments
- Expertise in programming with Python, Rust, C++, and shell scripting, or similar languages
- Practical experience with modern CI/CD and infrastructure-as-code tools and practices such as Jenkins, GitLab, Ansible, GitOps, and Kubernetes
- Ability to use AI coding tools and agents effectively to increase your efficiency
- Strong understanding of Linux, networking, and building distributed systems
- Ability to break down monolithic systems into scalable, loosely coupled components
- Excellent communication and collaboration skills across cross-functional teams
- Demonstrated experience implementing SRE practices, specifically defining and tracking SLIs, SLOs, and SLAs
- Proficiency with observability tools such as Prometheus and Grafana for system health monitoring and analysis
- Experience crafting user-facing components (front-end or CLI) for infrastructure management tools
- Experience with cluster management tools like Slurm, as well as familiarity with NVIDIA DGX systems and GPU-based clusters such as GB200, GB300, and VR-NVL72
- Consistent track record of leading DevOps process improvements and driving team efficiency