NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. As a Senior Software Engineer - Datacenter Systems, you will join NVIDIA's software infrastructure team to design, build, and improve software systems for datacenter provisioning and management, focusing on large-scale GPU clusters.

Responsibilities:

Develop and manage software for hands-off datacenter provisioning and lifecycle management, including rack installation, bare-metal networking configuration, and cluster scaling
Build and implement scalable release train architectures that modularize systems and enable independent, reliable release cycles
Define, monitor, and enforce Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) for core infrastructure services to ensure high availability and reliability
Develop intuitive user interfaces (UIs) and APIs for internal provisioning and management tools, making cluster operations and visibility more straightforward
Lead the technical requirement definition process, clearly articulating requirements, inputs, outputs, and quantifiable outcomes for new infrastructure features and system improvements
Build and maintain CI/CD pipelines that support fast, reliable integration and deployment across complex systems
Build tools and automation workflows that simplify software releases, manage dependencies, and increase reliability
Automate software updates and monitor system health to improve reliability and availability
Resolve operational issues across distributed infrastructure, and manage firmware and software rollouts to minimize downtime and ensure consistency
Work with global engineering teams to align infrastructure tools and support project goals

Requirements:

BS or MS in Computer Science, Computer Engineering, or a related field, or equivalent experience
8+ years of experience managing infrastructure or systems in high-performance or distributed environments
Expertise in programming with Python, Rust, C++, and shell scripting, or similar languages
Practical experience with modern CI/CD tools and infrastructure-as-code frameworks such as Jenkins, GitLab, Ansible, GitOps, and Kubernetes
Ability to use AI coding tools and agents effectively to increase your efficiency
Strong understanding of Linux, networking, and building distributed systems
Ability to break down monolithic systems into scalable, loosely coupled components
Excellent communication and collaboration skills with cross-functional teams
Demonstrated experience implementing SRE practices, specifically defining and tracking SLIs, SLOs, and SLAs
Proficiency with observability tools such as Prometheus and Grafana for system health monitoring and analysis
Experience crafting user-facing components (front-end or CLI) for infrastructure management tools
Experience with cluster management tools like Slurm as well as familiarity with NVIDIA DGX systems and GPU-based clusters such as GB200, GB300, and VR-NVL72
Consistent track record of leading DevOps process improvements and driving team efficiency
