NVIDIA is building the operating model for reliable, scalable GPU infrastructure. They are seeking an Engineering Manager to lead a team focused on Kubernetes-based operations, automation, reliability, and cluster lifecycle tooling while enhancing production systems and engineering practices for DGX Cloud infrastructure.

Responsibilities:

Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments
Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response
Help define team priorities, roadmap, staffing, and operational ownership
Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness
Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes
Coach engineers, grow technical leaders, and create clear ownership across ambiguous problem spaces

Requirements:

8+ years of overall industry experience, including 2+ years leading or managing engineers
Experience building or operating production infrastructure, cloud platforms, Kubernetes environments, or distributed systems
Strong understanding of reliability engineering, automation, observability, incident response, and operational excellence
Ability to work across teams and influence without direct authority
Clear communication, strong prioritization, and sound judgment in fast-moving environments
BS/MS in Computer Science or equivalent experience
Experience leading SRE, production engineering, infrastructure automation, or platform teams
Experience with GPU infrastructure, Kubernetes fleet operations, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud environments
Track record of reducing toil, improving SLOs, and turning operational work into software-driven systems

Engineering Manager, DGX Cloud Production Engineering
