NVIDIA is building the operating model for reliable, scalable GPU infrastructure. They are seeking an Engineering Manager to lead a team focused on Kubernetes-based operations, automation, reliability, and cluster lifecycle tooling while enhancing production systems and engineering practices for DGX Cloud infrastructure.
Responsibilities:
- Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments
- Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response
- Help define team priorities, roadmap, staffing, and operational ownership
- Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness
- Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes
- Coach engineers, grow technical leaders, and create clear ownership across ambiguous problem spaces
Requirements:
- 8+ years of overall industry experience, including 2+ years leading or managing engineers
- Experience building or operating production infrastructure, cloud platforms, Kubernetes environments, or distributed systems
- Strong understanding of reliability engineering, automation, observability, incident response, and operational excellence
- Ability to work across teams and influence without direct authority
- Clear communication, strong prioritization, and sound judgment in fast-moving environments
- BS/MS in Computer Science or equivalent experience
- Experience leading SRE, production engineering, infrastructure automation, or platform teams
- Experience with GPU infrastructure, Kubernetes fleet operations, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud environments
- Track record of reducing toil, improving SLOs, and turning operational work into software-driven systems