NVIDIA is looking for an experienced HPC DevOps and Network Engineer to help build the supercomputers and HPC clusters of the future. The Senior HPC DevOps Engineer will play a key role in driving advancements in artificial intelligence and GPU computing by designing and maintaining large-scale HPC/AI clusters and developing automation tools for deployment and operation.

Responsibilities:

Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems
Infrastructure as Code (IaC): Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments
Streamline CI/CD Pipelines: Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes
Automate Everything: Develop automation scripts and tools to automate deployment, configuration management, and operational monitoring
Develop complex network automation
Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency
Lead and Educate: Serve as a technical resource, developing and sharing best practices with internal teams
Drive Innovation: Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs) for future improvements

Requirements:

B.Sc. in Computer Science, Engineering, or a related field, and 5+ years of experience
Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles
Familiarity with Jenkins, Ansible, Puppet/Chef
Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security
Deep understanding of networking technologies such as InfiniBand and Ethernet
Experience with job scheduling and orchestration tools such as Slurm and Kubernetes
Background with multiple storage solutions and file systems such as Lustre, GPFS, ZFS, and XFS
Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix)
Familiarity with cloud platforms (AWS, Azure, Google Cloud)
Proven networking experience or strong knowledge through professional networking training
Architectural Insight: Knowledge of CPU and/or GPU architecture
Container Expertise: Understanding of Kubernetes and container-related microservice technologies
GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA)
RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics
