NVIDIA is looking for an experienced HPC DevOps and Network Engineer to help build the supercomputers and HPC clusters of the future. The Senior HPC DevOps Engineer will play a key role in driving advancements in artificial intelligence and GPU computing by designing and maintaining large-scale HPC/AI clusters and developing automation tools for deployment and operation.
Responsibilities:
- Innovate and Implement: Design, implement, and maintain large-scale HPC/AI clusters with state-of-the-art monitoring, logging, and alerting systems
- Infrastructure as Code (IaC): Utilize and develop tools to manage infrastructure as code, ensuring scalable and repeatable deployments
- Streamline CI/CD Pipelines: Develop and maintain continuous integration and continuous delivery (CI/CD) pipelines to automate and streamline deployment processes
- Automate Everything: Develop automation scripts and tools to automate deployment, configuration management, and operational monitoring
- Network Automation: Develop complex networking automation
- Troubleshoot Complex Issues: Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency
- Lead and Educate: Serve as a technical resource, developing and sharing best practices with internal teams
- Drive Innovation: Support R&D activities and engage in proofs of concept (POCs) and proofs of value (POVs) for future improvements
Requirements:
- B.Sc. in Computer Science, Engineering, or a related field with 5+ years of experience
- Deep knowledge of HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software
- Advanced proficiency in programming and scripting languages, with a solid understanding of object-oriented programming principles
- Familiarity with Jenkins, Ansible, Puppet/Chef
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu), networking, and OS-level security
- Deep understanding of networking technologies and protocols such as InfiniBand and Ethernet
- Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes
- Background with multiple file systems and storage solutions such as Lustre, GPFS, ZFS, and XFS
- Expertise with virtual systems (VMware, Hyper-V, KVM, Citrix)
- Familiarity with cloud platforms (AWS, Azure, Google Cloud)
- Proven networking experience or strong knowledge through professional networking training
- Architectural Insight: Knowledge of CPU and/or GPU architecture
- Container Expertise: Understanding of Kubernetes and container-related microservice technologies
- GPU Focus: Experience with GPU-focused hardware/software (DGX, CUDA)
- RDMA Fabrics: Background with RDMA (InfiniBand or RoCE) fabrics