Vero is an exciting AI infrastructure startup that collaborates closely with NVIDIA and other key organizations. They are seeking a Senior GPU DevOps Engineer to design, deploy, and manage pipelines and infrastructure for large-scale NVIDIA GPU platforms, focusing on optimizing performance and resource utilization in Kubernetes environments.
Responsibilities:
- Design, deploy and manage DevOps pipelines supporting large-scale GPU infrastructure and distributed AI/ML and HPC workloads for customers
- Automate provisioning, monitoring and maintenance of high-performance GPU environments
- Optimize performance, stability and resource utilization across Kubernetes clusters in a liquid-cooled data center environment
- Work closely with infrastructure, hardware and software teams to integrate compute, networking and cooling systems
- Monitor system health and build tooling to track temperature, pressure and power usage across high-density environments
- Troubleshoot deployment, scaling and performance issues across distributed GPU systems
- Implement CI/CD pipelines, infrastructure-as-code and security best practices to support reliable deployments at scale
- Develop automation and tooling using Python and Bash
Requirements:
- GPU and HPC systems experience in large-scale, customer-facing environments
- 5+ years of experience in DevOps, SRE, platform engineering or infrastructure roles
- Experience with Kubernetes and Slurm
- Terraform, Ansible, Bash or Python
- CI/CD pipelines for large-scale distributed systems
- Monitoring and observability tools such as Prometheus and Grafana, and hardware telemetry interfaces such as Redfish
- Comfortable operating in high-availability, uptime-critical environments
- AI-enabled storage platform experience (e.g. VAST, WEKA, DDN)
- Exceptional debugging skills, including low-level HPC networking troubleshooting