Vero is an exciting AI infrastructure startup that collaborates closely with NVIDIA and other key organizations. The company is seeking a Senior GPU DevOps Engineer to design, deploy, and manage pipelines and infrastructure for large-scale NVIDIA GPU platforms, with a focus on optimizing performance and resource utilization in Kubernetes environments.

Responsibilities:

Design, deploy and manage DevOps pipelines supporting large-scale GPU infrastructure and distributed AI/ML and HPC workloads for customers
Automate provisioning, monitoring and maintenance of high-performance GPU environments
Optimize performance, stability and resource utilization across Kubernetes clusters in a liquid-cooled data center environment
Work closely with infrastructure, hardware and software teams to integrate compute, networking and cooling systems
Monitor system health and build tooling to track temperature, pressure and power usage across high-density environments
Troubleshoot deployment, scaling and performance issues across distributed GPU systems
Implement CI/CD pipelines, infrastructure-as-code and security best practices to support reliable deployments at scale
Develop automation and tooling using Python and Bash

Requirements:

GPU & HPC systems experience in large-scale, customer-facing environments
5+ years of experience in DevOps, SRE, platform engineering or infrastructure roles
Kubernetes & Slurm
Terraform, Ansible, Bash or Python
CI/CD pipelines for large-scale distributed systems
Monitoring and observability tools such as Prometheus, Grafana or Redfish
Comfortable operating in high-availability, uptime-critical environments
AI-enabled storage platform experience (e.g., VAST, WEKA, DDN)
Elite debugger & low-level HPC networking troubleshooter
