Nscale is a GPU cloud engineered for AI, providing high-performance infrastructure for AI-native startups and global enterprises. The company is seeking a Senior Site Reliability Engineer to design, build, and operate reliable, scalable infrastructure across its GPU cloud, with a focus on hands-on engineering and operational excellence.

Responsibilities:

Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads
Contribute to the development of control-plane systems and operational frameworks
Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability
Participate in incident response and root cause analysis, driving improvements to reduce recurrence
Identify and address reliability and performance bottlenecks across systems
Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes
Drive improvements in availability, scalability, and operational efficiency
Mentor junior engineers and contribute to a strong engineering and reliability culture

Requirements:

5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments
Strong software engineering skills with experience building automation and infrastructure tooling
Solid understanding of Linux systems, networking, and distributed systems
Experience troubleshooting issues across infrastructure, OS, networking, and application layers
Familiarity with monitoring, alerting, and observability tools
Ability to balance reliability, performance, and delivery speed
Experience with AI or HPC environments, including GPUs or high-performance systems
Exposure to high-speed networking (InfiniBand/RDMA)
Familiarity with Kubernetes, cloud platforms, or bare-metal environments
Experience with observability systems in high-scale environments

Senior Site Reliability Engineer - AI Infrastructure Operations
