Nscale is a GPU cloud engineered for AI, providing high-performance infrastructure for AI-native startups and global enterprises. The company is seeking a Senior Site Reliability Engineer to design, build, and operate reliable, scalable infrastructure across its GPU cloud. The role focuses on hands-on engineering and operational excellence.
Responsibilities:
- Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads
- Contribute to the development of control-plane systems and operational frameworks
- Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability
- Participate in incident response and root cause analysis, driving improvements to reduce recurrence
- Identify and address reliability and performance bottlenecks across systems
- Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes
- Drive improvements in availability, scalability, and operational efficiency
- Mentor junior engineers and contribute to a strong engineering and reliability culture
Requirements:
- 5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments
- Strong software engineering skills with experience building automation and infrastructure tooling
- Solid understanding of Linux systems, networking, and distributed systems
- Experience troubleshooting issues across infrastructure, OS, networking, and application layers
- Familiarity with monitoring, alerting, and observability tools
- Ability to balance reliability, performance, and delivery speed
- Experience with AI or HPC environments, including GPUs or high-performance systems
- Exposure to high-speed networking (InfiniBand/RDMA)
- Familiarity with Kubernetes, cloud platforms, or bare-metal environments
- Experience with observability systems in high-scale environments