Roblox is a platform that empowers users to create and connect in immersive digital experiences. As a Principal Software Engineer on the Compute team, you will lead GPU and AI accelerator capabilities, ensuring reliability and performance across the accelerator fleet while collaborating with other engineering teams to drive GPU strategy.
Responsibilities:
- Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end
- Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management, GPU health and telemetry, and remediation of GPU-specific failures (Xid errors, ECC memory errors, thermal events, and NVLink and fabric faults)
- Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads
- Drive GPU reliability and performance at fleet scale, defining the detection, diagnosis, and automated repair of unhealthy accelerators before they impact production
- Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns
- Establish the standards, tooling, and APIs that let other engineering teams consume GPU compute safely and efficiently, reducing toil and raising the bar for the org