RadixArk is an infrastructure-first AI company seeking a Member of Technical Staff — Supercomputing to build and operate production-grade AI infrastructure. The role involves deploying and managing AI workloads, ensuring system reliability, and collaborating with engineering teams to enhance deployment processes.
Responsibilities:
- Deploy SGLang, Miles, and RadixArk infrastructure across customer, cloud, VPC, and dedicated cluster environments
- Bring up production inference and training workloads for open-weight and customer-specific models
- Own deployment reliability, environment management, rollout processes, and production validation
- Debug issues across LLM serving, Kubernetes, networking, GPU infrastructure, storage, cloud capacity, and customer systems
- Build and improve observability for latency, throughput, uptime, error rates, GPU utilization, memory usage, capacity, and workload health
- Improve monitoring, alerting, incident response, runbooks, postmortems, and operational processes
- Help design capacity planning, autoscaling, and reliability strategies for GPU-intensive workloads
- Work closely with engineering teams to improve deployment tooling, automation, CI/CD, and production readiness
- Partner with customer engineering teams during POCs, production launches, and ongoing operations
- Feed deployment and reliability pain points back into the product and engineering roadmap
- Help build the foundation for a world-class supercomputing deployment and reliability organization