About this role

RadixArk is an infrastructure-first AI company seeking a Member of Technical Staff — Supercomputing to build and operate production-grade AI infrastructure. The role involves deploying and managing AI workloads, ensuring system reliability, and collaborating with engineering teams to enhance deployment processes.

Responsibilities:

Deploy SGLang, Miles, and RadixArk infrastructure across customer, cloud, VPC, and dedicated cluster environments
Bring up production inference and training workloads for open-weight and customer-specific models
Own deployment reliability, environment management, rollout processes, and production validation
Debug issues across LLM serving, Kubernetes, networking, GPU infrastructure, storage, cloud capacity, and customer systems
Build and improve observability for latency, throughput, uptime, error rates, GPU utilization, memory usage, capacity, and workload health
Improve monitoring, alerting, incident response, runbooks, postmortems, and operational processes
Help design capacity planning, autoscaling, and reliability strategies for GPU-intensive workloads
Work closely with engineering teams to improve deployment tooling, automation, CI/CD, and production readiness
Partner with customer engineering teams during POCs, production launches, and ongoing operations
Feed deployment and reliability pain points back into the product and engineering roadmap
Help build the foundation for a world-class supercomputing deployment and reliability organization
