RadixArk is an infrastructure-first company focused on democratizing frontier-level AI infrastructure. The company is seeking a Member of Technical Staff to architect and scale core compute platforms for AI training and inference. The role centers on systems engineering across cluster architecture, networking, and performance optimization.
Responsibilities:
- Architect and scale large AI compute clusters for training and inference
- Design cluster management, scheduling, and resource allocation systems
- Optimize performance, utilization, and reliability of GPU/TPU clusters
- Improve fault tolerance and system resilience at scale
- Drive observability, monitoring, and performance profiling for cluster infrastructure
- Collaborate with ML and systems engineers to support frontier AI workloads
- Lead capacity planning and infrastructure scaling strategies
- Build internal platforms and tooling to improve developer productivity
- Document architecture, operational practices, and reliability strategies
- Contribute to long-term platform vision and technical direction