Baseten powers mission-critical inference for leading AI companies by providing flexible infrastructure and developer tooling. The company is seeking a Software Engineer to architect and develop its training platform, supporting research engineers and model developers while making key technical decisions for the infrastructure.
Responsibilities:
- Design and architect scalable infrastructure systems for our ML training platform (e.g. scheduling, storage, and networking)
- Partner closely with developers and research engineers to translate complex training requirements into technical solutions
- Design and architect a global training scheduler
- Design and architect reinforcement learning systems and continuous learning pipelines
- Drive long-term improvements to system reliability and development velocity
- Partner closely with SRE and Capacity teams to unlock state-of-the-art training infrastructure
- Make critical architectural decisions balancing performance with system reliability
- Lead technical discussions and mentor junior engineers on infrastructure best practices
- Contribute to long-term technical strategy and infrastructure roadmap
Requirements:
- Bachelor's degree or higher in Computer Science or a related field
- Proficiency in Go, with Python experience a plus
- Deep expertise with Kubernetes in production environments
- Extensive experience with major cloud providers (AWS, GCP); experience with neo-cloud providers (Crusoe, DigitalOcean, Nebius) a plus
- Advanced understanding of distributed systems concepts and performance tuning
- Proven experience designing observability systems
- Experience with ML/AI workloads and MLOps platforms highly valued
- Experience with distributed storage systems
- Experience with workload orchestration platforms like Temporal or Airflow
- Familiarity or experience with the open source training stack and frameworks (NCCL, PyTorch, Megatron, NemoRL, VeRL, Axolotl, HF Trainer) and distributed training techniques (FSDP, DeepSpeed)
- Experience developing AI products, tooling, or agents