Mind Robotics is building generalized physical AI to create robotic systems capable of adaptive work in industrial environments. The company is seeking a Machine Learning Infrastructure Engineer to build the core systems that make model training fast and scalable.
Responsibilities:
- Design and implement scalable systems for training large ML models
- Enable efficient workflows for data ingestion, training, and iteration
- Develop and optimize distributed training systems across hundreds of GPUs
- Implement strategies for parallelization, sharding, and efficient compute utilization
- Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
- Partner closely with modeling teams to accelerate iteration speed and reduce training costs
- Build internal tools for experiment tracking, monitoring, and debugging
- Implement systems for tracking training performance, failures, and resource utilization
- Debug and resolve bottlenecks across the training stack
- Provide lightweight infrastructure support for deploying and running models in production environments
- Optimize inference performance and reliability where needed
- Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
- Manage compute resources efficiently across training jobs