Mind Robotics is building generalized physical AI to create robotic systems capable of adaptive work in industrial environments. The company is seeking a Machine Learning Infrastructure Engineer to build the core systems that make model training fast and scalable.

Responsibilities:

Design and implement scalable systems for training large ML models
Enable efficient workflows for data ingestion, training, and iteration
Develop and optimize distributed training systems across hundreds of GPUs
Implement strategies for parallelization, sharding, and efficient compute utilization
Improve training efficiency through techniques such as attention optimizations, kernel fusion, and memory management
Partner closely with modeling teams to accelerate iteration speed and reduce training costs
Build internal tools for experiment tracking, monitoring, and debugging
Implement systems for tracking training performance, failures, and resource utilization
Debug and resolve bottlenecks across the training stack
Provide lightweight infrastructure support for deploying and running models in production environments
Optimize inference performance and reliability where needed
Support core cloud infrastructure needs for training workloads (without heavy DevOps overhead)
Manage compute resources efficiently across training jobs
