General Motors is seeking an experienced Staff Machine Learning Engineer specializing in ML Training Infrastructure. The role involves leading the design and development of scalable AI/ML platforms to support advanced AI research and model development, as well as collaborating with cross-functional teams to drive technical initiatives.
Responsibilities:
- Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
- Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
- Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
- Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
- Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
- Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
- Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team
Requirements:
- Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
- 7+ years of professional software engineering experience
- 5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
- Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
- Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
- Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
- Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
- Willingness to travel to Sunnyvale, CA as needed
- Comfortable operating in highly ambiguous and dynamic environments
- Deep expertise in PyTorch 2.x or later and in distributed training frameworks
- Experience designing and developing training platforms that support FSDP, pipeline parallelism, and other scalable solutions for training large foundation models
- Experience profiling, analyzing, debugging, and optimizing training and data loading performance at scale
- Strong record of technical leadership through architecture reviews, roadmap influence, and cross-team execution
- Excellent communication skills, with the ability to build consensus, navigate controversial decisions, communicate risks clearly, and provide constructive technical feedback
- Self-motivated, execution-oriented, and driven to deliver broad organizational impact