General Motors is seeking an experienced Staff Machine Learning Engineer specializing in ML Training Infrastructure. The role involves leading the design and development of scalable AI/ML platforms to support advanced AI research and model development, as well as collaborating with cross-functional teams to drive technical initiatives.

Responsibilities:

Define and drive the architecture, design, and development of scalable, reliable, and high-performance ML frameworks and platform capabilities to support model training at scale
Lead model training performance analysis and optimization efforts across distributed training workflows, improving scalability, efficiency, and cost across heterogeneous hardware environments
Raise the bar on system observability, debuggability, operational excellence, and developer experience across the ML training stack
Own large, ambiguous, cross-functional technical initiatives from strategy through execution, including technical roadmap definition, tradeoff analysis, and delivery
Influence platform direction by identifying long-term infrastructure investments, setting engineering standards, and driving adoption of best practices across teams
Collaborate across organizational boundaries to align requirements, resolve technical disagreements, and integrate new capabilities into the platform ecosystem
Mentor engineers through design reviews, technical guidance, and hands-on partnership, while elevating engineering quality across the team

Requirements:

Bachelor's degree or higher in Computer Science or a related field, or equivalent practical experience
7+ years of professional software engineering experience
5+ years of specialized experience in AI/ML infrastructure, such as enabling distributed training for large-scale ML models
Strong programming skills in Python, with deep proficiency in frameworks such as PyTorch (preferred), TensorFlow, or similar ML systems
Proven experience designing and operating distributed systems for ML training, including distributed computing, GPU computing, and cloud environments (AWS, GCP, Azure)
Demonstrated track record of leading technically ambiguous, cross-team infrastructure initiatives and driving them to measurable impact
Strong architectural judgment and ability to make sound technical tradeoffs across performance, reliability, usability, and cost
Willingness to travel to Sunnyvale, CA as needed
Comfortable operating in highly ambiguous and dynamic environments
Deep expertise in PyTorch 2.x+ and distributed training frameworks
Experience designing and developing training platforms that support FSDP, pipeline parallelism, and other scalable solutions for training large foundational models
Experience profiling, analyzing, debugging, and optimizing training and data loading performance at scale
Strong record of technical leadership through architecture reviews, roadmap influence, and cross-team execution
Excellent communication skills, with the ability to build consensus, navigate controversial decisions, communicate risks clearly, and provide constructive technical feedback
Self-motivated, execution-oriented, and driven to deliver broad organizational impact
