Thinking Machines Lab is dedicated to advancing collaborative general intelligence. The Research Engineer, Infrastructure role focuses on designing and building the core systems that make training large models scalable and efficient, so that research teams can concentrate on the science rather than on system bottlenecks.

Responsibilities:

Design, implement, and optimize distributed training systems that scale across thousands of GPUs and nodes
Develop high-performance optimizations to maximize throughput and efficiency
Develop reusable frameworks and libraries to improve training reproducibility, reliability, and scalability for new model architectures
Establish standards for reliability, maintainability, and security, ensuring systems are robust under rapid iteration
Collaborate with researchers and engineers to build scalable infrastructure
Publish and share learnings through internal documentation, open-source libraries, or technical reports that advance the field of scalable AI infrastructure

Research Engineer, Infrastructure, Training Systems
