TrueFoundry is a company building the foundational infrastructure for production AI systems. They are seeking a Staff ML Platform Engineer to develop and optimize large-scale ML models and ensure high-performance inference pipelines.

Responsibilities:

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models
Build a platform for developing, deploying, and evaluating agentic applications for our end customers
Help shape internal standards and best practices across the engineering team for high-scale ML workloads

Staff ML Platform Engineer – Large Scale Training (LLMOps/MLOps)
