TrueFoundry builds the foundational infrastructure for production AI systems. The company is seeking a Staff ML Platform Engineer to develop and optimize large-scale ML models and deliver high-performance inference pipelines.
Responsibilities:
- Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance
- Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools
- Own the infrastructure and code that enable high-throughput, low-latency inference pipelines for state-of-the-art models
- Build a platform for developing, deploying, and evaluating agentic applications for our end customers
- Help shape internal standards and best practices across the engineering team for high-scale ML workloads