Reddit is a community-driven platform known for its open conversations and vast user base. The company is seeking a Staff Machine Learning Engineer to lead the development of a large-scale ML Inference Platform that supports teams across Reddit and enhances user experiences through advanced AI systems.
Responsibilities:
- Lead the end-to-end design, implementation, and maintenance of a highly available, low-latency GPU-based model serving system for search, ranking, and LLMs that supports millions of QPS
- Design and develop ML and Generative AI systems in cloud-based production environments on Kubernetes at scale
- Rapidly prototype and build a high-performance feature hydration and processing system as part of the inference stack, including routing, caching, and batching
- Lead development of a unified GPU model export framework that converts trained models into optimized GPU inference models
Requirements:
- 7+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles
- Experience operating orchestration systems such as Kubernetes at scale
- Deep experience with cloud-based technologies for supporting an ML platform, such as AWS, Google Cloud Storage, and infrastructure-as-code tools (Terraform)
- Proficiency with the common programming languages of ML engineering, such as Go and Python
- Excellent communication skills with the ability to articulate technical AI concepts to non-technical stakeholders
- Strong focus on scalability, reliability, performance, and ease of use
- Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems
- Strong proficiency in Python and deep experience with modern AI/ML frameworks (Triton, Dynamo, vLLM, PyTorch)
- Strong understanding of real-time ML observability to track feature/model performance
- Experience working with LLM serving online at scale
- Experience building an end-to-end (E2E) inference performance benchmarking framework
- Deep understanding of multi-cluster compute environments and network topologies specific to ML inference use cases