Reddit, Inc. is a community-driven platform that hosts open conversations on the internet. It is seeking a Senior Machine Learning Systems Engineer to enhance its Ads ML Experience Platform. The role involves designing and building large-scale ML experimentation platforms, developing production-grade training orchestration frameworks, and collaborating with ML engineers to improve operational efficiency.

Responsibilities:

Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows
Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining
Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility
Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency
Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation
Design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows, including multi-agent orchestration, memory/context systems, and scalable workflow infrastructure

Requirements:

5+ years of experience in infrastructure/platform engineering or large-scale distributed systems
2+ years of hands-on experience building and operating production ML infrastructure, developer SDKs, platform APIs, or self-service AI tooling
Experience building workflow orchestration systems, developer platforms, or large-scale automation frameworks
Experience with distributed data processing systems such as Spark, Flink, Ray, or equivalent technologies
Experience with modern orchestration and workflow technologies such as Kubeflow, Argo, Airflow, or similar frameworks
Experience building offline ML experimentation platforms, model registries, experiment tracking systems, or training orchestration frameworks
Experience building and operating agentic AI systems, including multi-agent orchestration, autonomous workflows, and agent communication/runtime frameworks (e.g., MCP, A2A, and orchestration systems) is a strong plus
Experience running end-to-end model development and iteration cycles at scale is a plus

Senior Machine Learning Systems Engineer, Ads ML Experience Platform
