Reddit, Inc. is a community-driven platform that hosts open conversations on the internet. The company is seeking a Senior Machine Learning Systems Engineer to enhance its Ads ML Experience Platform. The role involves designing and building large-scale ML experimentation platforms, developing production-grade training orchestration frameworks, and collaborating with ML engineers to improve operational efficiency.
Responsibilities:
- Design and build large-scale offline ML experimentation platforms that enable reproducible research, model development, evaluation, and promotion workflows
- Develop production-grade training orchestration frameworks supporting distributed training, hyperparameter optimization, model evaluation, and automated retraining
- Build infrastructure for experiment tracking, metadata management, lineage, artifact versioning, model registries, and reproducibility
- Partner with ML engineers and researchers to improve experimentation velocity and operational efficiency
- Build automated workflows for model promotion, rollback, compliance validation, and continuous evaluation
- Design and build an agentic AI execution platform supporting autonomous and human-in-the-loop workflows, including multi-agent orchestration, memory/context systems, and scalable workflow infrastructure
Requirements:
- 5+ years in infrastructure/platform engineering or large-scale distributed systems
- 2+ years of hands-on experience building and operating production ML infrastructure, developer SDKs, platform APIs, or self-service AI tooling
- Experience building workflow orchestration systems, developer platforms, or large-scale automation frameworks
- Experience with distributed data processing systems such as Spark, Flink, Ray, or equivalent technologies
- Experience with modern orchestration and workflow technologies such as Kubeflow, Argo, Airflow, or similar frameworks
- Experience building offline ML experimentation platforms, model registries, experiment tracking systems, or training orchestration frameworks
- Experience building and operating agentic AI systems, including multi-agent orchestration, autonomous workflows, and agent communication/runtime frameworks (e.g., MCP, A2A, and orchestration systems) is a strong plus
- Experience running end-to-end model development and iteration cycles at scale is a plus