Own and maintain our internal benchmark suite, covering single- and multi-turn content guardrails and agentic safety.
Build benchmarks that distinguish specific model capabilities.
Work with the product team to build evals covering core functionality of our flagship models.
Build benchmarks for new features coming out of the research team.
Adapt and extend evals to new verticals and changing product data.
Work on research projects that study and quantify realistic agentic and LLM failure modes in the wild.
Requirements
Have built an LLM benchmark from scratch that distinguished specific model capabilities (i.e., produced a measurable, defensible capability difference, not just a score).
Have built synthetic data for post-training textual or multimodal models.
Can reproduce a published benchmark result and identify where the original methodology is fragile or misleading.
You write Python that other people can build on. Our whole stack is Python; we want someone who has shipped and maintained production code and who factors messy problems into clean abstractions others can extend.
You can write efficient LLM inference setups, including sensible orchestration of parallel calls, retries, and rate-limit handling.
You are an AI power user, fluent with frontier models and coding agents day to day.
A big plus: automated red-teaming experience; having worked across a range of agentic scaffolds and reproduced public benchmark results on them; strong knowledge of existing reward-model, monitoring, and safety benchmarks; one or more published papers in the evals or safety-evaluation space.
Tech Stack
Python
Benefits
Paid time off in line with your local regulations, no matter where you work from.
Comprehensive medical insurance for our France-based team.
All the hardware, tools, and services you need.
Covered subscriptions for AI agents and IDEs.
Team off-sites twice a year: we’ve recently been to the Alps and to Saint-Tropez