LawZero is a non-profit organization committed to advancing research and creating technical solutions for safe-by-design AI systems. The Senior ML Data Processing Developer will be responsible for developing and managing data pipelines that transform raw data into training datasets for AI, ensuring data quality and compliance throughout the process.
Responsibilities:
- Partner with the Research team to define, build, automate, scale, and manage data pipelines that transform raw web-scale data into training datasets for the Scientist AI
- Build and maintain data processing pipelines covering deduplication, model-based quality scoring, heuristic filtering, toxicity removal, PII scrubbing, metadata extraction, and proprietary data transformations, with full dataset versioning and provenance tracking. Optimize these pipelines for throughput and cost at scale
- Ensure all ingested data meets compliance requirements, internal Data Governance policies, and legal obligations
- Develop and refine the scoring and filtering toolchain: heuristics, LLM-as-a-judge evaluators, ML classifiers, metadata extraction modules, and human-in-the-loop review workflows required for data processing and quality assurance
- Instrument data processing pipelines with data-quality monitoring, guardrails, and alerting to catch regressions before they propagate downstream
- Collaborate with the Research team and other teams to understand evolving data requirements, then identify and acquire large-scale text corpora that meet those requirements. This includes conducting systematic coverage analyses to identify gaps in the corpus, developing targeted acquisition strategies to address them, and working with the Legal & Governance team to license new data sources
- Design and maintain strict leakage detection mechanisms to guard against evaluation contamination across all stages of the data processing pipeline
- Build internal tooling and interfaces that let researchers explore, query, and understand available datasets with minimal friction