Torc Robotics is a leader in autonomous driving technology, focusing on developing software for automated trucks. They are seeking a Senior Autonomy Data Engineer to design and operate the data infrastructure that supports their autonomy program, ensuring reliable data pipelines and effective collaboration with cross-functional teams.
Responsibilities:
- Own the design and organization of the program’s data lake, including schema definitions, partitioning strategy and metadata indexing
- Design and maintain end-to-end pipelines that reliably ingest high-bandwidth sensor logs from vehicles into cloud storage and tolerate ad-hoc and intermittent connectivity
- Develop data validation and integrity checks that can detect corrupted information, missing sensors, and inconsistent calibration prior to the data being processed by downstream systems
- Implement retention, tiering and lifecycle policies for data to balance storage costs with development value
- Build tooling that queries raw logs to produce curated training and evaluation datasets
- Build automation to run cost-effective pseudo-labeling workflows at the scale of data ingest
- Implement data quality and model performance metrics that are used to direct labeling effort toward the highest-value examples
- Deploy and maintain data visualization tooling to support log review, annotation QA, and autonomy debugging workflows
- Build integrations between the visualization tooling and the data lake so engineers can navigate from a dataset entry or model failure directly to the origin log data
- Work with autonomy engineers to define and surface custom visualization panels and implement metrics for analyzing unstructured operating environments
- Build dashboards that give autonomy engineers visibility into data coverage by terrain type, operating environment and geographic region
- Establish and document data contracts between the data services and model training consumers
- Partner with perception, planning and embedded engineers across the data lifecycle: from shaping the logging schemas and collection triggers to defining the dataset interfaces that supply model training and evaluation
- Define data engineering standards, best practices, and tooling choices for an innovative and fast-paced team
- Contribute to the data roadmap and provide input to technical leadership on investment priorities
- Mentor junior engineers and raise the team’s capabilities in data infrastructure scalability and operational hygiene
Requirements:
- Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Electrical Engineering or a related field with 6+ years of data engineering experience, or a Master's degree with 4+ years
- Strong proficiency in Python and SQL, with demonstrated ability to build production-quality data pipelines
- Deep experience with cloud data infrastructure (AWS preferred: S3, Glue, Athena, Redshift, or equivalent) and infrastructure-as-code tools (Terraform, CloudFormation)
- Solid understanding of data partitioning strategies and columnar storage formats (Parquet, ORC, etc.)
- Experience building and operating data pipelines that process time-series and binary data
- Proven ability to evaluate and integrate open-source tooling when appropriate versus building from scratch
- Strong instincts for delivering data quality through first-class implementations of monitoring, validation and lineage tracking
- Experience with autonomous vehicles, robotics, or other sensor-driven autonomous systems
- Deep experience with Foxglove or Rerun beyond basic playback, e.g. building custom extensions or integrating them into a structured log review or annotation QA workflow
- Familiarity with the MCAP CLI and/or Python library and experience converting MCAP data to columnar formats for further querying and processing
- Experience with data curation for ML training, e.g. diversity sampling, pseudo-labeling, and dataset versioning