Cartesia is on a mission to architect AI that learns from and interacts with the world like humans do. The Tech Lead Manager, Data Infrastructure will own the strategy and execution for all data at Cartesia, leading a team to acquire, process, and curate multimodal datasets that power the company's cutting-edge research.
Responsibilities:
- Define Cartesia's multimodal data strategy across pre-training and post-training, spanning human, synthetic, and web-scale sources, with particular depth in audio
- Lead, mentor, and eventually manage a team of engineers building dataset and ML data infrastructure
- Design and operate scalable, high-throughput data pipelines for text, audio, and video, covering ingestion, preprocessing, augmentation, dataset versioning, and data loading for training
- Partner closely with research and inference teams so data systems are co-designed with training and serving infrastructure (batching, GPU-aware loading, evaluation pipelines)
- Establish and enforce rigorous standards for data quality, with a tight feedback loop between dataset characteristics and model behavior
- Identify and source novel datasets; manage relationships and budgets with external data vendors and partners