Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence. The company is seeking a deeply technical Member of Technical Staff to own the distributed training infrastructure for large-scale omni model pretraining, building and scaling the training stack end to end.
Responsibilities:
- Own the distributed training stack for omni model pretraining, from 0→1 system design to 1→10 scaling across large GPU clusters
- Build and operate the core training runtime: job orchestration, distributed execution, checkpointing, recovery, monitoring, and debugging for long-running training jobs
- Optimize large-scale training performance by tuning parallelism strategy, GPU communication, memory usage, and data throughput to improve model FLOPs utilization (MFU), step time, and end-to-end training efficiency
- Build infrastructure for omni training workloads: high-throughput audio/video/text data loading, temporal alignment, variable sequence handling, multimodal synchronization, and memory-efficient training
- Evolve the platform as model architectures, training recipes, data mixtures, sequence lengths, hardware constraints, and research directions change