TikTok, the leading destination for short-form mobile video, is seeking a Research Scientist to focus on privacy-preserving large-scale model training and architecture optimization. The role involves designing and optimizing training architectures for generative models while keeping privacy a priority in technology innovation.
Responsibilities:
- Design and optimize large-scale training architectures for diffusion-based and unified generative models (e.g., DiT, Rectified Flow, hybrid AR + diffusion systems)
- Lead GPU-centric performance optimization, including memory layout, communication overlap, kernel fusion, and throughput scaling across thousands of accelerators
- Develop and evolve distributed training strategies (data, tensor, and pipeline parallelism; ZeRO / FSDP-style sharding) tailored to long-running, multi-stage foundation model training
- Build fault-tolerant, self-healing training systems that can sustain long-running jobs under frequent hardware, network, and software failures
- Design mechanisms for fast failure detection, recovery, and minimal training interruption, including checkpointing strategies, restart policies, and controlled rollouts
- Improve training efficiency, as measured by effective training time ratio (ETTR), model FLOPs utilization (MFU), and overall hardware utilization, under real-world production constraints
- Optimize Diffusion Transformer training pipelines, including noise schedules, timestep strategies, and memory-efficient attention mechanisms
- Support unified generation-and-understanding models, enabling shared context, long-sequence multimodal reasoning, and scalable training without architectural bottlenecks
- Collaborate with research teams on architecture-level tradeoffs between quality, compute efficiency, and training stability