Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence. They are seeking an early-career engineer to optimize model inference for real-time conversations, focusing on reducing latency and enhancing performance across their AI systems.
Responsibilities:
- Contribute to end-to-end inference optimization across our model stack — LLMs, audio models, and diffusion-based components
- Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention
- Work with inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and extend them for our specific workloads
- Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks
- Build internal tooling that makes optimization work faster and more rigorous — profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly
- Accelerate diffusion model inference through consistency models, step distillation, caching strategies, and custom kernel optimizations
- Apply quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
- Work closely with the research and infrastructure teams to ensure new models ship with optimized serving from day one