About this role

Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence. The company is seeking an early-career engineer to optimize model inference for real-time conversations, with a focus on reducing latency and improving performance across its AI systems.

Responsibilities:

Contribute to end-to-end inference optimization across our model stack — LLMs, audio models, and diffusion-based components
Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention
Work with inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and extend them for our specific workloads
Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks
Build internal tooling that makes optimization work faster and more rigorous — profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly
Accelerate diffusion model inference — consistency models, step distillation, caching strategies, and custom kernel optimizations
Apply quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality
Work closely with research and infrastructure to ensure new models ship with optimized serving from day one

Member of Technical Staff — Model Optimization and Inference (New Grad)
