Together AI is a research-driven artificial intelligence company focused on optimizing AI systems. The role involves designing and developing distributed inference engines for large language models, emphasizing performance and scalability.

Responsibilities:

Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models
Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving
Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators
Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end model serving pipelines

LLM Inference Frameworks and Optimization Engineer
