Together AI is a research-driven artificial intelligence company focused on optimizing AI systems. The role involves designing and developing distributed inference engines for large language models, emphasizing performance and scalability.
Responsibilities:
- Design and develop a fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models
- Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, and pipeline parallelism, for high-performance serving
- Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
- Collaborate with hardware teams on performance bottleneck analysis and co-optimize inference performance for GPUs, TPUs, or custom accelerators
- Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize end-to-end model serving pipelines