Lightning AI is the company behind PyTorch Lightning, focused on building an end-to-end platform for developing, training, and deploying AI systems. They are seeking a highly skilled Research Engineer to optimize training and inference workloads on their infrastructure, working across models and systems to improve performance and scalability.

Responsibilities:

Optimize large-scale training and inference workloads across GPUs, accelerators, and distributed systems
Work directly with customers to analyze workloads, identify bottlenecks, and improve performance, scalability, and reliability of deployed AI systems
Develop and improve inference pipelines, model serving systems, and performance-oriented tooling for production AI workloads
Design and implement profiling, debugging, and observability tools to analyze model execution and guide optimization strategies
Work across the software stack to ensure performance improvements are accessible through clean APIs, automation, and seamless integration with the Lightning ecosystem
Partner with hardware vendors and ecosystem partners to support efficient execution across diverse compute backends (NVIDIA GPUs, TPUs, and emerging accelerators)
Contribute to open-source projects through new features, tooling improvements, documentation, and community engagement
Stay current with advancements in large-scale inference, distributed training, and ML systems optimization
