Lightning AI is the company behind PyTorch Lightning, focused on building an end-to-end platform for AI systems. They are seeking a Lead Research Engineer to optimize training and inference workloads on their infrastructure, driving improvements across models and systems for real-world AI applications.

Responsibilities:

Lead optimization efforts for large-scale training and inference workloads across GPUs, accelerators, and distributed systems
Partner directly with customers to analyze workloads, identify bottlenecks, and drive improvements in performance, scalability, and reliability of deployed AI systems
Architect and improve inference pipelines, model serving systems, and performance-oriented tooling for production AI workloads
Lead the design and implementation of profiling, debugging, and observability tools to analyze model execution and guide optimization strategies
Drive performance improvements across the software stack through clean APIs, automation, and seamless integration with the Lightning ecosystem
Collaborate cross-functionally with infrastructure, product, and research teams to shape technical direction and improve the developer and user experience for AI workloads running on Lightning
Partner with hardware vendors and ecosystem partners to support efficient execution across diverse compute backends (NVIDIA GPUs, TPUs, and emerging accelerators)
Contribute technical leadership to open-source projects through new features, tooling improvements, documentation, and community engagement
Stay current with advancements in large-scale inference, distributed training, and ML systems optimization, and help guide adoption of new technologies and approaches
