Lightning AI, the company behind PyTorch Lightning, is building an end-to-end platform for AI systems. It is seeking a Lead Research Engineer to optimize training and inference workloads on its infrastructure, driving improvements across models and systems for real-world AI applications.
Responsibilities:
- Lead optimization efforts for large-scale training and inference workloads across GPUs, other accelerators, and distributed systems
- Partner directly with customers to analyze workloads, identify bottlenecks, and drive improvements in performance, scalability, and reliability of deployed AI systems
- Architect and improve inference pipelines, model serving systems, and performance-oriented tooling for production AI workloads
- Lead the design and implementation of profiling, debugging, and observability tools to analyze model execution and guide optimization strategies
- Drive performance improvements across the software stack through clean APIs, automation, and seamless integration with the Lightning ecosystem
- Collaborate cross-functionally with infrastructure, product, and research teams to shape technical direction and improve the developer and user experience for AI workloads running on Lightning
- Partner with hardware vendors and ecosystem partners to support efficient execution across diverse compute backends (NVIDIA GPUs, TPUs, and emerging accelerators)
- Contribute technical leadership to open-source projects through new features, tooling improvements, documentation, and community engagement
- Stay current with advancements in large-scale inference, distributed training, and ML systems optimization, and help guide adoption of new technologies and approaches