Lightning AI, the company behind PyTorch Lightning, is building an end-to-end platform for developing, training, and deploying AI systems. It is seeking a highly skilled Research Engineer to optimize training and inference workloads on its infrastructure, working across models and systems to improve performance and scalability.
Responsibilities:
- Optimize large-scale training and inference workloads across GPUs, accelerators, and distributed systems
- Work directly with customers to analyze workloads, identify bottlenecks, and improve performance, scalability, and reliability of deployed AI systems
- Develop and improve inference pipelines, model serving systems, and performance-oriented tooling for production AI workloads
- Design and implement profiling, debugging, and observability tools to analyze model execution and guide optimization strategies
- Work across the software stack to ensure performance improvements are accessible through clean APIs, automation, and seamless integration with the Lightning ecosystem
- Partner with hardware vendors and ecosystem partners to support efficient execution across diverse compute backends (NVIDIA GPUs, TPUs, and emerging accelerators)
- Contribute to open-source projects through new features, tooling improvements, documentation, and community engagement
- Stay current with advancements in large-scale inference, distributed training, and ML systems optimization