NVIDIA, a leader in computer graphics and AI innovation, is seeking a Principal Developer in AI Networking. The role involves profiling, analyzing, and optimizing AI workloads on large-scale GPU and CPU clusters, with a focus on networking and communication in distributed deep learning systems.
Responsibilities:
- Characterizing AI workloads and deep learning models for large-scale LLM training and inference on NVIDIA supercomputers, working in distributed systems with a focus on high-performance networking and NVIDIA communication libraries
- Benchmarking, profiling, and analyzing performance to find bottlenecks and identify optimization opportunities, with a strong emphasis on networking
- Developing a PyTorch trace-based toolset for profiling, analysis, and replay to aid in benchmarking, debugging, and co-designing network systems for LLM workloads
- Collaborating with teams across the stack, from hardware to software, to provide performance analysis insights
- Defining performance test plans, setting performance expectations for new technologies and solutions, and working to achieve performance targets