Work closely with hardware and development teams to profile and analyze GPU performance at the system and kernel levels.
Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimizations on performance and scalability.
Requirements
Proficient in Unix/Linux, plus Python and Bash for automation.
Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries.
Proven ability to troubleshoot complex system issues, including hardware, software, and networking problems.
Familiarity with containerized environments (e.g., Docker, Kubernetes).