Work with customers in technical discovery to help define requirements and deliverables for their use cases and help them to effectively utilize distributed GPU computing resources.
Identify and recommend the best tools for each customer, while building boilerplate and reference implementations that can be reused for subsequent customers.
Manage GPU clusters and coordinate with IT to ensure efficient operation of NVIDIA GPUs, utilizing technologies such as InfiniBand for high-speed networking.
Oversee the deployment and maintenance of machine learning environments using virtual storage solutions and distributed computing (HPC) tools such as SLURM.
Automate infrastructure provisioning and management using tools such as Ansible and Terraform.
Collaborate with data scientists and engineers to ensure seamless integration of ML models into production environments.
Conduct performance tuning and optimization of systems to maximize throughput and reduce latency.
Stay current with the latest industry trends in machine learning technologies and HPC to ensure the use of best practices in infrastructure setup and model development.
Document and maintain operational procedures and system configurations.
Requirements
7+ years of high performance compute, distributed machine learning, GPU computing, and/or system architecture experience
Proficient in managing NVIDIA GPU environments and a familiarity with GPU computing frameworks and libraries
Strong experience with high-speed networking technologies, specifically InfiniBand
Experience with HPC job schedulers, preferably SLURM
Expertise in automating environment setup and maintenance using Ansible and Terraform
Demonstrated ability in deploying and managing virtual storage solutions
Strong coding skills in Python and familiarity with machine learning libraries and frameworks
Excellent problem-solving, communication, and teamwork skills.
Tech Stack
Ansible
Python
Terraform
Benefits
You will work with the most diverse hardware configurations and locations available to anyone in the industry as well as cutting edge GPU use cases
This role is fully remote with a high accountability and high agency culture
This role offers a competitive salary, equity and benefits