Atoms is building the machines that power the next era of progress. The company is seeking a Cluster Infrastructure Engineer to own the GPU compute fabric that trains its foundation models, with a focus on optimization, automation, and scalability.

Responsibilities:

Manage and automate our GPU training clusters, including provisioning, bootstrapping, and lifecycle management
Automate bare-metal bring-up so new machines come online quickly and reliably as we add capacity
Build software abstractions that present a clean, unified interface to our training and simulation workloads
Work at the hardware/software boundary, where speed and reliability are critical, continuously raising the bar for automation and uptime
Run day-to-day operations: diagnose and resolve issues quickly when systems are under pressure
Design our infrastructure to scale smoothly as we grow from a small cluster of machines to a large fleet

Staff Cluster Infrastructure Engineer
