Atoms is building the machines that power the next era of progress. We are seeking a Cluster Infrastructure Engineer to own the GPU compute fabric that trains our foundation models, with a focus on optimization, automation, and scalability.
Responsibilities:
- Manage and automate our GPU training clusters, including provisioning, bootstrapping, and lifecycle management
- Automate bare-metal bring-up so new machines come online quickly and reliably as we add capacity
- Build software abstractions that present a clean, unified interface to our training and simulation workloads
- Work at the hardware/software boundary, where speed and reliability are critical, continuously raising the bar for automation and uptime
- Run day-to-day operations: diagnose and resolve issues quickly when systems are under pressure
- Design our infrastructure to scale smoothly as we grow from a small cluster of machines to a large fleet