Fluidstack is focused on building civilization-scale infrastructure for AI. The Principal Operations Engineer will lead critical operations and ensure operational excellence across data center sites, driving improvements and managing incidents effectively.

Responsibilities:

Take the on-call escalation when a site hits trouble and triage it remotely, using real knowledge of the team and the systems to decide what to escalate, when, and how to keep the field crew focused without burying them
Get on a plane when it matters: travel site to site (50%+) to work live incidents and post-incident reviews on the floor, and bring the practices that worked elsewhere with you
Own root cause analysis on significant events through to closure and track corrective actions to done, killing the underlying class of failure rather than the one instance in front of you
Read the patterns across the fleet’s incidents and RCAs, push the few highest-value learnings through to closure, and stay honest about what’s achievable and what to drop instead of boiling the ocean
Carry learnings and practices from one campus to the next so a fix at one site becomes the standard everywhere before the failure repeats
Write the operational assessment standard and audit each campus against it, feeding what you find straight back into the corrective-action loop

Requirements:

You've run a live critical operation and led a team of operators, and you carry the deep, earned judgment that comes from owning the floor when it counts
You've been the person a site calls when something breaks, triaged the problem over the phone, and known exactly when to escalate and when to let the field team work it
You've authored root cause analyses on significant events and tracked corrective actions to closure, and you can show the difference between an RCA that closed a ticket and one that killed a class of failure
You've sat with a pile of RCA actions and cut it to the few that matter, because you know an operation that commits to everything finishes nothing
You've traveled site to site, walked the floor, and left each operation better than you found it, carrying the practices that worked from one into the next
You've written the standard, not just followed it, audited real sites against it without flinching from what you found, and can hold one bar across domains, including the ones you don't live in day to day
You've built an assessment, audit, qualification, or training program from scratch
Bonus: Hyperscale or large colocation experience at a scale of hundreds of MW or more
Direct exposure to Hardware or Network operations, not only Facilities
Experience with incident.io or equivalent incident tooling, plus DCIM

Principal Operations Engineer, Reliability — Data Center Operations
