Fort Mill, South Carolina, United States of America
Full Time
2 weeks ago
$112,476 - $187,460 USD
Visa Sponsor
Key skills
Cloud, Leadership, Collaboration
About this role
Role Overview
Lead and coordinate cross‑functional technical teams during major and critical incidents, ensuring timely recovery and effective stakeholder engagement.
Serve as a recovery lead during declared major incidents, maintaining focus on service restoration and customer impact.
Participate in and facilitate post‑incident reviews and post‑mortems, ensuring outcomes are actionable and measurable.
Drive high‑quality root cause analysis for major incidents using structured techniques such as the 5 Whys, fishbone diagrams, and blameless RCA.
Ensure contributing factors (process, technology, observability, automation, or human factors) are clearly identified and documented.
Partner with domain teams to translate findings into concrete remediation actions.
Develop, document, and maintain incident recovery plans, SOPs, runbooks, and playbooks in collaboration with domain owners.
Support and execute mock drills, recovery tests, and readiness exercises to improve response effectiveness.
Ensure recovery documentation remains accurate, consumable, and operationally relevant.
Work with application, infrastructure, and platform teams to improve diagnostic accuracy and time‑to‑engage during incidents.
Help establish clear ownership, escalation paths, and recovery patterns to reduce dependency on ad‑hoc tribal knowledge.
Promote repeatable recovery patterns across services.
Identify opportunities to improve service reliability, operational maturity, and recovery effectiveness.
Analyze incident data and trends to recommend targeted improvements across people, process, and technology.
Support adoption of SRE‑aligned practices, including error budgets, readiness reviews, and failure mode awareness.
Provide structured feedback to Observability, Automation, Resiliency, and Domain teams on gaps in monitoring, alerts, and diagnostics; single points of failure; and architectural or design weaknesses impacting recoverability.
Act as an operational voice to ensure post‑incident learnings inform engineering and platform decisions.
Mentor junior recovery managers or operational staff through hands‑on incident participation and coaching.
Contribute to operational training sessions, tabletop exercises, and knowledge‑sharing initiatives.
Maintain awareness of industry best practices in production operations, incident management, and SRE.
Requirements
5+ years of experience in Production Services, Incident Management, Recovery Management, Problem Management, SRE, DevOps, or related disciplines
2+ years of experience with application, infrastructure, and/or cloud technologies, enabling effective triage and informed recovery leadership
2+ years of experience using observability tools, logs, metrics, and diagnostics to troubleshoot production issues