Focus on understanding why production incidents happen and how to prevent them from recurring
Analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses
Turn incident insights into high-quality postmortems and partner with engineering teams to drive corrective actions and long-term improvements
Help shift the organization from reactive response to proactive reliability and incident prevention
Partner with engineering and software development teams to implement permanent fixes and preventive improvements
Requirements
7+ years of experience in Systems Engineering, ITSM, and Release Management/Change Management (RM/CM)
Background in SRE, Support or QA
Experience with one or more of the following SRE tools: T-APM, T-Trace, CatchPoint, Grafana
Hands-on experience and understanding of concepts and tools such as SAFe, Agile, DevOps, CI/CD, Data Analytics, and building Gen AI use cases
Experience with AI technologies, Python, SQL, data analytics, Power BI and ITSM tools (e.g., ServiceNow)
Experience with modern enterprise Release Management/Change Management and ITSM
Tech Stack
Cloud
Grafana
ITSM
Python
ServiceNow
SQL
Benefits
Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (at least 23 days of vacation each year, based on date of hire, and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law require may be available but is unprotected