Focus on understanding why production incidents happen and how to prevent them from recurring
Analyze incidents end-to-end across applications, infrastructure, and cloud environments, using observability data to identify root causes, patterns, and systemic weaknesses
Turn incident insights into high-quality postmortems and partner with engineering teams to drive corrective actions and long-term improvements
Help shift the organization from reactive response to proactive reliability and incident prevention
Partner with engineering and software development teams to implement permanent fixes and preventive improvements

7+ years in Systems Engineering, ITSM, and Release Management/Change Management (RM/CM)
Background in SRE, Support or QA
Experience with one or more of the following SRE tools: T-APM, T-Trace, Catchpoint, Grafana
Hands-on experience with and understanding of concepts and tools such as SAFe, Agile, DevOps, CI/CD, Data Analytics, and building Gen AI use cases
Experience with AI technologies, Python, SQL, data analytics, Power BI and ITSM tools (e.g., ServiceNow)
Modern Enterprise Release Management/Change Management and ITSM

Medical/Dental/Vision coverage
401(k) plan
Tuition reimbursement program
Paid Time Off and Holidays (based on date of hire, at least 23 days of vacation each year and 9 company-designated holidays)
Paid Parental Leave
Paid Caregiver Leave
Additional sick leave beyond what state and local law require may be available but is unprotected
Adoption Reimbursement
Disability Benefits (short term and long term)
Life and Accidental Death Insurance
Supplemental benefit programs: critical illness, accident, hospital indemnity, group legal
Employee Assistance Programs (EAP)
Extensive employee wellness programs
Employee discounts of up to 50% on eligible AT&T mobility plans and accessories, AT&T internet (and fiber where available), and AT&T phone

Principal System Engineer

Key skills