Drive adoption and optimization of observability tools
Identify opportunities to apply AI to reduce manual operational effort
Ensure AIOps solutions are implemented with strong governance, security, auditability, and operational trustworthiness
Create playbooks, standards, reusable patterns, and operating models
Mentor engineers and operators in modern operations practices
Requirements
Typically a BS with 12 years of experience or an MS with 10 years (or equivalent)
Strong track record leading cloud operations, platform operations, SRE, observability, or AIOps initiatives across complex enterprise environments
Strong hands-on experience designing and operating workloads on AWS
Expertise across compute, networking, storage, security, automation, and cloud operations patterns
Deep experience with modern observability and monitoring platforms such as New Relic, OpenSearch, and related tools
Proven experience applying AI to operations use cases such as event correlation, anomaly detection, alert reduction, root cause analysis, remediation support, and operational workflow automation
Strong experience designing or implementing AI agents, agentic workflows, or multi-agent systems
Strong grounding in site reliability engineering principles including service reliability, SLOs/SLIs, error budgets, automation, incident management, resilience, and continuous improvement
Demonstrated success building or scaling ChatOps practices
Strong knowledge of scripting, infrastructure automation, operational tooling, APIs, event-driven systems, and platform integration patterns
Ability to translate operational pain points into scalable technical solutions
Ability to influence technical teams and senior leaders
Experience implementing operational AI responsibly, with attention to governance, security, and auditability
Tech Stack
AWS
Cloud
Benefits
401k plan with employer match
Flexible paid time off
Holidays
Parental leave
Life and disability insurance
Health benefits including medical, dental, vision, and prescription drug coverage