Own strategy, roadmap, and delivery for Runtime SRE and Cloud Operations to meet enterprise Service Level Objectives (SLOs) and operational Service Level Agreements (SLAs)
Lead, mentor, and grow teams responsible for runtime SRE (SLOs/SLIs, observability, performance engineering, Disaster Recovery (DR), chaos testing) and Cloud Operations
Establish and own incident management processes: detection, escalation, incident command, post-incident reviews, and remediation planning; ensure rapid detection and reduced Mean Time to Repair (MTTR)
Drive observability and telemetry strategy (metrics, tracing, logs) to ensure actionable alerts and proactive detection of platform issues
Lead capacity planning, performance tuning, and disaster recovery orchestration for platform services and multi-cluster fleets
Convert Root Cause Analysis (RCA) outcomes into prioritized engineering work
Define and measure operational Key Performance Indicators (KPIs) and implement automation to reduce manual toil
Own on-call and rotation policies, runbook quality, bridge setup SLAs, and operational playbooks; ensure teams are trained and drills are executed regularly
Ensure security, compliance, and change management controls are integrated into operational procedures and emergency responses
Requirements
5+ years in cloud operations, SRE, or related roles
3+ years managing technical teams with on-call responsibilities
3+ years of experience with Kubernetes at scale and multi-cloud runtime platforms (EKS/AKS/GKE)
3+ years of experience with observability tooling (Prometheus, Grafana, OpenTelemetry, the ELK stack of Elasticsearch, Logstash, and Kibana, the EFK stack of Elasticsearch, Fluentd, and Kibana, and distributed tracing) and alerting design
Experience owning incident response and improving reliability metrics in production environments
Experience with capacity planning, performance engineering, and disaster recovery at cloud scale
Experience with automation tooling (Terraform, CI/CD, Kubernetes operators) and integrating reliability into Infrastructure as Code (IaC) pipelines