Define and monitor SLOs with product squads, using error budgets to guide prioritization decisions and control release velocity.
Lead war rooms for critical (P1/P2) incidents end to end, from triage through diagnosis to resolution, and conduct blameless post-mortems using the 5 Whys technique.
Operate production GKE (Google Kubernetes Engine) clusters and ensure their scalability, health, and security, while maintaining visibility into workloads running in AWS and Azure.
Design and evolve Cloud Build + ArgoCD pipelines with mandatory quality gates (SonarQube, image scanning, smoke tests) and define rollout strategies (canary, blue/green).
Structure and maintain Terraform modules for multi-project GCP environments, managing remote state, drift detection, and policy-as-code.
Ensure production readiness through structured logs, OpenTelemetry traces, and alerts in Datadog/Grafana, and integrate security tooling (SAST, Trivy, Secret Manager) without introducing friction into the delivery flow.
Provide structured mentorship to the team (internal staff and consultants), use code reviews as a teaching tool, and organize a sustainable on-call rota.