Deploy, upgrade, and maintain platform services across multiple clouds and regions on Kubernetes.
Build and maintain CI/CD pipelines
Make it safe and fast to ship infrastructure changes using GitOps workflows and release automation.
Build control planes
Create the APIs and tooling that make provisioning and scaling repeatable and self-service.
Own capacity planning
Track usage, forecast growth, right-size clusters, and keep infrastructure costs in check.
Build observability
Set up metrics, dashboards, and alerts using Prometheus and Grafana.
Write runbooks that make on-call clear and actionable.
Own on-call and incidents
Join the on-call rotation, resolve issues, write postmortems, and turn repeat problems into automation.
Automate everything
Deployments, upgrades, certificate rotations, failover. If you do it by hand more than once, automate it.
Drive system reliability by combining software engineering principles with AI-driven automation, moving from reactive firefighting to proactive, automated operations.
Harden security
Set up auth, encryption, secret rotation, and network policies.
Keep dependencies patched and CVEs resolved.
Own disaster recovery
Build backup strategies, test failover, and make sure platforms can survive infrastructure failures.
Enable other teams
Provide templates, patterns, and direct support to help engineering teams use platforms reliably.
Collaborate across teams
Work with Infrastructure, SRE, and Data Services on shared operational problems.
Requirements
8+ years in DevOps, SRE, or platform engineering.
Hands-on experience running stateful distributed systems on Kubernetes in production.
CI/CD experience
Building and owning pipelines using GitHub Actions, Jenkins, Tekton, or similar tools.
Infrastructure-as-code skills
Terraform, Pulumi, or Crossplane; no manual configuration.
GitOps experience
ArgoCD or Flux for managing infrastructure deployments.
Observability skills
Prometheus, Grafana, and distributed tracing tools like Jaeger or OpenTelemetry.
Database operations
Backup, restore, schema management, and performance tuning for relational and NoSQL databases.
Security mindset
You implement auth, encryption, secret management, and network policies as part of normal work.
Multi-cloud or multi-region experience
You have managed infrastructure across providers or regions.
Tech Stack
Cloud
Distributed Systems
Flux
Grafana
Jenkins
Kubernetes
NoSQL
Prometheus
Terraform
Benefits
Market leader in compensation and equity awards
Comprehensive physical and mental wellness programs
Competitive vacation and holidays to recharge
Paid parental and adoption leave
Professional development opportunities for all employees regardless of level or role
Employee Networks, geographic neighborhood groups, and volunteer opportunities to build connections