Build the roadmap for DevOps and observability in Data & AI teams.
Design and build cloud infrastructure as code with Terraform (or Pulumi / CloudFormation), packaging reusable modules for AWS, Azure or GCP.
Own CI/CD pipelines in GitHub Actions, Jenkins or GitLab CI — build, test, security scanning, blue-green or canary deploys, and automated rollback.
Operate Kubernetes clusters (EKS, AKS or GKE) and container workloads with Lens, Helm, ArgoCD or Flux, including autoscaling, ingress, secrets and policy.
Build observability with Prometheus, Grafana, OpenTelemetry, ELK or Datadog — metrics, logs, traces, dashboards and SLO-driven alerting.
Implement security and compliance controls: IAM, SSO, secrets management (Vault / KMS), vulnerability scanning, policy-as-code (OPA, Checkov) and PCI-aware patterns.
Lead incident response — on-call, runbooks, blameless post-mortems, and continuous reliability work to drive down MTTR and toil.
Partner with developers on local dev experience, golden paths, internal platform tooling and developer self-service.
Help shape internal platform standards as the stack evolves, contributing to design reviews and sharing knowledge across the India and U.S. teams.
Work in a collaborative DevOps culture, partnering closely with developers, AI engineers, QA, DBAs and product partners across environments.
Requirements
8+ years of professional DevOps, SRE or platform-engineering experience operating production services
3+ years of hands-on work building CI/CD pipelines (GitHub Actions, Jenkins, GitLab CI or CircleCI) and managing infrastructure as code (Terraform, Pulumi or CloudFormation)
Working knowledge of Kubernetes (EKS, AKS or GKE) and container tooling (Docker, Helm, ArgoCD or Flux)
Strong scripting skills in Python, Bash or Go; solid SQL skills and strong comfort with at least one cloud platform (AWS, Azure or GCP)
Hands-on experience with observability stacks: New Relic, Prometheus, Grafana, OpenTelemetry, ELK or Datadog
Solid understanding of cloud security and compliance practices, particularly in PCI-compliant or regulated environments
Proven ability to work independently and within a team, managing priorities across concurrent projects and time zones, including on-call rotations
Strong written and verbal communication skills; able to work effectively with both technical and non-technical stakeholders
Bonus Skills
Experience operating Dataiku DSS, Snowflake, or other large-scale data and analytics platforms in production
Experience with service meshes (Istio, Linkerd), API gateways, and zero-trust networking
Experience with policy-as-code (OPA / Rego, Checkov, tfsec) and supply-chain security (SBOM, Sigstore)
Experience with FinOps practices and cloud cost optimization
Experience supporting ML or LLM workloads — GPU scheduling, model-serving infra, vector databases or LangSmith / Langfuse
Experience with database administration / reliability for PostgreSQL, MySQL or Snowflake