About this role
Role Overview
Own and evolve our Kubernetes platform across multiple clusters: manage Helm chart deployments via an OCI registry hosting 40+ charts, enforce policy-as-code with Kyverno, and operate GitOps workflows through Argo CD ApplicationSets with progressive delivery orchestrated by Kargo
Drive technical designs for platform initiatives: scope the problem, propose multiple solutions, assess their trade-offs, and defend your recommendation — then see it through to production
Harden the platform's security posture: workload identity via OIDC, runtime security, image scanning, and secrets management
Write and maintain custom Kubernetes operators and internal tooling in Go and Python that multiply the team's leverage across clusters — we run Zalando postgres-operator alongside our own operators
Maintain and improve our observability stack (Prometheus, Grafana, Thanos, OpenSearch): build dashboards and alerts that give product teams real visibility into their services
Keep GitLab CI/CD pipelines fast and reliable — they power ~150 production deployments per month; execute cross-team changes (rollouts, database migrations, certificate rotations) with care and clear communication
Operate and evolve AWS infrastructure (EKS, VPC, IAM, RDS, S3) including dedicated customer environments in their own AWS accounts; drive cost-efficiency initiatives tracked via OpenCost
Own incidents end-to-end: from alert to fix to postmortem
Raise the team's bar through thorough code and architecture reviews, mentor less experienced engineers, and help us assess technical candidates in interviews
Be the infrastructure partner product teams come to when things are unclear or broken
Participate in a shared on-call rotation
Requirements
5+ years of professional engineering experience, including 3+ years in infrastructure, platform, or site reliability engineering
Deep hands-on experience with Kubernetes: cluster operations, workload management, troubleshooting at scale
Helm chart authoring: writing, packaging, and maintaining charts — not just consuming them
GitOps experience with Argo CD or an equivalent tool (Flux, etc.)
Working knowledge of AWS (EKS and supporting services such as IAM, VPC, RDS, S3)
Experience with Infrastructure as Code (Terraform or equivalent)
Proficiency in Go or Python — we write custom operators and internal tooling in both
Experience owning production incidents end-to-end — response, mitigation, and postmortem
Strong English communication skills, with the ability to explain technical decisions to both engineers and non-technical stakeholders
Tech Stack
AWS
Flux
Grafana
Kubernetes
Postgres
Prometheus
Python
Terraform
Go
Benefits
Flexible working arrangements (remote, office, or hybrid)
Modern office in the heart of Hanover for hybrid work
Up to 180 days (6 months) of remote work from abroad
Competitive compensation and 30 days (6 weeks) of paid vacation