Todyl is seeking a Senior Site Reliability Engineer to enhance the reliability and security of their platform. The role involves owning the design and delivery of Kubernetes-based platform initiatives, mentoring team members, and collaborating with Architecture and Security on critical platform decisions.

Responsibilities:

Own end-to-end design and delivery of flagship platform initiatives, designing for failure modes, graceful degradation, and the scale we expect 12 months from now rather than just today. The headline 12–18 month deliverable for this role is the golden-path platform: a developer-facing self-service path to production that enforces infrastructure best practices without requiring SRE involvement
Drive security automation at platform scale, including patching cadence, secret rotation, access controls, and CVE remediation, as ongoing operational practices rather than reactive sprints
Partner with product engineering teams at the architecture phase of high-stakes systems, helping shape the design rather than reviewing it the week before launch
Operate as a peer to Architecture and Security on platform decisions that affect how Todyl runs production over the next 2–3 years
Mentor less-tenured SREs through pairing, code review, and design partnership, with measurable improvement in their autonomy on design and incident work
Contribute to one or more SRE practice improvements adopted by the team: incident commander discipline, postmortem maturity, change management standards, on-call quality, or design review cadence
Build and operate the production platform: Kubernetes with Helm and ArgoCD, CI/CD pipelines, infrastructure-as-code (Terraform, Salt), observability (Grafana, Prometheus), secrets management, and AWS (including EKS). We're shifting from reactive to proactive, and we'd rather build guardrails than approve every deploy
Drive cost visibility and efficiency across our cloud footprint, including AWS resource tagging, COGS attribution, and right-sizing across the platform, and quantify the business impact in terms that leadership can act on
Participate in a weekly on-call rotation, resolve most issues independently, and own postmortems and follow-up actions for the incidents you respond to
Plan and estimate honestly, break multi-quarter work into smaller increments, communicate delays early, and write tests for the automation you build because it runs in production
Treat code review as a quality lever, not a checkbox. Catch missing tests, push back on tech debt, and watch dashboards and logs to verify your own changes after they ship
When something you've built is mature and stable, look for ways to hand it off or make it self-managing rather than holding onto it forever

Requirements:

5+ years of Site Reliability Engineering or platform-engineering experience
Owned major platform initiatives end-to-end, from design through stabilization
Recognized as the go-to person in their technical domain
Creates design documentation that teams reference long after the work ships
Mentors less-tenured engineers through pairing, design partnership, and example
Sees SRE as a service to the engineering organization, not a gate
Builds trust with developers and makes other teams' jobs easier
Treats security as a normal part of operating the platform
Demonstrated experience designing systems with security as a first-class concern
Energized by eliminating toil and automating repetitive work
Actively uses AI tooling in day-to-day work
Influences how the team adopts AI patterns safely
Can communicate technical decisions clearly to engineers, engineering leadership, and non-engineering stakeholders
Comfortable saying no or pushing back constructively when it matters
Familiarity with Kubernetes (EKS), Helm, ArgoCD, containerization
Familiarity with AWS (including EKS, ECR, and IAM) and cloud-native infrastructure
Familiarity with Infrastructure-as-code (Terraform, Salt)
Familiarity with CI/CD pipelines and GitOps (GitHub Actions, ArgoCD)
Familiarity with observability stack (Grafana, Prometheus)
Familiarity with Linux at scale
Familiarity with Python or Bash for tooling
Familiarity with networking fundamentals
Familiarity with security-conscious infrastructure design (patching, secrets management, access controls)
Familiarity with Git and modern development workflows
