Todyl is seeking a Senior Site Reliability Engineer to enhance the reliability and security of their platform. The role involves owning the design and delivery of Kubernetes-based platform initiatives, mentoring team members, and collaborating with Architecture and Security on critical platform decisions.
Responsibilities:
- Own end-to-end design and delivery of flagship platform initiatives, designing for failure modes, graceful degradation, and the scale we expect 12 months from now rather than just today. The headline 12–18 month deliverable for this role is the golden-path platform: a developer-facing self-service path to production that enforces infrastructure best practices without requiring SRE involvement
- Drive security automation at platform scale, including patching cadence, secret rotation, access controls, and CVE remediation, as ongoing operational practices rather than reactive sprints
- Partner with product engineering teams at the architecture phase of high-stakes systems, helping shape the design rather than reviewing it the week before launch
- Operate as a peer to Architecture and Security on platform decisions that affect how Todyl runs production over the next 2–3 years
- Mentor less-tenured SREs through pairing, code review, and design partnership, with measurable improvement in their autonomy on design and incident work
- Contribute to one or more SRE practice improvements adopted by the team: incident commander discipline, postmortem maturity, change management standards, on-call quality, or design review cadence
- Build and operate the production platform: Kubernetes with Helm and ArgoCD, CI/CD pipelines, infrastructure-as-code (Terraform, Salt), observability (Grafana, Prometheus), secrets management, and AWS (including EKS). We're shifting from reactive to proactive, and we'd rather build guardrails than approve every deploy
- Drive cost visibility and efficiency across our cloud footprint, including AWS resource tagging, COGS attribution, and right-sizing across the platform, and quantify the business impact in terms that leadership can act on
- Participate in a weekly on-call rotation, resolve most issues independently, and own postmortems and follow-up actions for the incidents you respond to
- Plan and estimate honestly, break multi-quarter work into smaller increments, communicate delays early, and write tests for the automation you build because it runs in production
- Treat code review as a quality lever, not a checkbox. Catch missing tests, push back on tech debt, and watch dashboards and logs to verify your own changes after they ship
- Look for ways to hand off or make self-managing anything you've built once it is mature and stable, rather than holding onto it forever
Requirements:
- 5+ years of Site Reliability Engineering or platform-engineering experience
- Owned major platform initiatives end-to-end, from design through stabilization
- Recognized as the go-to person in their technical domain
- Creates design documentation that teams reference long after the work ships
- Mentors less-tenured engineers through pairing, design partnership, and example
- Sees SRE as a service to the engineering organization, not a gate
- Builds trust with developers and makes other teams' jobs easier
- Treats security as a normal part of operating the platform
- Demonstrated experience designing systems with security as a first-class concern
- Energized by eliminating toil and automating repetitive work
- Actively uses AI tooling in day-to-day work
- Influences how the team adopts AI patterns safely
- Can communicate technical decisions clearly to engineers, engineering leadership, and non-engineering stakeholders
- Comfortable saying no or pushing back constructively when it matters
- Familiarity with Kubernetes (EKS), Helm, ArgoCD, containerization
- Familiarity with AWS (including EKS, ECR, and IAM) and cloud-native infrastructure
- Familiarity with Infrastructure-as-code (Terraform, Salt)
- Familiarity with CI/CD pipelines and GitOps (GitHub Actions, ArgoCD)
- Familiarity with observability stack (Grafana, Prometheus)
- Familiarity with Linux at scale
- Familiarity with Python or Bash for tooling
- Familiarity with networking fundamentals
- Familiarity with security-conscious infrastructure design (patching, secrets management, access controls)
- Familiarity with Git and modern development workflows