Improving production reliability and system resilience within an SRE-scoped team
Championing high standards of work and industry best practices
Communicating with teams and stakeholders at all stages
Bringing fresh ideas to the table and encouraging others
Diving into complex technical problems with a can-do attitude
Working across numerous technologies in a fast-changing industry
Participating in on-call rotation, incident response, and blameless post-incident reviews
Writing code, handling alerts, improving solutions, and supporting others
Playing a crucial role in the success of your company and team

5+ years administering Linux systems and related infrastructure in production environments
A collaborative SRE mindset, with familiarity around SLIs/SLOs/SLAs, error budgets, blast radius, and blameless postmortems
A focus on automation, reducing toil, and preventing problem recurrence
A track record of writing runbooks that work for the broader team, not just yourself
Strong Kubernetes and broader ecosystem fundamentals
Cloud infrastructure experience; AWS strongly preferred and bare-metal is a bonus
Strong tool development skills; Bash, plus either Python or Go preferred, or a similar language
Infrastructure-as-code tooling experience; Terraform preferred
CI/CD and version control, GitHub preferred
Database experience; one of Postgres, Cassandra, or ClickHouse preferred
Experience operating a production observability stack (metrics, logs, traces), with an eye for signal over noise
Comfortable working on live production infrastructure, with strong troubleshooting instincts and ownership of incident response
A history of continual professional development
A self-directed style suited to an async, globally distributed team, and a willingness to pick up adjacent work when the situation calls for it

Senior Site Reliability Engineer

Key skills