Own reliability outcomes for Tango’s cloud platform across production and non-production environments
Design, implement, and operate SLOs/SLIs, error budgets, and reliability reporting
Drive prioritization of reliability work with Engineering and Product
Build and maintain observability foundations: metrics, logging, tracing, dashboards, and alerting
Lead incident response and post-incident reviews
Engineer and evolve CI/CD and release safety practices
Improve infrastructure-as-code and environment consistency
Partner with Security and Compliance to support secure operations
Optimize cloud cost and capacity through right-sizing and performance tuning
Enable engineering teams with reliable internal tooling and automation
Mentor engineers on reliability best practices

8+ years of experience in Site Reliability Engineering, DevOps, or Production Engineering supporting distributed SaaS applications
Strong background in Linux systems engineering
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing)
Proficiency with at least one programming language used for automation (e.g., Python, Go, or Java)
Strong scripting skills
Hands-on experience with cloud infrastructure (AWS, Azure, or GCP)
Deep experience with infrastructure-as-code and configuration management (e.g., Terraform, CloudFormation, Ansible)
Expertise in containerization and orchestration (Docker, Kubernetes)
Strong observability practices with tools such as Prometheus/Grafana, Datadog, New Relic, or ELK/Splunk
Incident management leadership with a focus on root cause analysis
Experience designing and operating CI/CD pipelines and release management practices
Ability to work cross-functionally with Engineering, Product, Support, and Security
Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience
Relevant certifications are a plus (e.g., AWS/Azure/GCP, Kubernetes CKA/CKAD, ITIL)

Senior Site Reliability Engineer

Key skills