Renaissance Learning is a global leader in pre-K–12 education technology, providing solutions that help educators enhance student learning experiences. They are seeking an experienced Senior Site Reliability Engineer to join their Engineering Enablement group. The role focuses on improving application and infrastructure reliability and security while supporting the company's SaaS platform, which is used by millions of students.
Responsibilities:
- Work with engineering, security & governance teams to improve the observability, reliability, resiliency, and auditability of our systems and to minimize or prevent downtime
- Contribute to infrastructure-as-code using Terraform & CloudFormation
- Support CI/CD pipelines that ensure the prompt release of high-quality software
- Collaborate with cross-functional teams to resolve infrastructure issues
- Perform Disaster Recovery exercises on our products
- Explore and integrate AI tooling into the SRE workflows
- Be part of an on-call rotation & support off-hours incidents & deployments
- Give constructive feedback through coaching, even without direct reports
Requirements:
- 5+ years of experience focused on SRE
- Experience in managing & monitoring containerized cloud environments in production, preferably AWS EKS
- Experience with IaC, configuration management, and orchestration tools such as Terraform, Docker, and Ansible
- Hands-on experience with a programming or scripting language such as .NET/Java, Python, or JavaScript
- On-call experience & willingness to be on call outside working hours and on weekends
- Experience working in an agile environment
- BS in Information Systems or Computer Science, related field experience, or both
- Managing Kubernetes clusters (EKS) at scale using Helm
- Setting up GitLab & GitHub pipelines & workflows
- Experience setting up monitoring, logging, alerting & observability in tools such as New Relic, Datadog, Grafana, CloudWatch, and PagerDuty
- Experience with Teleport, HashiCorp Boundary, etc.
- Experience with Redshift and OpenSearch/zero-ETL
- Experience running Disaster Recovery exercises
- Implementing service level objectives (SLOs/SLIs/SLAs) & error budgets
- Experience using Claude Code for agentic coding and an agentic SDLC, and enabling or rolling out agentic DX