Ellucian powers innovation for higher education, serving over 21 million students globally. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and cost-efficiency of its production systems, with a focus on DevOps practices and incident management.
Responsibilities:
- Own and improve system reliability, availability, and performance for production environments
- Design, implement, and manage monitoring, alerting, and observability using Datadog (required)
- Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
- Perform detailed root cause analysis (RCA) and drive permanent resolutions
- Partner with engineering and DevOps teams to build scalable, resilient infrastructure
- Automate operational processes to improve efficiency and reduce risk
- Analyze and optimize infrastructure and application costs
- Define and manage SLIs/SLOs to meet reliability targets
- Continuously improve deployment, monitoring, and operational practices
Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
- Strong, hands-on expertise with Datadog (APM, logs, metrics, dashboards, alerting)
- Experience with cloud platforms (AWS, Azure, or GCP)
- Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
- Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
- Experience with containers and orchestration (Docker, Kubernetes)
- Scripting or programming experience (Python, Bash, or similar)
- Proven ability to analyze and optimize cloud costs
- Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
- Familiarity with cloud security and compliance best practices
- Experience supporting high-availability, customer-facing systems
- Strong collaboration and communication skills