Ellucian powers innovation for higher education, serving over 21 million students globally. The company is seeking a Senior Site Reliability Engineer to ensure the reliability, performance, and cost-efficiency of its production systems, with a focus on DevOps practices and incident management.

Responsibilities:

Own and improve system reliability, availability, and performance for production environments
Design, implement, and manage monitoring, alerting, and observability using Datadog (required)
Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
Perform detailed root cause analysis (RCA) and drive permanent resolutions
Partner with engineering and DevOps teams to build scalable, resilient infrastructure
Automate operational processes to improve efficiency and reduce risk
Analyze and optimize infrastructure and application costs
Define and manage SLIs/SLOs to meet reliability targets
Continuously improve deployment, monitoring, and operational practices

Requirements:

5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
Strong, hands-on expertise with Datadog (APM, logs, metrics, dashboards, alerting)
Experience with cloud platforms (AWS, Azure, or GCP)
Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
Experience with containers and orchestration (Docker, Kubernetes)
Scripting or programming experience (Python, Bash, or similar)
Proven ability to analyze and optimize cloud costs
Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
Familiarity with cloud security and compliance best practices
Experience supporting high-availability, customer-facing systems
Strong collaboration and communication skills
