Medeloop is a company focused on improving clinical research and healthcare outcomes. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of their platform, blending DevOps engineering with SRE discipline to maintain system uptime and operational excellence.

Responsibilities:

Design, implement, and manage scalable, secure, and highly available cloud infrastructure on AWS, defined as infrastructure as code (IaC) with AWS CDK, CloudFormation, or Terraform so that all environments are version-controlled and reproducible
Architect multi-region and disaster recovery strategies that meet healthcare uptime requirements
Manage containerized workloads using Docker and Kubernetes, optimizing for cost, performance, and resilience
Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across all production services
Build and maintain observability stacks (DataDog, AWS CloudWatch, Sentry) covering metrics, logs, traces, and alerting
Lead incident response: triage, mitigate, and drive blameless post-incident reviews with actionable follow-ups
Conduct capacity planning and performance engineering to ensure the platform scales ahead of demand
Champion error budgets and use them to balance feature velocity with system stability
Identify, assess, and mitigate operational risks before they become incidents, collaborating with engineering and product teams to evaluate impact and likelihood
Participate in and help structure an on-call rotation, ensuring clear escalation paths and fair distribution of after-hours coverage
Build self-service tooling and runbooks that reduce toil and empower development teams to ship independently
Design and maintain CI/CD pipelines (GitHub Actions) that enable fast, safe, and repeatable deployments
Automate security scanning (SAST, DAST) within pipelines and collaborate with engineering to remediate findings
Implement progressive delivery strategies such as canary deployments, blue-green releases, and feature flags
Write scripts in Python and Bash for automation, troubleshooting, and building reliability tooling
Track and drive down operational toil, targeting less than 50% of team time spent on repetitive manual work
Evaluate and manage change risk for production deployments, maintaining change review processes that balance speed with stability
Ensure infrastructure meets healthcare compliance standards (HIPAA, SOC 2) through policy-as-code, encryption, and access controls
Manage networking security (VPCs, subnets, security groups, WAFs) and identity/authentication systems (AWS Cognito, Auth0, OAuth2, SSO)
Conduct regular security reviews, vulnerability assessments, and patching across the infrastructure estate
Partner closely with product and engineering teams to embed reliability thinking into the software development lifecycle
Develop and maintain comprehensive documentation for infrastructure, runbooks, and operational playbooks
Mentor junior engineers on DevOps and SRE best practices, fostering a culture of ownership and continuous improvement
Stay current with advancements in cloud technologies, DevOps tooling, and SRE methodologies
Own and evolve internal developer platform tooling, including deployment workflows (GitOps/Flux), bug tracking integrations, and developer self-service portals

Requirements:

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
7+ years of combined experience in DevOps and/or Site Reliability Engineering roles, with at least 2 years in a senior capacity
Deep proficiency with AWS services
Deep experience with observability and monitoring platforms such as DataDog, AWS CloudWatch, and Sentry
Strong experience building and maintaining CI/CD pipelines with GitHub Actions or equivalent tools
Expertise in infrastructure as code using AWS CDK, CloudFormation, or Terraform
Hands-on experience with containerization (Docker) and orchestration (Kubernetes)
Proven track record of defining and operating against SLOs/SLIs and managing incident response processes
Solid understanding of networking (VPCs, subnets, load balancing, DNS), security, and compliance best practices
Experience with authentication and authorization systems including AWS Cognito, Auth0, OAuth2, and SSO
Proactive, self-directed mindset with a bias toward action and taking initiative
Excellent problem-solving skills and the ability to work independently as well as collaboratively across teams
Strong communication skills, with the ability to explain complex infrastructure decisions clearly to technical and non-technical stakeholders
Passion for unsolved challenges in healthcare AI, with the ability to thrive in a fast-paced, multidisciplinary environment and wear multiple hats
Multi-cloud experience (AWS, Azure, GCP)
Familiarity with healthcare data standards, compliance, and protocols such as HIPAA, HL7 FHIR, OMOP, and i2b2
Experience with chaos engineering practices and tools (e.g., AWS Fault Injection Simulator, Gremlin)
Prior experience in a healthcare or life sciences company operating under strict regulatory requirements
Contributions to open-source infrastructure or SRE tooling
Relevant certifications such as AWS Solutions Architect, Certified Kubernetes Administrator (CKA), or Google SRE certification
