Medeloop is a company focused on improving clinical research and healthcare outcomes. They are seeking a Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of their platform, blending DevOps engineering with SRE discipline to maintain system uptime and operational excellence.
Responsibilities:
- Design, implement, and manage scalable, secure, and highly available cloud infrastructure on AWS, defined as infrastructure as code (IaC) using AWS CDK, CloudFormation, or Terraform so that all environments are version-controlled and reproducible
- Architect multi-region and disaster recovery strategies that meet healthcare uptime requirements
- Manage containerized workloads using Docker and Kubernetes, optimizing for cost, performance, and resilience
- Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across all production services
- Build and maintain observability stacks (DataDog, AWS CloudWatch, Sentry) covering metrics, logs, traces, and alerting
- Lead incident response: triage, mitigate, and drive blameless post-incident reviews with actionable follow-ups
- Conduct capacity planning and performance engineering to ensure the platform scales ahead of demand
- Champion error budgets and use them to balance feature velocity with system stability
- Identify, assess, and mitigate operational risks before they become incidents, collaborating with engineering and product teams to evaluate impact and likelihood
- Participate in and help structure an on-call rotation, ensuring clear escalation paths and fair distribution of after-hours coverage
- Build self-service tooling and runbooks that reduce toil and empower development teams to ship independently
- Design and maintain CI/CD pipelines (GitHub Actions) that enable fast, safe, and repeatable deployments
- Automate security scanning (SAST, DAST) within pipelines and collaborate with engineering to remediate findings
- Implement progressive delivery strategies such as canary deployments, blue-green releases, and feature flags
- Use scripting languages (Python, Bash) for automation, troubleshooting, and building reliability tooling
- Track and drive down operational toil, targeting less than 50% of team time spent on repetitive manual work
- Evaluate and manage change risk for production deployments, maintaining change review processes that balance speed with stability
- Ensure infrastructure meets healthcare compliance standards (HIPAA, SOC 2) through policy-as-code, encryption, and access controls
- Manage networking security (VPCs, subnets, security groups, WAFs) and identity/authentication systems (AWS Cognito, Auth0, OAuth2, SSO)
- Conduct regular security reviews, vulnerability assessments, and patching across the infrastructure estate
- Partner closely with product and engineering teams to embed reliability thinking into the software development lifecycle
- Develop and maintain comprehensive documentation for infrastructure, runbooks, and operational playbooks
- Mentor junior engineers on DevOps and SRE best practices, fostering a culture of ownership and continuous improvement
- Stay current with advancements in cloud technologies, DevOps tooling, and SRE methodologies
- Own and evolve internal developer platform tooling, including deployment workflows (GitOps/Flux), bug tracking integrations, and developer self-service portals
Requirements:
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field
- 7+ years of combined experience in DevOps and/or Site Reliability Engineering roles, with at least 2 years in a senior capacity
- Deep proficiency with AWS services
- Deep experience with observability and monitoring platforms such as DataDog, AWS CloudWatch, and Sentry
- Strong experience building and maintaining CI/CD pipelines with GitHub Actions or equivalent tools
- Expertise in infrastructure as code using AWS CDK, CloudFormation, or Terraform
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes)
- Proven track record of defining and operating against SLOs/SLIs and managing incident response processes
- Solid understanding of networking (VPCs, subnets, load balancing, DNS), security, and compliance best practices
- Experience with authentication and authorization systems including AWS Cognito, Auth0, OAuth2, and SSO
- Proactive, self-directed mindset with a bias toward action and taking initiative
- Excellent problem-solving skills and the ability to work independently as well as collaboratively across teams
- Strong communication skills, including the ability to explain complex infrastructure decisions clearly to technical and non-technical stakeholders
- Passion for unsolved challenges in healthcare AI, with the ability to thrive in a fast-paced, multidisciplinary environment and wear multiple hats
Preferred qualifications:
- Multi-cloud experience (AWS, Azure, GCP)
- Familiarity with healthcare data standards, compliance, and protocols such as HIPAA, HL7 FHIR, OMOP, and i2b2
- Experience with chaos engineering practices and tools (e.g., AWS Fault Injection Simulator, Gremlin)
- Prior experience in a healthcare or life sciences company operating under strict regulatory requirements
- Contributions to open-source infrastructure or SRE tooling
- Relevant certifications such as AWS Solutions Architect, Certified Kubernetes Administrator (CKA), or Google SRE certification