Role Overview

Design and implement multi-region, multi-AZ AWS architectures that meet RTO/RPO targets
Engineer active-active and active-passive failover patterns using Route 53, Global Accelerator, and CloudFront
Build automated DR runbooks and playbooks using AWS Systems Manager Automation and Step Functions
Implement chaos engineering practices using AWS Fault Injection Simulator (FIS) to validate resiliency
Architect cross-region replication strategies for S3, DynamoDB Global Tables, RDS, and Aurora Global Database
Review containerized workloads running on Kubernetes, ensuring resilience through self-healing, auto-scaling, and multi-cluster or multi-region deployments
Administer AWS Backup across all services (EC2, EBS, RDS, EFS, FSx, DynamoDB, Aurora) with policy-based automation
Design immutable backup vaults and cross-account/cross-region backup replication pipelines
Develop and automate data recovery testing procedures, ensuring integrity and meeting defined SLAs
Implement point-in-time recovery (PITR) for databases and storage; validate via regular restore drills
Maintain Business Continuity Plans (BCPs) and Disaster Recovery (DR) strategies, including tracking Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets
Author and maintain Terraform/CloudFormation templates for all BCP/DR infrastructure components
Automate DR testing pipelines through CI/CD (CodePipeline, CodeBuild, GitHub Actions)
Write Python/Bash/PowerShell scripts to orchestrate failover, failback, and health-check workflows
Manage infrastructure state in AWS Control Tower and implement Landing Zone DR patterns
Build CloudWatch dashboards, alarms, and composite alarms for availability and DR-readiness indicators
Integrate AWS Health and Personal Health Dashboard events into PagerDuty/Opsgenie alerting workflows
Participate in on-call rotations and lead DR incident response; conduct post-incident reviews (PIRs)
Develop and maintain runbooks for AWS service degradations, regional outages, and data corruption events
Conduct regular BCP/DR tabletop exercises and full failover simulations to validate recovery procedures and improve organizational readiness; document results and action items
Ensure DR controls meet SOC 2, ISO 22301, NIST 800-53, and HIPAA/PCI requirements as applicable
Maintain current and accurate DR documentation: BIAs, BCPs, DRP runbooks, and recovery evidence
Collaborate with audit and compliance teams to provide DR evidence and remediation tracking

Requirements

5+ years in cloud infrastructure, SRE, or IT disaster recovery engineering roles
3+ years of hands-on AWS experience in production environments at scale
Proven delivery of multi-region DR architectures with defined and tested RTO/RPO targets
Expert-level proficiency with core AWS resilience services
Strong scripting skills: Python, Bash, or PowerShell for automation and orchestration
Experience with Infrastructure as Code: Terraform and/or AWS CloudFormation
Solid understanding of networking fundamentals: VPC, TGW, Direct Connect, VPN, DNS failover
Excellent written and verbal communication; able to produce executive-level DR reports

Tech Stack

AWS
Cloud
DNS
DynamoDB
EC2
Kubernetes
Python
Terraform

Benefits

Competitive salary
Remote work options

Cloud Reliability Engineer – Recovery
