Design and maintain highly available, scalable, and secure infrastructure across AWS and hybrid environments
Build and manage cloud platforms and services such as EKS, EC2, RDS/Aurora, Lambda, API Gateway, CloudFront, WAF, and IAM
Develop and maintain Infrastructure as Code using Terraform and automation tools to standardize deployments and reduce manual tasks
Set up and improve monitoring, logging, alerting, dashboards, and runbooks to support strong observability and faster incident response
Define and track service reliability targets using SLIs, SLOs, and error budgets to improve uptime and system performance
Lead incident analysis, root cause investigations, post-incident reviews, and disaster recovery planning and testing
Support CI/CD pipelines and work with development teams to improve deployment processes, system operability, and release reliability
Partner with security, infrastructure, and product teams to implement best practices in access control, secrets management, compliance, and cost optimization
Requirements
At least 6 years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, Platform Engineering, or Systems Engineering
Strong hands-on experience with AWS and cloud-native infrastructure in production environments
Solid experience with Terraform and Infrastructure as Code for provisioning and managing infrastructure
Strong background in Linux/Unix systems administration and production support
Experience with automation or scripting using Python, Java, or similar languages
Good understanding of networking, system security, distributed systems, and cloud architecture best practices
Experience working with Kubernetes or EKS, including deployment, scaling, and platform operations
Proven ability to support business-critical or high-availability platforms with a focus on reliability, performance, and recovery