Own and evolve our AWS infrastructure across compute, networking, storage, and managed services
Design and maintain infrastructure that supports high availability, predictable performance, and financial correctness
Lead platform-level architectural decisions, including service migrations and runtime changes (e.g., Redis → Valkey, EKS → ECS/Fargate)
Ensure infrastructure choices align with reliability, cost, and operational simplicity, not just trend adoption
Design and maintain deployment pipelines that are safe, repeatable, and observable
Own system reliability through capacity planning, failure modeling, and controlled change management
Lead incident response and root-cause analysis for infrastructure-level failures
Participate in on-call rotations and continuously improve operational ergonomics
Build and maintain strong observability across infrastructure and services (metrics, logs, tracing, alerting)
Ensure secure configuration of AWS resources, IAM policies, secrets management, and network boundaries
Proactively identify infrastructure risks related to scale, cost, or security and address them before they become incidents
Partner closely with application engineers to ensure platform constraints and capabilities are well understood
Drive infrastructure changes through hands-on implementation
Establish standards and best practices for infrastructure, deployment, and operations as the team grows
Mentor other platform engineers and help raise the overall operational maturity of the organization
Requirements
8+ years of experience building and operating production infrastructure in cloud environments
Deep experience with AWS core services (EC2, ECS/EKS, VPC, IAM, RDS, ElastiCache, ALB/NLB, CloudWatch, etc.)
Strong understanding of containerized workloads and orchestration tradeoffs
Proven experience designing systems for high availability, fault tolerance, and controlled failure
Hands-on experience with infrastructure as code (Terraform, CloudFormation, or equivalent)
Demonstrated ability to plan and execute infrastructure migrations safely
Experience debugging real production incidents involving networking, scaling, or service degradation
Tech Stack
AWS
Cloud
EC2
Redis
Terraform
Benefits
Generous PTO and company holiday policy, plus company-paid short-term disability
100% employer-covered health and dental insurance for our direct employees (a set plan is covered, with higher-tier healthcare coverage available at the employee’s additional cost; dependent coverage is at the employee’s cost); vision plan available at the employee’s additional cost