Implement the technical roadmap for Infrastructure as Code (IaC), CI/CD evolution, and cloud-native architecture to support TrueML’s scaling needs.
Design, develop, and maintain self-service internal platforms that reduce developer cognitive load, enabling feature teams to deploy and manage services faster and with minimal friction.
Act as a core steward for cloud spend (AWS), proactively identifying and driving cost-optimization initiatives across our infrastructure.
Build and maintain infrastructure architecture that supports strict High Availability (HA) requirements and robust Disaster Recovery (DR) protocols across multiple regions.
Implement and evolve comprehensive monitoring, logging, and distributed tracing systems, leveraging AIOps to move from reactive to predictive system maintenance.
Requirements
Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
6+ years of experience in DevOps, Site Reliability Engineering (SRE), or Software Engineering, working within high-performing senior engineering teams.
Expert-level mastery of AWS and hands-on experience managing multi-region, high-availability deployments.
Advanced experience with Kubernetes (K8s) and Docker, including cluster management, networking, and scaling in production environments.
High proficiency in Terraform to drive consistency and automation across all infrastructure layers (experience with Atlantis is a plus).
Deep experience designing and maintaining complex CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) and mastery of scripting languages like Python, Go, or Bash.
Hands-on experience with modern monitoring, observability, and tracing stacks (Datadog, Observe) and a firm grasp of SRE principles (SLIs/SLOs/Error Budgets).
Experience acting as an Incident Commander or critical responder for high-severity outages.
Experience integrating AI-assisted productivity tools (Cline, GitHub Copilot) into your engineering workflow to accelerate delivery, troubleshooting, and system monitoring.
Tech Stack
AWS
Cloud
Docker
Jenkins
Kubernetes
Python
Terraform
Go
Benefits
Flexible vacation
Medical/dental/vision insurance
Traditional/Roth retirement savings options
Company-paid disability and life insurance
Flexible Spending Account & Limited FSA
Family-friendly parental leave, volunteer and voting time off
On-demand wellness platform access for you and five friends or family members
PerkSpot discount program for 900+ merchants nationwide