CopeCart is a company that helps entrepreneurs sell digital products professionally. It is seeking an SRE / DevOps Engineer to improve deployment processes, strengthen system reliability, and introduce engineering practices that make operations and debugging more efficient.
Responsibilities:
- Improve the deployment experience for our new system
- Reduce operational bottlenecks that slow down engineering and feature delivery
- Strengthen our AWS production setup, currently based on ECS and containers
- Improve our GitHub Actions CI/CD workflows
- Work with Terraform / OpenTofu to make infrastructure safer, clearer, and easier to change
- Improve production debugging across AWS, containers, networking, Linux, and application-level issues
- Improve our observability across the three pillars: metrics, logs, and traces
- Create or improve runbooks, repo instructions, service maps, deployment guides, and operational documentation
- Introduce agentic engineering workflows that help engineers diagnose issues, propose fixes, and validate changes before they reach production
- Design safe guardrails for agent-assisted work: permissions, approval gates, auditability, sandboxing, rollback procedures, and human review
Requirements:
- Strong hands-on experience in SRE, DevOps, platform engineering, infrastructure engineering, or production operations
- Production AWS experience
- Experience with ECS and containerized services
- Experience with GitHub Actions
- Experience with Terraform and/or OpenTofu
- Experience with CI/CD, Linux, networking, and production debugging
- Strong observability skills across metrics, logs, and traces
- Ability to write production-quality code or scripts in TypeScript and Bash
- Ability to read and modify infrastructure, CI/CD, and application code
- Good judgment around production risk, automation, permissions, and rollback
Nice to have:
- Kubernetes experience
- Ruby experience
- Experience building internal developer platforms or self-service infrastructure
- Experience with coding agents, AI-assisted engineering workflows, repo-level agent instructions, evals, or agent guardrails
- Experience improving incident response, deploy safety, or on-call quality