Ethos is a well-funded Series A startup focused on transforming training so that it aligns with strategic business outcomes, serving more than 150 enterprise customers. The company is looking for a Senior/Staff DevOps Engineer to lead the deployment and operationalization of its SaaS products, enhance DevOps practices, and shape platform engineering strategy, particularly around AI tooling and complex data pipelines.
Responsibilities:
- Design & Operate the Platform: Architect, implement, and run secure, scalable, multi-tenant infrastructure (infrastructure as code, immutable artifacts, GitOps)
- AI-Augmented Operations & Platform Work: Use AI coding and agentic tools (Claude Code, Cursor, Copilot, MCP-based ops agents) for IaC authoring, pipeline development, log/trace analysis, postmortem drafting, and toil reduction; build and improve agentic workflows for the team
- CI/CD & Release Engineering: Build and harden pipelines (build, test, scan, sign, promote, deploy) for multi-environment delivery, including disconnected/air-gapped workflows
- Observability & Reliability: Establish SLOs; instrument systems for metrics/logs/traces; drive incident response and postmortems; reduce MTTR and change failure rate
- Security & Compliance by Design: Integrate supply-chain security (SBOMs, signing, provenance), secrets management, and baseline hardening (CIS/STIG-aligned)
- Cost & Performance: Optimize infrastructure spend and performance (capacity planning, autoscaling, right-sizing, storage/egress strategies)
- Technical Leadership: Lead design reviews, author RFCs, mentor engineers, and raise the quality bar for platform changes
- Gov/Constrained Deployments: Support IL-4/IL-5-aligned patterns, RMF documentation support, and offline artifact promotion processes where needed
- (Staff) Strategy & Standards: Define platform roadmaps, establish consistent deployment and infrastructure patterns, and guide cross-team adoption of best practices
Requirements:
- 5+ years building and operating cloud platforms; 3+ years deploying SaaS in production
- Strong with Terraform, Helm/Kustomize, and containers (Docker, Kubernetes)
- Deep AWS experience (e.g., VPC, EKS, EC2, S3, RDS, ECR, IAM/KMS, Route 53; CloudFront desirable)
- CI/CD expertise (e.g., GitHub Actions, CircleCI, or Argo Workflows) and GitOps (Argo CD or Flux)
- Observability across metrics, logs, and traces (e.g., Prometheus/Grafana, OpenTelemetry, ELK)
- Proven track record in IaC, scalable system design, and quality tooling (automated tests, canaries/blue-green, feature flags)
- Excellent communication; comfortable partnering with Product, Security, and Customer teams
- Thrives in a startup environment: ownership, autonomy, and pragmatic delivery
- Active, fluent use of AI development/operations tools as part of your daily workflow
- Secret clearance, or eligibility for and willingness to obtain one
- Supply-chain security (SBOMs, SLSA concepts, image signing, provenance) and vulnerability management (e.g., Trivy/Grype, Snyk; Chainguard experience a plus)
- Experience identifying/mitigating CVEs and setting policy thresholds
- Background with DoD/regulated customers; familiarity with IL-4/IL-5, Platform One patterns, and RMF documentation workflows
- Knowledge of STIG/CIS hardening, air-gapped architectures, and offline update mechanisms
- Experience operating AI/ML workloads in production (GPU scheduling, model artifact management, inference serving, vector DBs, queuing/streaming) or building agentic ops workflows / MCP-based integrations (alert triage, runbook automation, IaC review agents)