Own reliability across the full path from vehicle to customer: AUV onboard compute (Jetson-class modules, ROS 2), topside/operator systems, cloud data pipelines, and the platform that delivers data products.
Build and extend infrastructure automation (provisioning, configuration management, deployment, and self-recovery) so that routine field operations and pipeline runs require minimal manual intervention.
Design and improve observability: metrics, logging, tracing, and alerting that give both robotics and data teams early, actionable signal across vehicle fleets and cloud services.
Drive down on-call burden by identifying and eliminating single points of failure, writing runbooks, and automating the manual steps that currently require tribal knowledge.
Participate in a shared on-call rotation covering both robotics-side and cloud-side incidents in 12-hour shifts spanning European and East Coast business hours; lead and contribute to blameless post-incident reviews.
Define and track reliability targets (availability, data yield, recovery time) tied to continuous-operations goals, and partner with robotics and data teams to meet them.
Manage cloud infrastructure on AWS (compute, storage, networking, IaC, cost, and security posture) for data processing and platform workloads.
Improve fleet- and vehicle-level configuration management, deployment safety, and rollback so changes reach the field reliably and predictably.
Requirements
5+ years in an SRE, DevOps, or infrastructure engineering role running production systems with real uptime and on-call responsibilities, including senior-level ownership of reliability outcomes.
Experience implementing a scalable incident management and operational excellence mechanism that treats operators as customers: building processes and tooling that serve the people running operations day to day, not just the engineering team.
Strong automation instincts: comfortable scripting and building tooling in Python and/or Go, plus Bash, and using infrastructure-as-code (Terraform or equivalent).
Hands-on AWS experience across compute, storage, networking, and IAM, plus containerization and orchestration (Docker, Kubernetes or similar).
Working knowledge of Linux internals, networking, and observability tooling (Prometheus/Grafana or equivalents).
Comfort operating across environments that aren’t just cloud: embedded or edge compute, intermittent connectivity, and physical systems that fail in messy ways.
A reliability mindset: you instrument before you guess, you automate the second time you do something manually, and you write things down so the next person or the system can handle it without you.
Strong ownership and communication in a small, fast-moving team.