Latitude AI develops automated driving technologies, including L3, for Ford vehicles at scale. As a Site Reliability Engineer, you will be responsible for building and running mission-critical systems, ensuring their health, reliability, and performance through monitoring and automation.

Responsibilities:

Build monitoring to ensure our platform is healthy and its reliability is measurable
Build alerting and a set of runbooks to enable faster detection and remediation of platform issues
Debug complex issues that may span multiple components of the stack, and ensure proper fixes are implemented to prevent them from recurring
Participate in an on-call rotation and contribute to a culture of continuous improvement through blameless postmortems
Design and implement components of the platform to enable features that make the work of our customers possible, simpler and more efficient
Build Kubernetes controllers to automate operations

Requirements:

Bachelor's degree in Computer Engineering, Computer Science, Electrical Engineering, Robotics or a related field and 4+ years of relevant experience (or Master's degree and 2+ years of relevant experience, or PhD)
Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems
Hands-on development experience in Go or Python, creating robust software that runs reliably in production
Strong experience scaling and securing services in the cloud (AWS, GCP) or in cloud-native environments
Experience using infrastructure-as-code principles to automate the creation of infrastructure resources (e.g. Terraform, CloudFormation)
Experience authoring and maintaining Kubernetes controllers in Go
Experience running Kubernetes and related core components in a large-scale, production environment
Experience with metrics (e.g. Prometheus), logging (e.g. Elasticsearch, Loki) and tracing (e.g. Jaeger, Tempo) systems
Understanding of engineering design limitations and the ability to guide teams in scaling their services to achieve the desired performance within budget
A focus on increasing service reliability through defining and adhering to SLOs
Strong communication skills and the ability to work effectively in a diverse and distributed team

Senior Site Reliability Engineer
