Latitude AI develops automated driving technologies, including Level 3 (L3) systems, for Ford vehicles at scale. As a Site Reliability Engineer, you will be responsible for building and running mission-critical systems, ensuring their health, reliability, and performance through monitoring and automation.
Responsibilities:
- Build monitoring to ensure our platform is healthy and its reliability measurable
- Build alerting and a set of runbooks to enable faster detection and remediation of platform issues
- Debug complex issues that may span multiple components of the stack, and ensure proper fixes are implemented to prevent them from recurring
- Participate in an on-call rotation and contribute to a culture of continuous improvement through blameless postmortems
- Design and implement components of the platform to enable features that make the work of our customers possible, simpler and more efficient
- Build Kubernetes controllers to automate operations
Requirements:
- Bachelor's degree in Computer Engineering, Computer Science, Electrical Engineering, Robotics or a related field and 4+ years of relevant experience (or Master's degree and 2+ years of relevant experience, or PhD)
- Fundamental understanding of Linux operating system internals, TCP/IP networking, and storage subsystems
- Hands-on development experience in Go or Python, creating robust software that runs reliably in production
- Strong experience scaling and securing services in the cloud (AWS, GCP) or cloud native environments
- Experience using infrastructure-as-code principles to automate the creation of infrastructure resources (e.g. Terraform, CloudFormation)
- Experience authoring and maintaining Kubernetes controllers in Go
- Experience running Kubernetes and related core components in a large-scale, production environment
- Experience with metrics (e.g. Prometheus), logging (e.g. Elasticsearch, Loki) and tracing (e.g. Jaeger, Tempo) systems
- Understanding of engineering design limitations and the ability to guide teams in scaling their services to achieve the desired performance within budget
- A focus on increasing service reliability through defining and adhering to SLOs
- Strong communication skills and the ability to work effectively in a diverse and distributed team