About this role
Role Overview
Own Reliability Engineering: Define and drive reliability standards for cybersecurity platforms — establishing SLIs, SLOs, and error budgets; identifying systemic weaknesses; and engineering solutions that improve uptime, latency, and fault tolerance.
Write Code and Build Automation: Develop production-quality software in Python (required) and Golang (preferred) to automate operational workflows, build internal tooling, eliminate toil, and improve the day-to-day velocity of security engineering teams.
Partner with Developers and Infrastructure Engineers: Work closely with software engineers and infrastructure teams to review system designs for reliability, provide feedback on deployability and operability, and ensure that what gets built can be confidently operated and maintained in production.
Drive Observability: Instrument security platforms and pipelines with meaningful metrics, logs, and traces; build dashboards and alerting that give the team real operational visibility using tools like Grafana, Prometheus, and similar observability stacks.
Lead Incident Response and Post-Mortems: Be a first responder for production issues affecting security systems; drive structured incident response, coordinate resolution, and produce blameless post-mortems with actionable follow-through to prevent recurrence.
Build and Maintain CI/CD & Infrastructure as Code: Develop and own deployment pipelines (GitHub Actions, Jenkins) and infrastructure automation (Terraform, Ansible) that enable safe, repeatable, and fast delivery of security platform changes.
Improve Security Platform Performance: Profile, benchmark, and tune security services, detection pipelines, and data ingestion workflows, identifying bottlenecks and shipping targeted improvements that resolve them.
Contribute Actively in Agile: Be a high-output contributor in a fast-moving agile squad: write code every sprint, engage in design and architecture reviews, participate in code reviews, and help the team maintain quality and momentum.
Apply Object-Oriented Engineering Fundamentals: Write clean, testable, and maintainable code using strong OOP principles and SOLID patterns — because operability starts with code quality.
Explore AI/ML & LLMs (Plus): Apply knowledge of AI/ML development, large language models, or generative AI to identify practical opportunities in anomaly detection, alert triage automation, or operational intelligence.
Share Knowledge: Contribute to technical discussions, participate in code reviews, and share operational insights with developers and infrastructure partners — not as a formal mandate, but as a natural part of working on a great engineering team.
Requirements
8+ years of professional engineering experience spanning software development and site reliability / platform engineering.
5+ years in SRE, DevOps, or platform engineering roles with a strong software development component.
4+ years working in cloud-native environments (AWS, Azure, or GCP).
3+ years delivering within agile teams in a high-velocity environment.
Python Expertise (Required): Demonstrated production-level Python development — used for automation, tooling, and operational software.
Golang Proficiency (Preferred): Hands-on Golang experience, especially in systems tooling, infrastructure software, or performance-sensitive services.
SRE / Platform Engineering Foundation: Proven background in site reliability engineering, platform engineering, or DevOps with a strong software development component — not purely operations.
Object-Oriented Design: Applied knowledge of OOP design patterns and SOLID principles demonstrated through production code and tooling.
Observability & Monitoring: Hands-on experience with Grafana, Prometheus, or equivalent; able to design meaningful SLIs/SLOs, build useful dashboards, and write alerts that reduce noise rather than add to it.
Incident Response: Experience leading structured incident response, conducting blameless post-mortems, and driving systemic follow-through on reliability improvements.
CI/CD & Infrastructure as Code: Proficiency with CI/CD pipelines (GitHub Actions, Jenkins) and IaC tooling (Terraform, Ansible); experience enabling fast, safe, and repeatable deployments.
Cloud Proficiency: Hands-on experience with AWS, Azure, or GCP; familiarity with cloud-native reliability and infrastructure patterns.
Agile Team Contributor: Comfortable delivering consistently within a high-velocity agile team; strong bias toward iterative delivery and fast feedback.
Security Domain Familiarity (Preferred): Exposure to security platforms, SIEMs, EDRs, detection pipelines, or vulnerability management tooling; DevSecOps experience is a strong plus.
AI/ML & LLM Experience (Plus): Working knowledge of AI/ML development or applied experience with LLMs and generative AI, particularly for operational intelligence or anomaly detection use cases.
Communication: Able to communicate clearly with both developers and infrastructure engineers and to bridge technical disciplines without jargon overload.
Tech Stack
Ansible
AWS
Azure
Cloud
Cyber Security
Google Cloud Platform
Grafana
Jenkins
Prometheus
Python
Terraform
Go
Benefits
Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being.
Financial benefits including market-competitive compensation.
A 401(k) savings plan, vested from day one, with a 6% match.
Performance and recognition-based incentives.
Tuition assistance.
Access to additional benefits like mental healthcare as well as fertility and adoption assistance.
Supports flexibility
We provide workplace flexibility, as well as our GEICO Flex program, which lets you work from anywhere in the US for up to four weeks per year.
Staff Cyber Site Reliability Engineer – SRE at GEICO | JobVerse