Optum is focused on transforming health care through innovative technology solutions. The Site Reliability Engineer will architect and maintain cloud environments, working closely with software engineers and DevOps teams to ensure secure, high-performance infrastructure.

Responsibilities:

Build, operate, and support IaaS and PaaS infrastructure in Azure and AWS commercial and government cloud environments under established architecture and standards
Partner with development teams to help define, track, and report on SLIs, SLOs, and SLAs
Contribute to the development and support of platform services, including provisioning, configuration, deployment, and day-to-day operations
Integrate applications and platforms with centralized logging, monitoring, metrics, and incident management systems
Configure and maintain observability tools (dashboards, APMs, alerts) to help engineering teams safely operate applications in production
Participate in an on-call rotation to support software and cloud infrastructure, following documented runbooks and escalation paths
Support root cause analysis efforts and assist with remediation by implementing automation, monitoring improvements, and reliability fixes
Maintain and enhance operational tooling, scripts, and frameworks used for platform and service support
Execute performance and resiliency testing for platform services using existing frameworks and tools
Configure and tune alerts related to performance, availability, cost, security, and compliance signals
Follow and help improve operational processes, contributing automation to reduce manual and repetitive support activities

Requirements:

4+ years of experience working in a Site Reliability Engineering, Cloud Engineering, or DevOps role
Hands-on experience supporting Kubernetes clusters (managed or bare-metal) in production environments
Some hands-on experience with monitoring and observability tools (e.g., Azure Monitor, Splunk, Dynatrace, Grafana, Prometheus)
Experience using Infrastructure as Code (IaC) tools such as Terraform or Pulumi
Experience supporting infrastructure and applications in production cloud environments
Experience interacting with or supporting systems that expose RESTful APIs
Solid working knowledge of at least one major cloud service provider (Azure preferred, AWS acceptable)
Working knowledge of networking fundamentals and common internet protocols
Understanding of identity and access management (IAM) concepts and best practices
Basic understanding of security concepts including encryption, PKI, and common application security risks (e.g., OWASP)
Familiarity with Kubernetes deployment and GitOps tools such as Helm, ArgoCD, or Flux
Familiarity with IDEs and source control tools such as Visual Studio Code, GitHub, and GitLab
Ability to participate in a 24/7 on-call rotation following documented procedures and escalation paths
United States Citizenship
If you are offered this position, you will be required to provide extensive personal information to obtain and maintain a suitability or determination of eligibility for a Confidential/Secret or Top Secret security clearance as a condition of your employment
All employees working remotely will be required to adhere to UnitedHealth Group's Telecommuter Policy

Site Reliability Engineer - Remote
