CVS Health is building a world of health around every individual, and they are seeking a Principal Software Engineer to lead production engineering and operational excellence for their AI Platform. This role focuses on ensuring AI services are scalable, observable, and resilient while defining best practices and collaborating across teams to meet high availability and performance standards.

Responsibilities:

Own and evolve production operations strategy for AI/ML platforms and services
Define SLOs, SLIs, and error budgets for AI systems (online and batch inference pipelines)
Lead root cause analysis (RCA) and drive systemic improvements post-incident
Establish operational readiness standards for launching new AI capabilities
Build frameworks for on-call excellence, incident response, and escalation
Design and implement end-to-end observability systems across AI workloads: model performance monitoring, data pipeline health, and infrastructure metrics
Build and scale monitoring and alerting frameworks using modern tooling (e.g., Prometheus, Grafana, OpenTelemetry, Datadog, Azure Monitor)
Define actionable, low-noise alerts tied to business and system impact
Develop dashboards and telemetry standards for real-time visibility across services
Drive adoption of golden signals (latency, errors, throughput, saturation) in AI systems
Ensure reliable deployment and operation of real-time inference services, model pipelines (training, validation, deployment), and data ingestion and feature pipelines
Implement model observability (drift detection, data skew, performance degradation)
Partner with ML engineers to improve production readiness of models
Establish lifecycle standards for models in production environments
Build internal platforms and tooling for automated incident detection and response, self-healing systems, and deployment validation and canarying
Drive Infrastructure as Code (IaC) and policy automation
Improve system resilience through chaos testing and fault injection
Act as a trusted technical advisor across platform, ML, and product teams
Set direction for operational excellence in AI systems at org scale
Mentor senior engineers and influence cross-team architectural decisions
Lead adoption of industry best practices in reliability engineering and observability

Requirements:

10+ years in software engineering, production engineering, or SRE roles
Deep experience operating large-scale distributed systems in production
Proven track record building monitoring, observability, and alerting systems
Strong expertise in incident management and production support models
Experience working with cloud platforms (Azure, AWS, GCP)
Experience supporting AI/ML platforms or data-intensive systems
Familiarity with model lifecycle management and MLOps practices
Knowledge of OpenTelemetry, Prometheus, Grafana, and Datadog; Kubernetes and containerized workloads; and streaming systems (Kafka, Event Hub, etc.)
Experience defining and implementing SLO-driven engineering
Background in high-availability, low-latency systems

Principal Software Engineer – AI Platform (Production Engineering / Reliability)
