Empower is a company focused on transforming financial lives and fostering a flexible work environment. The Data Reliability Engineer is responsible for the reliability and operational excellence of an AWS-based data platform, ensuring that data pipelines and analytics platforms meet business-critical service level agreements (SLAs).

Responsibilities:

Own the reliability and stability of production data pipelines and data platform services
Define, improve, and enforce data SLAs/SLOs for batch and streaming products, including freshness, latency, and completeness
Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
Investigate issues across distributed data systems, including Spark/EMR workloads, ingestion pipelines, and warehouse performance
Lead or support incident response, including triage, mitigation, and long-term resolution
Perform root cause analysis and implement durable fixes to prevent recurrence
Design and enhance monitoring, alerting, and observability for data systems
Develop automation and tooling to reduce operational toil and improve system resilience
Contribute to disaster recovery and resiliency planning, including backup validation and recovery workflows
Partner with engineering teams to improve pipeline design, reliability, and operational readiness
Create and maintain runbooks, standard operating procedures (SOPs), and operational documentation
Participate in occasional off-hours support for production data systems when required

Requirements:

Bachelor's degree in Computer Science, Information Systems, Data Science, or a related field
5+ years of experience in data engineering or analytics platform roles, including 3+ years operating in a production cloud data warehouse environment such as Redshift or Snowflake
3+ years of experience building AWS data pipelines and supporting them through production, including exposure to real-world failures and operational challenges
3+ years of experience working with production data platforms in AWS environments, with a focus on anomaly detection, reconciliation, and end-to-end validation
3+ years of experience with Python and SQL in real data systems
Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming systems
Proven ability to debug and resolve production issues in data pipelines and data platforms
Experience with AWS data services such as EMR, Redshift, DynamoDB, and S3
Proven ability to handle production incidents and perform root cause analysis
Strong problem-solving mindset and ability to work through ambiguous production issues
Experience handling real-world data issues such as pipeline delays or failures
Experience with data backfills and reprocessing
Experience with late-arriving data or incomplete datasets
Experience improving observability and alerting specifically for data systems
Experience influencing or guiding data pipeline reliability and operational practices
Exposure to streaming or event-driven systems such as Kafka and Kinesis, and to change data capture (CDC) patterns
Experience with disaster recovery, backup validation, and resiliency testing
Strong communication during incidents with both technical and non-technical stakeholders
Prior FinOps or capacity-planning ownership for data platforms
Familiarity with BI semantic layers and contract enforcement at the consumption layer, in tools such as Looker, Power BI, or Tableau
