Empower is a company focused on transforming financial lives and fostering a flexible work environment. The Data Reliability Engineer will manage the reliability and operational excellence of an AWS-based data platform, ensuring data pipelines and analytics platforms meet business-critical service level agreements.
Responsibilities:
- Own the reliability and stability of production data pipelines and data platform services
- Define, improve, and enforce data SLAs/SLOs for batch and streaming products, including freshness, latency, and completeness
- Diagnose and resolve data pipeline failures, delays, and data quality issues in production environments
- Investigate issues across distributed data systems, including Spark/EMR workloads, ingestion pipelines, and warehouse performance
- Lead or support incident response, including triage, mitigation, and long-term resolution
- Perform root cause analysis and implement durable fixes to prevent recurrence
- Design and enhance monitoring, alerting, and observability for data systems
- Develop automation and tooling to reduce operational toil and improve system resilience
- Contribute to disaster recovery and resiliency planning, including backup validation and recovery workflows
- Partner with engineering teams to improve pipeline design, reliability, and operational readiness
- Create and maintain runbooks, standard operating procedures (SOPs), and operational documentation
- Participate in occasional off-hours support for production data systems when required
Requirements:
- Bachelor's degree in Computer Science, Information Systems, Data Science, or a related field
- 5+ years of experience in data engineering or analytics platform roles, including 3+ years operating in a production cloud data warehouse environment such as Redshift or Snowflake
- 3+ years of experience building AWS data pipelines and supporting them through production, including exposure to real-world failures and operational challenges
- 3+ years of experience working with production data platforms in AWS environments, with a focus on anomaly detection, reconciliation, and end-to-end validation
- 3+ years of experience with Python and SQL in real data systems
- Hands-on experience troubleshooting distributed data processing systems such as Spark/EMR, Redshift, and streaming platforms
- Proven ability to debug and resolve production issues in data pipelines and data platforms
- Experience with AWS data services such as EMR, Redshift, DynamoDB, S3, or similar
- Proven ability to handle production incidents and perform root cause analysis
- Strong problem-solving mindset and ability to work through ambiguous production issues
- Experience handling real-world data issues such as pipeline delays or failures
- Experience with data backfills and reprocessing
- Experience with late-arriving data or incomplete datasets
- Experience improving observability and alerting specifically for data systems
- Experience influencing or guiding data pipeline reliability and operational practices
- Exposure to streaming or event-driven systems such as Kafka or Kinesis, and to change data capture (CDC) patterns
- Experience with disaster recovery, backup validation, and resiliency testing
- Strong communication during incidents with both technical and non-technical stakeholders
- Prior FinOps or capacity-planning ownership for data platforms
- Familiarity with BI semantic layers in tools such as Looker, Power BI, or Tableau, and with contract enforcement at the consumption layer