H1 is dedicated to providing optimal access to healthcare information and is seeking a Staff Data Engineer for its Emerald team. The role leads the architecture and scalability of H1’s healthcare entity resolution platform, manages a small team, and collaborates with stakeholders across the company to improve the platform's efficiency and accuracy.
Responsibilities:
- Lead the design, optimization, and scalability of distributed Spark/PySpark pipelines powering entity resolution and large-scale healthcare data processing
- Own systems supporting automatching, identity mapping, grouping logic, deduplication, enrichment, and auto-approval workflows across healthcare provider and organization datasets
- Build and maintain scalable processing frameworks for PubMed, clinical trial, ct.gov, conference, and other healthcare data sources
- Drive infrastructure optimization initiatives focused on improving throughput, runtime, observability, and cloud compute cost efficiency
- Partner closely with AI/ML teams to integrate matching and resolution models into EMERALD and improve matching precision and recall
- Lead complex technical initiatives from architecture and design through deployment, monitoring, and long-term production support
- Serve as a technical leader and mentor across the team through code reviews, technical guidance, and engineering best practices
- Collaborate directly with Product and business stakeholders to align technical solutions with operational and customer needs
- Support production operations, incident response, troubleshooting, and ongoing platform reliability
Requirements:
- 8+ years of experience building and maintaining large-scale distributed data systems and pipelines
- Demonstrated technical leadership experience mentoring engineers and driving complex technical initiatives
- Extensive experience with Apache Spark and AWS-based big data technologies including EMR, S3, and distributed compute environments
- Strong coding experience in Python (PySpark), Scala, Java, or equivalent languages used for distributed processing systems
- Experience optimizing large-scale Spark workloads for performance, scalability, and infrastructure cost efficiency
- Experience with streaming and event-driven architectures using technologies such as Kafka or Spark Streaming
- Experience with orchestration and lakehouse technologies such as Argo and Hudi or comparable platforms
- Experience with containerization and infrastructure technologies such as Docker, Kubernetes, and Terraform
- Experience working with relational or distributed databases such as PostgreSQL or Redshift
- Proven ability to operate effectively within highly scalable, production-grade distributed systems
- Deep expertise with distributed data processing frameworks such as Apache Spark and Hadoop, particularly within AWS environments
- Strong proficiency in Python (PySpark), Scala, Java, or other modern programming languages used for large-scale distributed processing
- Experience building scalable ETL/ELT frameworks across both batch and streaming architectures
- Strong understanding of file formats used in distributed data processing, including Apache Parquet and Apache Avro
- Experience with streaming technologies such as Kafka, Spark Streaming, or KSQL
- Strong grasp of software engineering fundamentals including distributed systems, data structures, concurrency, and system design
- Experience performing root cause analysis across large-scale distributed systems and complex data pipelines
- Ability to write clean, maintainable, modular, and production-grade code
- Experience improving performance, scalability, observability, and infrastructure efficiency within distributed systems
- Strong communication and collaboration skills across both technical and non-technical stakeholders
- Familiarity with modern development and infrastructure tooling including Git, CI/CD pipelines, Docker, Kubernetes, Terraform, Argo, Hudi, and Jira
- Experience with entity resolution, identity mapping, automatching, deduplication, or large-scale matching systems is strongly preferred
- Experience working with healthcare, life sciences, Real World Evidence (RWE), or large-scale healthcare datasets is strongly preferred