The START Center for Cancer Research is the world’s largest early-phase site network dedicated to oncology clinical research. They are hiring a Sr. Data Engineer to build the data infrastructure for their Enterprise Data Platform, with a focus on developing data pipelines and ensuring data quality and lineage across integrated systems.
Responsibilities:
- Build and maintain ingestion pipelines from source systems (OnCore, NetSuite, HubSpot, Microsoft Lists, Snowflake, FileMaker) into ADLS/Databricks
- Implement incremental load patterns, change data capture, and idempotent pipeline design to ensure reliability
- Design and implement metadata, data quality, and data lineage capabilities
- Design and implement access control (RBAC) capabilities
- Design and implement data rights and licensing capabilities
- Develop the ETL/ELT processes that feed Lakehouse/Relational/Warehouse modeling requirements (matching, deduplication, golden record assembly) as designed by the MDM Lead
- Build publication pipelines that push canonical data models from the Data Platform back to spoke systems, coordinating with the Integration Platform (Boomi-based) and Messaging Platform (Azure Service Bus-based) teams
- Implement Lakehouse/Relational/Warehouse tables for golden records across priority entities: Study/Protocol, Customer/Sponsor, Item/Charge Code, and Contract
- Build the matching and survivorship logic based on rules defined by data model requirements and validated by business stakeholders
- Implement versioning, lineage and audit trails using Delta Lake time travel capabilities for full traceability of master data changes
- Configure Unity Catalog for data governance, access controls, and lineage tracking
- Build automated data quality checks at ingestion, transformation, and publication stages
- Develop data quality dashboards and alerting (integrate with existing monitoring tools or build in Databricks SQL)
- Implement reconciliation count checks between source systems and the hub to detect drift or sync failures
- Create exception handling pipelines that surface records requiring manual review
- Build the data models supporting Revenue Cycle reconciliation (e.g. OnCore-to-NetSuite), matching clinical events to financial transactions
- Develop the unbilled-vs-billed tracking datasets that compare recognized sales order lines against invoiced amounts
- Create revenue accrual support datasets that feed the finance team’s automated journal entry processes in NetSuite
- Support pass-through item mapping and amendment pricing reconciliation data needs as the finance team defines requirements
- Establish CI/CD patterns for Databricks notebooks and jobs (Repos integration, testing frameworks)
- Configure and manage job scheduling, cluster policies, and cost optimization
- Maintain dev/staging/production environment separation
- Document all pipelines, data models, and operational procedures
Requirements:
- 4+ years of experience as a data engineer, with at least 2 years on Azure Databricks or equivalent Spark-based platforms
- Strong, current proficiency in Azure Data Factory, ADLS Gen2, Databricks, Delta Lake, Databricks SQL, Azure SQL
- Strong, current proficiency in Python, SQL, PySpark, and Spark SQL
- Experience with data lakes, lakehouses, and data warehouses
- Experience building production data pipelines with proper error handling, retry logic, idempotency, and monitoring
- Familiarity with Azure and Azure ecosystem services
- Experience with Unity Catalog, Purview or equivalent data governance/catalog tooling
- Experience with Data Governance guidelines such as Data Classification, Retention, De-Identification, Tenancy, Sovereignty and Data Standards
- Experience with CI/CD for data engineering workloads (Databricks Repos, Azure DevOps, or similar)
- Experience with MDM data pipelines (matching, deduplication, golden record logic)
- Familiarity with ERP data models (NetSuite preferred) or clinical trial management systems (OnCore)
- Experience with Snowflake (the existing analytical layer we are integrating with)
- Experience with Boomi or other iPaaS tools from a data engineering perspective
- Background in healthcare, life sciences, or clinical research data
- Experience building financial reconciliation or revenue recognition datasets