Memorial Sloan Kettering Cancer Center (MSK) is dedicated to ending cancer for life through innovative research and clinical care. MSK is seeking a Data Engineer to join the Prostate Cancer Clinical Trials Consortium (PCCTC). The role involves designing and maintaining the data storage and access infrastructure for clinical trials, ensuring data is organized and available for analysis.

Responsibilities:

Implement and maintain relational database structures for clinical trial data storage in AWS S3, using tools such as DuckDB and/or DuckLake
Build and maintain ETL pipelines that ingest data from clinical trial data systems (e.g. EDCs), transforming raw clinical data into organized, versioned, analysis-ready datasets
Develop access layers (database connectors, internal R packages or utilities) that enable our R-focused Data Science Team to query and retrieve data efficiently
Implement and maintain access management and permissioning structures across data systems, including SharePoint and Airtable, ensuring consistent and scalable controls as the team and trial portfolio grow
Maintain data governance standards, including naming conventions, versioning, and documentation, across our active trial portfolio
Collaborate with Clinical Operations and Data Management teams to understand data flows from sites and ensure upstream processes align with downstream analytic needs
Use GitHub Enterprise for version control and contribute to CI/CD workflows for pipeline automation where infrastructure allows

Requirements:

An undergraduate degree, preferably in computer science, data engineering, information systems, or a related field
2–4 years of experience building or maintaining data pipelines, ETL processes, or database systems
Working knowledge of SQL and relational database concepts
Familiarity with cloud storage (AWS S3 preferred) and infrastructure-as-code principles
Experience with access management, permissioning, or user administration across collaborative platforms
Exposure to version control (git/GitHub)
Passion for data and creating reliable systems that empower cancer care and clinical research
Strong problem-solving and analytical thinking skills with the ability to troubleshoot complex data and system issues
Excellent collaboration and communication skills, with the ability to work effectively across technical and non-technical teams
Highly organized with strong attention to detail and a commitment to data accuracy, quality, and documentation
Ability to manage multiple priorities in a fast-paced environment while meeting deadlines
Proactive, adaptable, and eager to learn new technologies and contribute to continuous process improvement
Self-motivated and able to work independently in a fully remote environment while remaining an engaged team member
Familiarity with any of the following:
CDISC data standards (SDTM, ADaM) and how clinical trial data is structured
DuckDB, DuckLake, or similar analytical database technologies for complex, relational data sets
R (you won't need to be an R programmer, but understanding how R users consume data will make you more effective)
Airtable structure and maintenance (we have several operational data systems here)
SharePoint administration and file system permissioning at scale
CI/CD for orchestrating automated data pipelines
Clinical trial data lifecycle, from EDC capture through analysis-ready datasets

Data Engineer, PCCTC
