Soho Square Solutions is seeking a Senior Data Engineer with expertise in the Databricks ecosystem and Medallion Architecture to lead a critical data initiative. The role involves designing and implementing scalable data pipelines that transform unstructured documents into structured datasets while meeting compliance and audit-readiness requirements.
Responsibilities:
- Pipeline Architecture: Architect, build, and maintain production-grade data pipelines on Databricks using the Medallion Architecture (Bronze -> Silver -> Gold)
- Unstructured Data Engineering: Design robust frameworks to ingest and transform unstructured data formats (PDFs, images, Word docs, text logs) from enterprise source systems into structured, query-ready Gold-layer assets
- Regulatory Data Curation: Partner with Quality and Compliance teams to design data models optimized for rapid audit retrieval and regulatory inspection readiness
- Framework Development: Build reusable data quality validation frameworks, monitoring rules, and error-handling mechanisms across all pipeline stages
- Platform Governance: Leverage Databricks features (Unity Catalog, Workflows) to ensure data lineage, security compliance, and access control across dev and prod workspaces
Requirements:
- 7+ years of hands-on data engineering experience, with a heavy focus on Python, SQL, and PySpark
- 3+ years of production experience designing and deploying Medallion Architecture frameworks natively inside Databricks
- Demonstrated experience building extraction and parsing pipelines for unstructured data (text and metadata from PDFs, images, and Word docs)
- Proven ability to build highly reliable data transformation frameworks from the ground up
- Technical familiarity or integration experience with enterprise systems: TrackWise, SAP, and Salesforce
- Experience working within highly regulated industries (Life Sciences, Pharma, Biotech, or Medical Devices) under GxP or strict compliance standards
- Experience with document parsing libraries or cloud OCR tools (e.g., Azure Document Intelligence, AWS Textract, Unstructured.io)