Trans Ova Genetics is a company focused on animal genetics and bioinformatics. The company is seeking a Sr. Data Engineer to design, develop, and maintain data integration, analytics, and reporting solutions that support its operational and scientific workloads.
Responsibilities:
- Design, develop, and maintain robust and efficient ETL/ELT pipelines and processes on Databricks for both operational and bioinformatics datasets (e.g., genomic markers, phenotypic data, laboratory outputs)
- Ingest, transform, and harmonize structured and semi-structured biological data from lab systems, LIMS, sequencing platforms, and external partners into the enterprise data platform
- Troubleshoot and resolve Databricks pipeline errors and performance issues
- Optimize data flow performance and minimize data latency across scientific and business use cases
- Implement data quality checks, validations, and reconciliation processes within ETL workflows, including domain-specific checks for genomic and phenotypic data
- Develop and maintain Databricks pipelines, notebooks, and datasets using Python, Spark, and SQL
- Optimize Databricks jobs for performance and cost-effectiveness, including large-scale bioinformatics and analytics workloads
- Integrate Databricks with other data sources and systems, including lab instruments, genomic databases, and on-prem or cloud data stores
- Participate in the design and implementation of data lake architectures that support both traditional analytics and bioinformatics pipelines
- Participate in the design and implementation of data warehousing solutions to support reporting, analytics, and scientific modeling
- Model and curate subject areas for genetics, reproduction, and bioinformatics (e.g., animals, pedigrees, genotypes, breeding values, trials)
- Support data quality initiatives and implement data cleansing procedures across business and scientific domains
- Collaborate with business users, scientists, geneticists, and bioinformaticians to understand data requirements for department-driven reporting and analytics needs
- Maintain and extend the existing library of complex dashboards and visualizations to surface genetic, reproductive, and operational insights
- Enable self-service analytics for R&D and product teams by exposing well-governed, documented data products
- Troubleshoot and resolve report issues, including performance bottlenecks and data inconsistencies
- Apply strong programming skills in Python, SQL, and Spark to build scalable data and bioinformatics workflows
- Use CI/CD and IaC tools (Terraform, ARM templates, CloudFormation) to automate deployment of data platform components and analytics environments
- Design and implement Databricks platform architecture on Azure and AWS infrastructure, including environments that support large-scale scientific computation
- Contribute to cloud security, governance, and cost optimization practices for data and bioinformatics workloads
- Partner with geneticists, biostatisticians, and bioinformaticians to translate scientific requirements into scalable data and platform architectures
- Support or orchestrate bioinformatics pipelines (e.g., variant processing, quality control, annotation, genotype imputation, genomic evaluation) using cloud and Databricks capabilities
- Ensure that data models, pipelines, and storage structures meet the needs of downstream analytics, predictive models, and genetic evaluations
- Advocate for best practices in managing sensitive biological and genetic data, including data governance, access control, and compliance with relevant standards and regulations
- Thrive in an entrepreneurial, self-starting, and fast-paced environment, working both independently and with our highly skilled teams
- Collaborate effectively with business users, data analysts, scientists, and other IT teams
- Communicate technical information clearly and concisely, both verbally and in writing, to technical and nontechnical stakeholders
- Document all development work, data models, and procedures thoroughly, including bioinformatics and scientific data flows
- Keep abreast of the latest advancements in data integration, cloud platforms, bioinformatics tooling, and data engineering technologies
- Continuously improve skills and knowledge through training and self-learning in both data engineering and bioinformatics domains
Requirements:
- Bachelor's degree in Computer Science, Information Systems, Bioinformatics, Computational Biology, or a related field; a Master's degree is an asset
- 7+ years of experience in data integration and reporting, with experience designing and operating cloud-based data platforms
- Extensive experience with Databricks, including Python, Spark, and Delta Lake
- Strong proficiency with relational databases (e.g., SQL Server, RDS), including T-SQL, stored procedures, and functions
- Experience with data warehousing concepts and best practices
- Experience with Microsoft Azure cloud platform; exposure to Microsoft Fabric is desirable
- Hands-on experience working with biological, genomic, or other omics datasets in a bioinformatics or life sciences setting (e.g., sequence data, SNP arrays, GWAS outputs, phenotypic traits)
- Strong analytical and problem-solving skills, with the ability to reason about complex data and scientific requirements
- Excellent communication and interpersonal skills
- Ability to work independently and as part of a cross-functional team across IT, science, and business
- Experience with Agile methodologies
- Demonstrated background in bioinformatics or computational biology, preferably supporting genetics, breeding, or life science research in an applied or commercial context
- Must be legally authorized to work in the United States
- Familiarity with common bioinformatics tools, data formats (e.g., FASTQ, VCF, PLINK), and workflows is highly desirable