onPhase is a leading provider of accounting and finance process automation software, specializing in Accounts Payable and Accounts Receivable automation. The company is seeking a Principal Data Engineer to lead its Document Intelligence initiatives, which span machine learning, data science, and intelligent document processing. The role involves designing systems that convert unstructured document data into actionable intelligence, working with cross-functional teams to do so.

Responsibilities:

Lead research and engineering efforts in document intelligence, including OCR post-processing, document classification, information extraction, and layout understanding
Design and implement scalable machine learning pipelines and data architectures that support document AI workloads in production environments
Define the technical vision and roadmap for document intelligence capabilities across the organization
Collaborate with cross-functional teams to translate business requirements into ML system designs, model architectures, and data platform decisions
Evaluate, adapt, and extend state-of-the-art NLP and vision-language models for document understanding tasks
Establish best practices for ML experimentation, model versioning, evaluation, and deployment (MLOps)
Mentor and provide technical guidance to engineers and researchers across the team
Drive data architecture decisions that support both model training pipelines and downstream analytics and reporting needs
Publish or present research findings internally and, where appropriate, externally

Requirements:

10+ years of professional experience in R&D, machine learning, applied research, or data engineering
Deep expertise in Document Intelligence, including OCR, document parsing, layout analysis, information extraction, and classification
Strong data architecture background, including experience designing data lakes, feature stores, and ML data pipelines
Proficiency in Python and relevant ML frameworks (PyTorch, TensorFlow, HuggingFace Transformers, etc.)
Experience taking ML models from research and prototyping through to production deployment at scale
Solid understanding of NLP fundamentals and modern large language/vision-language model architectures
Experience with cloud-based ML platforms and infrastructure (AWS, GCP, or Azure)
Strong written and verbal communication skills, including the ability to convey complex technical concepts to both technical and non-technical stakeholders
PhD or Master's degree in Computer Science, Machine Learning, Computational Linguistics, or a closely related field
Experience with document AI frameworks such as LayoutLM, Donut, PaddleOCR, Amazon Textract, or similar
Publications or contributions to peer-reviewed research in NLP, computer vision, or document understanding
Familiarity with enterprise document workflows such as AP automation, contract processing, medical records, or similar domains
Prior experience in a principal, staff, or lead engineer capacity with ownership over a technical domain
