onPhase is a leading provider of accounting and finance process automation software, specializing in Accounts Payable and Accounts Receivable automation. The company is seeking a Principal Data Engineer to lead its Document Intelligence initiatives, which span machine learning, data science, and intelligent document processing. The role involves designing systems that convert unstructured document data into actionable intelligence and collaborating with cross-functional teams to achieve this goal.
Responsibilities:
- Lead research and engineering efforts in document intelligence, including OCR post-processing, document classification, information extraction, and layout understanding
- Design and implement scalable machine learning pipelines and data architectures that support document AI workloads in production environments
- Define the technical vision and roadmap for document intelligence capabilities across the organization
- Collaborate with cross-functional teams to translate business requirements into ML system designs, model architectures, and data platform decisions
- Evaluate, adapt, and extend state-of-the-art NLP and vision-language models for document understanding tasks
- Establish best practices for ML experimentation, model versioning, evaluation, and deployment (MLOps)
- Mentor and provide technical guidance to engineers and researchers across the team
- Drive data architecture decisions that support both model training pipelines and downstream analytics and reporting needs
- Publish or present research findings internally and, where appropriate, externally
Requirements:
- 10+ years of professional experience in R&D, machine learning, applied research, or data engineering
- Deep expertise in Document Intelligence — including OCR, document parsing, layout analysis, information extraction, and classification
- Strong data architecture background, including experience designing data lakes, feature stores, and ML data pipelines
- Proficiency in Python and relevant ML frameworks (PyTorch, TensorFlow, HuggingFace Transformers, etc.)
- Experience taking ML models from research and prototyping through to production deployment at scale
- Solid understanding of NLP fundamentals and of modern large language model and vision-language model architectures
- Experience with cloud-based ML platforms and infrastructure (AWS, GCP, or Azure)
- Strong written and verbal communication skills, including the ability to convey complex technical concepts to both technical and non-technical stakeholders
- PhD or Master's degree in Computer Science, Machine Learning, Computational Linguistics, or a closely related field
- Experience with document AI frameworks such as LayoutLM, Donut, PaddleOCR, Amazon Textract, or similar
- Publications or contributions to peer-reviewed research in NLP, computer vision, or document understanding
- Familiarity with enterprise document workflows — AP automation, contract processing, medical records, or similar domains
- Prior experience in a principal, staff, or lead engineer capacity with ownership over a technical domain