Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We are seeking an AI Data Infrastructure Engineer to build and operate the large-scale data systems that power modern AI training and evaluation pipelines. The role focuses on data ingestion, transformation, quality assurance, and high-throughput delivery.
Responsibilities:
- Design and operate large-scale data pipelines supporting AI training, evaluation, and continual improvement workflows
- Build ingestion systems for diverse modalities including text, image, audio, video, and structured signals
- Implement data cleaning, deduplication, filtering, and quality assurance at petabyte scale
- Develop dataset versioning, lineage, and provenance tracking systems suitable for reproducible training
- Build high-throughput data loading systems that maximize GPU utilization during training
- Implement labeling workflows, active learning pipelines, and human-in-the-loop data improvement systems
- Design storage architectures balancing cost, throughput, and latency across data tiers
- Build evaluation dataset construction pipelines with strict integrity and contamination controls
- Implement data privacy, redaction, and consent enforcement throughout the pipeline
- Collaborate with ML researchers and engineers to align data systems with model development needs
- Drive observability of data quality, drift, and pipeline health across the AI data estate
- Optimize cost and performance through compression, format selection, and caching strategies
- Document data systems, schemas, and operational procedures for broad internal use
- Stay current with AI data infrastructure research and emerging open-source tools
Requirements:
- Bachelor's or Master's degree in Computer Science or a related field
- Six or more years of data engineering experience, with significant work supporting ML or AI workloads
- Strong proficiency in Python and at least one JVM or systems language
- Deep experience with modern data processing frameworks such as Spark, Ray, or Beam
- Hands-on experience operating petabyte-scale storage and pipeline systems
- Strong understanding of distributed systems, data modeling, and storage formats
- Experience with dataset versioning, lineage, and reproducibility for ML workflows
- Familiarity with high-throughput data loading for accelerator-based training
- Strong software engineering practices including testing, CI/CD, and code review
- Excellent communication and cross-functional collaboration skills
- Experience with multimodal datasets at large scale
- Familiarity with data quality tooling and dataset evaluation methodology
- Exposure to privacy-preserving data systems and regulated data handling
- Open-source contributions to data infrastructure projects
- Experience supporting frontier model training pipelines