Build and maintain high-performance streaming and batch data pipelines that power AI applications, ensuring reliable low-latency ingestion and high-throughput processing
Implement and extend embedding generation workflows, vector store integrations, and retrieval pipelines supporting semantic search, RAG systems, and AI assistants
Develop and optimize scalable storage and retrieval patterns, focusing on cost-efficient architecture and stable production performance
Implement AI-optimized data models and storage patterns that align with broader enterprise architecture and platform requirements
Collaborate closely with infrastructure, ML engineering, product, and governance teams to deliver production-ready AI capabilities
Lead by example through strong execution, high-quality code, and proactive problem solving
Requirements
5+ years of data engineering experience, with at least 1 year in a lead or senior technical role
Experience building and scaling streaming data pipelines in large-scale, distributed environments
Strong skills in Python, Java, and SQL, with expert-level proficiency in either Python or Java
Proven experience building streaming data pipelines (e.g., Kafka, Flink, Spark, Kinesis)
Experience with embedding pipelines and vector stores (e.g., Pinecone, Weaviate, FAISS, pgvector)
Strong knowledge of data modeling, storage optimization, and retrieval patterns for large-scale systems
Hands-on experience with workflow orchestration tools (e.g., Airflow, Dagster)
Familiarity with testing, monitoring, and automation for data pipelines
Tech Stack
Airflow
Java
Kafka
Python
Spark
SQL
Benefits
A bonus, long-term incentive units, or both may be provided as part of the compensation package
Full range of medical, financial, and/or other benefits