SPREEAI is building the future of AI-powered commerce through photorealistic virtual try-on and multimodal intelligence. The company is looking for a Principal Engineer to build the infrastructure, deployment pipelines, and observability systems that take multimodal AI models from research prototypes to reliable, production-grade deployments.

Responsibilities:

Build and operate SPREEAI’s end-to-end ML platform spanning training, evaluation, deployment, and monitoring
Enable scalable and reliable training workflows through orchestration, infrastructure, and resource management systems
Define platform standards for model packaging, model registry, dataset lineage, experiment tracking, checkpointing, and deployment automation
Enable reliable and scalable inference deployments through standardized serving, orchestration, and monitoring frameworks
Build and operate model deployment pipelines with versioning, reproducibility, rollback, approval gates, evaluation gates, and production observability
Establish production SLOs for latency, availability, error rate, GPU saturation, cold-start time, cost per inference, and model quality drift
Standardize and support serving infrastructure using modern inference runtimes such as vLLM, NVIDIA Triton, TensorRT-LLM, Ray Serve, TorchServe, ONNX Runtime, or equivalent systems
Design and manage GPU allocation, scheduling, and resource utilization across training and inference workloads
Improve GPU utilization, throughput, latency, reliability, and cost efficiency across model lifecycle systems
Design and operate model evaluation and benchmarking systems, including automated regression detection and quality gates for production releases
Partner with research teams to productionize new capabilities by providing robust infrastructure, tooling, and deployment pathways

Principal Engineer, AI Platform & Infrastructure