Noon is on a mission to reinvent how designers work in the AI era by building next-generation AI design tools for product teams. They are hiring an AI Platform Engineer to own their models in production, optimize latency and cost, and ensure the reliability of the AI capabilities in their product.
Responsibilities:
- Architect and operate the inference platform: serving stack, autoscaling, multi-tenancy, observability
- Optimize end-to-end latency (TTFT, TPOT, p95) with quantization, batching, KV-cache management, and speculative decoding
- Design multi-GPU parallelism strategies (DP / TP / PP) and own GPU utilization and cost economics
- Build a hybrid local + cloud serving architecture: small models on the user’s Mac for fast paths, larger models in the cloud for slow paths
- Own canary deployment, rollback automation, and SLO/SLA-driven reliability for all AI features
- Build production observability: latency, drift, quality, and cost dashboards
- Evaluate and integrate inference engines (vLLM, Triton, TGI, TensorRT, MLX) for cloud and on-device paths
- Take fine-tuned models from research artifacts to production traffic