Build and fine‑tune models for detection, tracking, segmentation (2D/3D), pose & activity recognition, and scene understanding (incl. 360° and multi‑view)
Train/evaluate vision–language models (VLMs) for grounding, dense captioning, temporal QA, and tool‑use; design retrieval‑augmented and agentic loops for perception‑action tasks
Prototype perception‑in‑the‑loop policies that close the gap from pixels to actions (simulation + real data)
Curate datasets, author high‑signal evaluation protocols/KPIs, and run ablations that make irreproducible results impossible
Package research into reliable services on a modern stack (Kubernetes, Docker, Ray, FastAPI), with profiling, telemetry, and CI for reproducible science
Orchestrate multi‑agent pipelines (e.g., LangGraph‑style graphs) that combine perception, reasoning, simulation, and code‑generation to self‑check and self‑correct
Requirements
Ph.D. student in CS/EE/Robotics (or a related field), actively publishing in CV/ML/Robotics
Strong PyTorch (or JAX) and Python; comfort with CUDA profiling and mixed‑precision training
Demonstrated research in computer vision and at least one of: VLMs (e.g., LLaVA‑style, video‑language models), embodied/physical AI, 3D perception
Proven ability to move from paper → code → ablation → result with rigorous experiment tracking