Delos-data is a stealth-mode startup focused on building foundational technology for large-scale AI data center clusters. The company is seeking a talented System Software Engineer to design and implement communication and execution primitives for efficient AI model operations across thousands of GPUs.
Responsibilities:
- Collaborate across the stack to influence the design of our foundational technology, ensuring it meets the needs of next-generation AI models
- Identify and resolve performance bottlenecks in distributed training and inference workloads through deep-dive analysis of the software-hardware interface
- Conduct rigorous performance benchmarking and characterization on multi-node clusters
Requirements:
- Strong proficiency in C++ and Python, with a deep understanding of systems programming fundamentals (memory management, concurrency, OS internals)
- Proficiency in a Linux development environment
- Bachelor's or Master's degree in Computer Engineering, Computer Science, or a related field
- Experience with GPU programming (CUDA) and performance optimization for parallel architectures
- Familiarity with distributed AI frameworks (PyTorch, JAX, or DeepSpeed) and/or inference engines (vLLM, SGLang, Dynamo/TRT-LLM)
- Hands-on experience with large-scale cluster orchestration and telemetry tools