Unconventional AI is pioneering a new, significantly more efficient foundation for AI computing. This role focuses on developing a next-generation ML model training platform: designing training systems, optimizing performance, and collaborating across teams to push the boundaries of AI technology.
Responsibilities:
- Build and maintain highly optimized, model-specific training stacks tuned for state-of-the-art generative vision, language, and world models
- Design and scale multi-node distributed training systems, implementing elastic sharding and robust data streaming pipelines for fast, large-scale iteration
- Implement robust model checkpointing and recovery mechanisms
- Develop and optimize kernels using low-level programming models like CUDA and Triton
- Design rigorous benchmarking suites to track Model FLOPs Utilization (MFU), memory bandwidth, and convergence stability
- Act as a translator, discussing algorithmic trade-offs with theorists and converting model requirements into concrete specifications for infrastructure and hardware engineering teams