Callosum is the Intelligent Systems Company focused on creating heterogeneous AI solutions. The Inference System & Performance Engineer will own end-to-end performance for inference platforms, including KV cache strategies, memory management, and multi-node scheduling.
Responsibilities:
- Design and optimise inference serving systems across heterogeneous multi-GPU and multi-node environments
- Own KV cache lifecycle management, batching strategies, and memory allocation to maximise throughput and minimise latency
- Profile and tune GPU kernels, identify bottlenecks across compute, memory, and network, and implement targeted optimisations
- Build and improve scheduling logic for continuous batching, disaggregated prefill/decode, and speculative decoding
- Work with communication libraries and interconnects (NCCL, NVLink, RDMA, InfiniBand, RoCE) to optimise communication across distributed inference workloads
- Develop tooling for performance visibility, regression detection, and benchmarking across hardware configurations