Callosum is the Intelligent Systems Company focused on creating heterogeneous AI solutions. The Inference System & Performance Engineer will own end-to-end performance for inference platforms, including KV cache strategy, memory management, and multi-node scheduling.

Responsibilities:

Design and optimise inference serving systems across heterogeneous multi-GPU and multi-node environments
Own KV cache lifecycle management, batching strategies, and memory allocation to maximise throughput and minimise latency
Profile and tune GPU kernels, identify bottlenecks across compute, memory, and network, and implement targeted optimisations
Build and improve scheduling logic for continuous batching, disaggregated prefill/decode, and speculative decoding
Work with networking primitives - NCCL, NVLink, RDMA, InfiniBand, RoCE - to optimise communication across distributed inference workloads
Develop tooling for performance visibility, regression detection, and benchmarking across hardware configurations

Inference System & Performance Engineer - Member of Technical Staff

About this role

Responsibilities: