Oriole Networks is looking for a Senior ML Systems Engineer to build and validate simulation infrastructure for large-scale machine learning systems. The role involves modeling the compute and communication behavior of systems used for ML training and inference, and using simulation to guide architecture and performance optimization.
Responsibilities:
- Build simulation models for compute, memory, interconnect, and communication behavior in ML systems
- Develop tools to simulate performance for training and inference workloads
- Model distributed execution across accelerators, hosts, and network fabrics, including collectives, synchronization, and communication bottlenecks
- Use simulation and analytical modeling to evaluate tradeoffs, identify bottlenecks, and guide system design
- Run performance experiments and benchmarks on real ML systems to calibrate and validate simulation models
- Analyze end-to-end performance, including throughput, latency, scaling efficiency, utilization, and cost/performance tradeoffs
- Partner with hardware, software, networking, and ML teams to align simulation with real workloads and constraints
- Create reproducible benchmarking methodologies across models and system configurations, and compare simulation results against real system measurements to demonstrate validity
- Communicate findings through technical reports and design recommendations