Biohub is a non-profit research lab focused on accelerating scientific discovery through AI and advanced computing. It is seeking a Staff HPC Engineer to lead the evolution of its hybrid HPC and AI platform, integrating cutting-edge technology to support AI biology research and enhance computational capabilities.
Responsibilities:
- Build and support a hybrid HPC-AI environment with large-scale on-prem compute/storage and elastic cloud GPU clusters (CoreWeave, AWS, GCP)
- Architect and optimize environments for large-scale AI training and tuning, and low-latency scientific workloads
- Integrate MLOps and model deployment pipelines into HPC infrastructure, ensuring reproducibility and efficiency
- Implement advanced resource scheduling and orchestration (Slurm, Kubernetes, SUNK [Slurm on Kubernetes]) optimized for mixed HPC and AI workflows
- Support researchers with job optimization, GPU utilization best practices, and performance tuning for AI and HPC applications
- Evaluate, deploy, and maintain AI/ML software stacks (e.g., PyTorch, TensorFlow, Hugging Face, RAPIDS) and HPC toolchains
- Ensure robust data ingest, analysis, and management capabilities for AI and HPC workloads, including integration with parallel file systems and object storage
- Work with diverse science teams to translate research requirements into hardware/software solutions, from experimental design through publication
- Promote best practices for AI model training, validation, and deployment in shared computing environments
- Foster a culture of shared learning by running internal workshops on HPC-AI tooling (e.g., VS Code remote dev, containerization, MLOps workflows)