Era4 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. They are seeking a Platform Engineer (HPC & AI) to help shape their new Platform team. The role focuses on customer-facing technical troubleshooting and on collaboration with vendor engineering teams to keep AI platform operations seamless.
Responsibilities:
- Designing, deploying, and managing large‑scale HPC and GPU‑accelerated clusters, including NVIDIA‑based compute environments
- Implementing and administering HPC scheduling and resource‑management systems (e.g., Slurm), including GPU partitioning, workload scheduling, and capacity planning
- Architecting and optimising InfiniBand and Ethernet network topologies
- Ensuring high availability and resilience through failover strategies, planned maintenance coordination, and proactive risk mitigation
- Automating provisioning, configuration, monitoring, and operational workflows across multi‑vendor HPC hardware and software stacks
- Monitoring real‑time performance and leading troubleshooting efforts across compute, storage, interconnect, drivers, and node failures, engaging vendor support for critical issues
- Leading incident response for node failures, network faults, and driver issues: troubleshooting common problems first, then working with vendor support to resolve critical ones
- Managing security and access control, including user permissions, RBAC, security hardening, and data protection