Era4 develops, owns, and operates AI infrastructure across the UK, powered by renewable energy. They are seeking a Platform Engineer (HPC & AI) to help shape their new Platform team, with a focus on customer‑facing technical troubleshooting and collaboration with vendor engineering teams to keep AI platform operations running smoothly.

Responsibilities:

Designing, deploying, and managing large‑scale HPC and GPU‑accelerated clusters, including NVIDIA‑based compute environments
Implementing and administering HPC scheduling and resource‑management systems (e.g., Slurm), including GPU partitioning, workload scheduling, and capacity planning
Architecting and optimising InfiniBand and Ethernet network topologies
Ensuring high availability and resilience through failover strategies, planned maintenance coordination, and proactive risk mitigation
Automating provisioning, configuration, monitoring, and operational workflows across multi‑vendor HPC hardware and software stacks
Monitoring real‑time performance and leading troubleshooting efforts across compute, storage, interconnect, drivers, and node failures, engaging vendor support for critical issues
Incident response: managing node failures, network issues, and driver issues, troubleshooting common problems, and working with vendor support to resolve critical issues
Security and access control: managing user permissions, RBAC, security hardening, and data protection
