Salary: Competitive + package (depending on experience)
Type: Full-time
A leading consulting and technology organisation is looking to hire a number of HPC Systems Administrators / Consultants to join a growing High Performance Compute operations team supporting next-generation AI infrastructure projects across the UK.
This role will focus on the design, deployment and operation of high-density compute environments, supporting advanced GPU clusters and AI model training platforms. The successful candidate will work with cutting-edge compute stacks and play a key role in enabling high-performance AI workloads. Due to the nature of the work, this role will involve working in secure and sensitive environments.
Key Responsibilities
* Design, deploy and manage HPC infrastructures, including GPU clusters and parallel computing environments
* Support AI model training platforms by maintaining compute resources and optimising workload scheduling
* Monitor, analyse and optimise system performance, identifying bottlenecks and improving efficiency
* Develop and maintain automation scripts and operational tooling (Python, PowerShell, Bash)
* Maintain clear documentation covering architecture, configurations, operational procedures and incident resolution
* Support incident management processes, including root cause analysis and post-incident reviews
* Work closely with cross-functional teams to ensure reliability, performance and security across HPC environments
Required Experience
* Strong experience working in High Performance Computing (HPC) environments
* Experience managing GPU clusters (e.g. NVIDIA or AMD)
* Familiarity with workload schedulers such as SLURM or PBS
* Experience supporting AI/ML model training frameworks and toolkits such as TensorFlow, PyTorch or CUDA
* Solid understanding of Linux and Windows server environments, networking and storage platforms
* Strong troubleshooting and performance optimisation skills within compute-heavy environments
* Experience with automation, scripting and monitoring tools (Python, PowerShell, Bash)
* Excellent communication skills and ability to work with both technical and non-technical stakeholders