Platform Engineer – HPC, AI and ML
Up to £80,000 plus benefits
Onsite – Kensington, London
Company and Role
This is an opportunity to join a global technology and AI solutions provider delivering some of the most advanced computing platforms in the world. You will play a leading role in the design, build and long-term support of a next-generation AI and Machine Learning platform, built on cutting-edge High Performance Computing (HPC) infrastructure for one of the UK’s most prestigious research environments.
As a Platform Engineer – HPC, AI and ML, you will be responsible for building and optimising a high-performance platform purpose-built for AI, ML, LLM and Generative AI workloads. You will lead on architecture, deployment and performance tuning using technologies such as Kubernetes, NVIDIA Run:AI, Ubuntu, Weka NeuralMesh and HGX B200 GPU nodes.
Once live, you will take ownership of the platform’s operation and evolution, ensuring it delivers consistent world-class performance for advanced research workloads. This is a rare opportunity to build a complex HPC environment from the ground up and then own it, ensuring it continues to power the next generation of AI-driven innovation.
Why This Role Stands Out
• Be part of one of the UK’s most advanced AI and HPC platform projects
• Build and then support a world-class infrastructure enabling AI, ML, LLM and Generative AI research
• Collaborate with global technology leaders including NVIDIA, HPE, Canonical and Weka
• Onsite role in Kensington, London within a pioneering research and innovation environment
• Salary up to £80,000 with excellent opportunities for growth in HPC and AI infrastructure engineering
What You’ll Be Doing
• Designing, deploying and configuring a complete AI and ML operations platform within a large-scale HPC environment
• Installing and optimising Ubuntu (Canonical) across GPU and non-GPU compute nodes
• Implementing and managing Kubernetes for container orchestration and performance at scale
• Installing and configuring NVIDIA GPU Operator, Network Operator and Run:AI orchestration platform
• Integrating Run:AI with Kubernetes clusters to deliver scalable GPU utilisation
• Supporting deployment of HGX B200 GPU nodes (96 NVIDIA B200 GPUs) and associated infrastructure
• Managing Weka NeuralMesh distributed AI storage for high-speed data access and resilience
• Implementing CI/CD and MLOps pipelines using Argo Workflows, Jenkins and GitHub
• Monitoring platform performance using Zabbix, Prometheus and Grafana
• Integrating SAN and InfiniBand networking to achieve high throughput and reliability
• Creating detailed documentation and performing knowledge transfer to operations teams
• Providing ongoing platform support, patching, troubleshooting and continuous improvement
What You’ll Bring
• Proven experience designing, deploying and supporting HPC or large-scale compute environments for AI and ML workloads
• Strong understanding of Ubuntu server administration, networking and performance tuning
• Hands-on experience with Kubernetes and GPU-enabled workloads
• Practical knowledge of NVIDIA GPU technologies, particularly GPU Operator and Run:AI
• Familiarity with distributed storage and AI data systems such as Weka NeuralMesh
• Experience with CI/CD and MLOps pipelines using Argo, Jenkins or GitHub
• Knowledge of HPC networking including SAN and InfiniBand integration
• Strong troubleshooting and documentation skills with a collaborative mindset
Desirable Experience
• Certifications in Kubernetes, NVIDIA or HPC infrastructure technologies
• Experience in research, academic or scientific computing environments
• Understanding of AI and ML workflows, neural network training and large language models
• Familiarity with HPE, NVIDIA, Aarna and Digital Realty platforms
If you are passionate about building and operating large-scale computing environments and want to play a key role in delivering one of the UK’s most advanced HPC and AI platforms, this is your opportunity to shape the future of research and machine learning infrastructure.