Platform Engineer – HPC, AI and ML
Up to £80,000 plus benefits
Onsite – Kensington, London
Company and Role
This is an opportunity to join a global technology and AI solutions provider delivering some of the most advanced computing platforms in the world. You will play a leading role in the design, build and long-term support of a next-generation AI and Machine Learning platform, built on cutting-edge High Performance Computing (HPC) infrastructure for one of the UK’s most prestigious research environments.
As a Platform Engineer – HPC, AI and ML, you will be responsible for building and optimising a high-performance platform purpose-built for AI, ML, LLM and Generative AI workloads. You will lead on architecture, deployment and performance tuning using technologies such as Kubernetes, NVIDIA Run:AI, Ubuntu, Weka NeuralMesh and HGX B200 GPU nodes.
Once live, you will take ownership of the platform’s operation and evolution, ensuring it delivers consistent world-class performance for advanced research workloads. This is a rare opportunity to build a complex HPC environment from the ground up and then own it, ensuring it continues to power the next generation of AI-driven innovation.
Why This Role Stands Out
• Be part of one of the UK’s most advanced AI and HPC platform projects
• Build and then support a world-class infrastructure enabling AI, ML, LLM and Generative AI research
• Collaborate with global technology leaders including NVIDIA, HPE, Canonical and Weka
• Onsite role in Kensington, London within a pioneering research and innovation environment
• Salary up to £80,000 with excellent opportunities for growth in HPC and AI infrastructure engineering
What You’ll Be Doing
• Designing, deploying and configuring a complete AI and ML operations platform within a large-scale HPC environment
• Installing and optimising Ubuntu (Canonical) across GPU and non-GPU compute nodes
• Implementing and managing Kubernetes for container orchestration and performance at scale
• Installing and configuring NVIDIA GPU Operator, Network Operator and Run:AI orchestration platform
• Integrating Run:AI with Kubernetes clusters to deliver scalable GPU utilisation
• Supporting deployment of HGX B200 GPU nodes (96 NVIDIA B200 GPUs) and associated infrastructure
• Managing Weka NeuralMesh distributed AI storage for high-speed data access and resilience
• Implementing CI/CD and MLOps pipelines using Argo Workflows, Jenkins and GitHub
• Monitoring platform performance using Zabbix, Prometheus and Grafana
• Integrating SAN and InfiniBand networking to achieve high throughput and reliability
• Creating detailed documentation and performing knowledge transfer to operations teams
• Providing ongoing platform support, patching, troubleshooting and continuous improvement
What You’ll Bring
• Proven experience designing, deploying and supporting HPC or large-scale compute environments for AI and ML workloads
• Strong understanding of Ubuntu server administration, networking and performance tuning
• Hands-on experience with Kubernetes and GPU-enabled workloads
• Practical knowledge of NVIDIA GPU technologies, particularly GPU Operator and Run:AI
• Familiarity with distributed storage and AI data systems such as Weka NeuralMesh
• Experience with CI/CD and MLOps pipelines using Argo, Jenkins or GitHub
• Knowledge of HPC networking including SAN and InfiniBand integration
• Strong troubleshooting and documentation skills with a collaborative mindset
Desirable Experience
• Certifications in Kubernetes, NVIDIA or HPC infrastructure technologies
• Experience in research, academic or scientific computing environments
• Understanding of AI and ML workflows, neural network training and large language models
• Familiarity with HPE, NVIDIA, Aarna and Digital Realty platforms
If you are passionate about building and operating large-scale computing environments and want to play a key role in delivering one of the UK’s most advanced HPC and AI platforms, this is your opportunity to shape the future of research and machine learning infrastructure.