About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPU-as-a-Service (GPUaaS), delivering scalable, enterprise-grade AI infrastructure with unmatched efficiency. With deep ties to Nvidia, the company is quickly becoming a powerhouse in the US and European AI/ML ecosystems, providing solutions for HPC, AI, and deep learning workloads.
Role Overview
As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.
Key Responsibilities
System Maintenance and Performance Optimization
• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).
• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.
Networking and Infrastructure Support
• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.
Security, Automation, and Monitoring
• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.
Troubleshooting and Client Support
• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug issues related to CUDA, MPI, and RDMA.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.
Collaboration and Process Improvement
• Support the ongoing development of internal HPC test environments and customer proofs of concept (POCs).
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.