Operations & Support Engineer (HPC)

Newcastle Upon Tyne (Tyne and Wear)
asobbi
Support engineer
Posted: 18h ago
Offer description

About the Company

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they’re quickly becoming a powerhouse in the US and Europe’s AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.


Role Overview

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.


Key Responsibilities


System Maintenance and Performance Optimization

• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).

• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.

• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.

• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.

• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.


Networking and Infrastructure Support

• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.

• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.

• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.

• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.


Security, Automation, and Monitoring

• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.

• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.

• Monitor system performance using Prometheus, Grafana, and the ELK Stack, identifying and resolving bottlenecks in GPU workloads.

• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.


Troubleshooting and Client Support

• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.

• Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.

• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.


Collaboration and Process Improvement

• Support the ongoing development of internal HPC test environments and customer POCs.

• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.

• Provide technical documentation, training, and mentorship to junior team members.

