Service desk operations engineer

Guernsey
asobbi
Operations engineer
Posted: 16h ago
Offer description

About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they're quickly becoming a powerhouse in the US and Europe's AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

System Maintenance and Performance Optimization
Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems.
Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support
Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring
Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.

Troubleshooting and Client Support
Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

Support the ongoing development of internal HPC test environments and customer POCs.
Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
Provide technical documentation, training, and mentorship to junior team members.
