Operations support system engineer

Guernsey
asobbi
Systems engineer
Posted: 15h ago
Offer description

About the Company

A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPUaaS, delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they're quickly becoming a powerhouse in the US and Europe's AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.

As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.

System Maintenance and Performance Optimization
Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems.
Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.

Networking and Infrastructure Support
Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.

Security, Automation, and Monitoring
Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.

Troubleshooting and Client Support
Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.

Support the ongoing development of internal HPC test environments and customer POCs.
Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
Provide technical documentation, training, and mentorship to junior team members.
