About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPU-as-a-Service (GPUaaS), delivering scalable, enterprise-grade AI infrastructure at unmatched efficiency. With deep ties to Nvidia, they're quickly becoming a powerhouse in the US and European AI and ML ecosystem, providing solutions for HPC, AI, and deep learning workloads.
As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.
System Maintenance and Performance Optimization
Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems.
Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.
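To give a flavor of the orchestration work involved: scheduling an AI/ML workload onto a GPU node in Kubernetes hinges on requesting the `nvidia.com/gpu` extended resource exposed by the Nvidia device plugin. Below is a minimal illustrative sketch in Python that builds such a Pod manifest as a dict; the function name, workload name, and container image are hypothetical examples, not part of this role's actual tooling.

```python
def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Kubernetes Pod manifest (as a dict) that requests
    Nvidia GPUs via the device plugin's 'nvidia.com/gpu' extended resource.
    Illustrative sketch only -- names and image are hypothetical."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # Requesting the GPU extended resource pins the pod
                # to a GPU-equipped node with free devices.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }

# Hypothetical 8-GPU training job:
manifest = gpu_pod_manifest("train-job", "nvcr.io/nvidia/pytorch:latest", gpus=8)
```

The same dict could be serialized to YAML and applied with `kubectl`; on K3s the flow is identical once the device plugin is installed.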
Networking and Infrastructure Support
Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.
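Day-to-day fabric troubleshooting of the kind described above often starts with checking `ibstat` output across nodes. As a rough sketch (assuming the standard `ibstat` field layout of `State:`, `Physical state:`, and `Rate:` per port; the function name and threshold are hypothetical), a health check might look like:

```python
import re

def check_ib_ports(ibstat_output, min_rate=100):
    """Scan `ibstat`-style output and flag ports that are not Active/LinkUp
    or are running below an expected rate in Gb/s. Illustrative sketch;
    assumes State/Physical state/Rate appear in that order per port."""
    problems = []
    port = state = phys = None
    for line in ibstat_output.splitlines():
        line = line.strip()
        if m := re.match(r"Port (\d+):", line):
            port, state, phys = m.group(1), None, None
        elif line.startswith("State:"):
            state = line.split(":", 1)[1].strip()
        elif line.startswith("Physical state:"):
            phys = line.split(":", 1)[1].strip()
        elif line.startswith("Rate:"):
            rate = int(line.split(":", 1)[1].strip())
            # Rate is the last field we need, so evaluate the port here.
            if state != "Active" or phys != "LinkUp" or rate < min_rate:
                problems.append((port, state, phys, rate))
    return problems
```

Run across a cluster (e.g., via pdsh or Ansible), this kind of check surfaces degraded links before they show up as stragglers in collective operations.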
Security, Automation, and Monitoring
Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
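The bottleneck-hunting described above often reduces to spotting GPUs that sit idle while a job runs. A minimal sketch of that logic (the node names and 30% threshold are hypothetical; in practice the samples would come from a Prometheus query against the DCGM exporter rather than a hard-coded dict):

```python
from statistics import mean

def flag_underutilized(gpu_util, threshold=30.0):
    """Given per-node GPU utilization samples (percent), return the nodes
    whose mean utilization falls below `threshold` -- a common first signal
    of a dataloader, storage, or interconnect bottleneck."""
    return sorted(node for node, samples in gpu_util.items()
                  if mean(samples) < threshold)

# Hypothetical samples scraped over a monitoring window:
samples = {
    "gpu-node-01": [95, 97, 92],   # healthy, compute-bound
    "gpu-node-02": [12, 18, 9],    # likely starved for input data
}
flag_underutilized(samples)  # → ["gpu-node-02"]
```

In a real deployment this check would be expressed as a Prometheus alerting rule or a Grafana threshold panel; the Python form just makes the triage logic explicit.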
Troubleshooting and Client Support
Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
Analyze logs, conduct performance profiling, and debug CUDA, MPI, and RDMA-related issues.
Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.
Support the ongoing development of internal HPC test environments and customer POCs.
Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
Provide technical documentation, training, and mentorship to junior team members.
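For the log-analysis side of this role, triage often begins by grepping cluster logs for known failure signatures. The sketch below illustrates the idea; the regexes are rough approximations of driver Xid messages, NCCL warnings, and Mellanox link events, not authoritative patterns, and the dict keys are hypothetical labels.

```python
import re

# Approximate signatures for common GPU-cluster failure modes.
# These patterns are illustrative, not exact log formats.
SIGNATURES = {
    "gpu_xid": re.compile(r"NVRM: Xid \(.*?\): (\d+)"),
    "nccl_timeout": re.compile(r"NCCL WARN .*timeout", re.IGNORECASE),
    "ib_link_down": re.compile(r"mlx5.*link down", re.IGNORECASE),
}

def triage_log(text):
    """Return which known failure signatures appear in a log excerpt."""
    return {name: bool(pat.search(text)) for name, pat in SIGNATURES.items()}
```

A hit on an Xid signature would then be cross-referenced against Nvidia's Xid error tables to decide whether the fix is a driver reset, an RMA, or an application-level bug.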