About the Company
A rapidly growing cloud provider is redefining high-performance computing with cutting-edge GPU-as-a-Service (GPUaaS), delivering scalable, enterprise-grade AI infrastructure with unmatched efficiency. With deep ties to Nvidia, the company is quickly becoming a powerhouse in the US and European AI/ML ecosystems, providing solutions for HPC, AI, and deep learning workloads.
Role Overview
As the Principal HPC Support Engineer, you will play a pivotal role in maintaining and supporting high-performance computing environments on bare-metal infrastructure, primarily serving clients in research, higher education, and enterprise AI sectors. You will focus on both the software and networking aspects of HPC deployments, ensuring that large-scale GPU clusters remain operational, secure, and optimized for client needs.
Key Responsibilities
System Maintenance and Performance Optimization
• Manage, maintain, and tune bare-metal HPC clusters running Linux-based operating systems (e.g., Fedora, Debian, Ubuntu).
• Optimize Nvidia GPU compute environments, including CUDA, NCCL, and GPU resource management in multi-node HPC clusters.
• Oversee high-speed networking configurations, including InfiniBand (Mellanox), RDMA, and Ethernet fabric tuning for low-latency HPC workloads.
• Configure and fine-tune HPC schedulers (e.g., Slurm, OpenPBS, SGE) for optimal GPU workload distribution.
• Implement containerization strategies (Podman, Docker) and orchestration platforms (K3s, Kubernetes) for managing distributed AI/ML workloads.
Networking and Infrastructure Support
• Configure, monitor, and troubleshoot high-performance network fabrics, ensuring low-latency, high-throughput communication between GPU nodes.
• Deploy and maintain InfiniBand, RoCE, and high-speed Ethernet for HPC and AI clusters.
• Collaborate with networking teams to optimize routing, switching, and load balancing for distributed computing environments.
• Work closely with Nvidia engineers and system architects to implement GPUDirect Storage, NVLink, and Magnum IO for accelerated workloads.
Security, Automation, and Monitoring
• Maintain authentication and authorization systems such as Active Directory, OpenLDAP, and Keycloak.
• Automate system provisioning and configuration using Ansible, Terraform, or other Infrastructure-as-Code tools.
• Monitor system performance using Prometheus, Grafana, and ELK Stack, identifying and resolving bottlenecks in GPU workloads.
• Implement security best practices for multi-tenant HPC clusters, ensuring compliance with industry standards.
Troubleshooting and Client Support
• Serve as the lead technical resource for diagnosing and resolving complex software, networking, and hardware issues in large-scale GPU clusters.
• Analyze logs, conduct performance profiling, and debug issues related to CUDA, MPI, and RDMA.
• Work closely with AI/ML research teams, cloud engineers, and enterprise clients to optimize workload performance.
Collaboration and Process Improvement
• Support the ongoing development of internal HPC test environments and customer proofs of concept (POCs).
• Work cross-functionally with Service Desk, Operations, and Service Delivery Management to ensure seamless service.
• Provide technical documentation, training, and mentorship to junior team members.