HPC Platform Engineer
The firm is developing a cutting-edge high-performance computing (HPC) platform to support our portfolio managers, developers, quantitative analysts, and data scientists, enabling them to scale compute capacity seamlessly both on-premises and in the cloud. We seek a senior, hands-on engineer who is customer-focused and an advocate for customer-driven solutions. The ideal candidate will have a strong understanding of physical and cloud-based infrastructure, experience automating infrastructure, proficiency in service and infrastructure lifecycle management, and experience tuning Linux for demanding workloads. They will engage with teams to understand their requirements, drive development of our HPC platforms, and collaborate with other teams on integration. The candidate should also have expertise in Linux systems administration, container orchestration, networking, security, and infrastructure-as-code. Experience integrating HPC with storage and data platforms, and testing and optimizing that integration, is also essential.
Principal Responsibilities
· Collaborate within a customer-focused team to design, develop, test, and deploy HPC infrastructure in alignment with business needs.
· Serve as a Subject Matter Expert (SME) for our Platform offering and customer needs
· Customize and develop home-grown or off-the-shelf solutions to meet customer needs
· Optimize the performance, reliability, and delivery of the HPC platform
Qualifications/Desired Skills
· Deep understanding of Linux operating systems, with substantial practical experience in performance tuning and resource fencing, specifically related to HPC workloads.
· Deep working knowledge of Linux performance tuning for high-throughput or high-performance computing
· Proficiency in programming languages for automation and tooling.
· Experience with HPC cluster schedulers such as Slurm, Grid Engine, Moab, or PBS
· Deep working knowledge of containers and container orchestration
· Experience contributing to and collaborating on a shared code base
· Experience with configuration management and automation tools such as Chef, Ansible, Salt, and Packer
· Experience with building monitoring and alerting on logs and metrics
· Excellent written and verbal communication skills
· Excellent troubleshooting and analytical skills
· Self-starter able to execute independently, on a deadline, and under pressure