Overview
We are seeking a passionate HPC engineer. The ideal candidate will have extensive hands-on experience delivering HPC services to a high quality, and be able to relate to the scientific community and work closely with users to make the best use of research computing services. The HPC landscape is continually evolving; you will help build and operate industry-leading capabilities, including application build frameworks, containerised applications and cloud software-as-a-service. Automated deployment is a key feature, and you will be comfortable with DevOps processes and delivering consistency through automation and infrastructure-as-code.
Key Responsibilities
* Design, implement, and maintain robust platform infrastructure using Infrastructure as Code (IaC) tools such as Terraform, ensuring secure and scalable environments in our private cloud ecosystem.
* Develop, deliver and operate research computing services and applications.
* Take a Site Reliability Engineering approach to HPC services, managing development, deployment, monitoring and incident response end-to-end.
* Solve complex technical problems, both with the HPC services themselves and with users’ use of them.
Essential Knowledge, Skills, and Experience
* Hands-on experience operating, crafting or engineering large-scale computing environments, such as HPC, HTC or similar clusters
* Ability to drive innovative computational solutions and exploit emerging technologies
* Experience in administration of large-scale cluster and server computing and related technologies
* Experience with workload managers and job schedulers (e.g. Slurm, LSF, Grid Engine)
* Hands-on experience working in a DevOps team and using agile methodologies
* Experience operating and consuming virtualized private cloud resources (e.g. OpenStack)
* Understanding of Linux system administration, the TCP/IP stack, and storage subsystems
* Experience in implementing and administering large-scale parallel filesystems (e.g. GPFS, Lustre)
* Proven experience using configuration management tools (e.g. Ansible, Salt, Puppet) and technology frameworks in IT operations
* Experience of developing and managing relationships with 3rd party suppliers
* Scripting and tool development for HPC & DevOps style platform operations using Bash and Python
Desirable Skills and Knowledge
* Scientific degree or experience in computationally intensive analysis of scientific data
* Previous experience in HPC environments, especially at large scales (>10,000 cores)
* Experience with public cloud infrastructure (AWS, Azure, GCP) is a plus
* Managing a virtualized private cloud environment (e.g. OpenStack) is a plus
* Container technologies (LXD, Singularity, Docker, Kubernetes) are a plus
* Development experience with multiple programming languages/tools (Java/C++, Python/Ruby/Perl, SQL, AWS) is a plus
* Experience with HashiCorp tools (Terraform, Vault, Consul, Nomad) is a plus
* Experience with high-speed networks (e.g. InfiniBand) is a plus
Seniority level
* Mid-Senior level
Employment type
* Contract
Job function
* Engineering and Consulting
Industries
* IT Services and IT Consulting