Our global High Performance Computing Team is looking to add a Production Engineer in our London office. The scale of our computing environments poses unique challenges in delivering good performance and reliability. Several systems, including compute, scheduling, networks, and large-scale data storage, must integrate seamlessly to support data pipelines and quantitative research. The ideal candidate is a hands-on individual who is highly skilled in the details and nuances of managing Linux environments and has the strong software development background needed to support uniquely customized systems at scale.
What You’ll Do:
* Design, implement, maintain, and support high performance compute and storage systems
* Implement and support performance monitoring and fault monitoring systems
* Monitor systems and storage performance, up to and including network components
* Build tooling to compile, package, install, and upgrade software and operating system components at scale
* Collaborate with team members and across teams to write code and testing infrastructures spanning both new and existing codebases in multiple programming languages
* Develop and improve systems and user documentation
* Participate in large, coordinated maintenance operations, including during evenings and weekends
* Work on global projects across a wide range of infrastructure
* Collaborate directly with researchers to optimize their use of HPC infrastructure
* Develop and monitor the tools used to maintain a production computing environment
* Provide operational support on a rotating basis and as needed
* Manage relationships with outside vendors, including traveling both domestically and internationally to meet with current and potential vendors
* Adhere to all company cybersecurity and IT policies, including performing all work using only approved hardware and software
* Other duties as assigned or needed
Skills You’ll Need:
* 5+ years of professional experience in high performance computing (HPC), including parallel filesystems (e.g., Lustre, GPFS), batch systems (e.g., Slurm, Grid Engine), and high-performance network interconnects, is a plus but not required
* 5+ years of experience with Linux systems administration
* High proficiency with at least one programming/scripting language (e.g., Go, Python, C)
* Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
* Extensive experience profiling and debugging application stacks (debuggers and profilers)
* Experience with system configuration management tools (SaltStack, Ansible, Puppet, etc.)
* A compulsion to perform root cause analysis
* Reliable and predictable availability
Benefits include:
* Private Medical, Vision and Dental Insurance
* Travel Medical Insurance
* Group Pension Scheme
* Group Life Assurance and Income Protection Schemes
* Paid Parental Leave
* Parking and Commuter Benefits