Overview
Are you ready to lead high-performance computing infrastructure that accelerates discovery and brings life-changing medicines to patients faster? This is your opportunity to set the direction for a mission-critical platform that underpins data science, computational biology and AI/ML workloads across the enterprise. You will steer a modern HPC environment spanning on-premises and cloud, enabling scientists to run at scale and speed while maintaining reliability and cost efficiency. Do you thrive at the intersection of engineering rigor and scientific impact, where your decisions unlock faster insights and smarter experimentation? In this hands-on leadership role, you will be empowered to take ownership, challenge the status quo and orchestrate new possibilities—partnering closely with researchers, pushing boundaries in hackathons and ensuring our platform evolves with the demands of cutting-edge science.
Responsibilities
* Platform Roadmap: Define and own the strategic roadmap for the HPC infrastructure, aligning capacity, architecture and capabilities to scientific priorities and business outcomes.
* Operational Excellence: Drive continuous improvement of platform stability, efficiency and performance; set clear metrics and ensure reliability for large-scale workloads and time-critical studies.
* Hybrid Delivery: Lead delivery across on-premises and cloud environments, optimizing for speed, scalability and cost while ensuring seamless user experience.
* Scientific Partnership: Work with scientific users to understand their needs and translate them into robust solutions, enabling faster models, simulations and analyses.
* Technology Foresight: Scan the horizon to identify emerging technologies that keep the platform innovative and competitive, and guide timely adoption.
* Backlog Leadership: Prioritize the team's work based on scientific impact, balancing quick wins with longer-term investments to maximize value.
* People Development: Mentor and coach engineers, build capabilities and foster a high-performance, collaborative engineering culture.
* Incident Leadership: Investigate and resolve complex operational incidents, lead root-cause analysis and implement preventative improvements that strengthen the platform.
Essential Skills/Experience
* Defining the roadmap for the platform's HPC infrastructure
* Driving continuous improvement of the stability and efficiency of the platform
* Ensuring delivery of team objectives, both on-premises and in the cloud
* Working with scientific users to understand their needs and develop solutions
* Horizon scanning, identifying the future technologies needed to stay innovative
* Prioritizing the work backlog for the team according to scientific needs
* Mentoring and coaching engineers
* Investigating and resolving complex operational incidents
* Experience with HPC schedulers and workload managers (e.g., Slurm, PBS, Grid Engine) and job orchestration at scale
* Hands-on knowledge of cloud-based HPC services and architectures (e.g., Azure, AWS, GCP), hybrid networking and cost optimization
* Containerization and orchestration for scientific workloads (e.g., Docker, Singularity, Kubernetes)
* Infrastructure as Code and automation (e.g., Terraform, Ansible, CI/CD), plus strong scripting skills in Python and Bash
* Performance tuning and profiling of compute and storage, including GPUs, accelerators and parallel filesystems (e.g., Lustre, Spectrum Scale)
* Workflow management tools (e.g., Nextflow, Snakemake) and data pipelines supporting AI/ML and bioinformatics
* Robust approach to reliability engineering, observability and security/compliance for sensitive research data
* Proven stakeholder engagement across research, engineering and product teams, with the ability to balance speed, quality and sustainability