Overview
The Digital and Directorate has primary responsibility for scientific computing and research computing services and support. The key functions of the Digital Development and Operations unit are to provide and support the platforms required by UKHSA staff, and to provide the technical capabilities that enable public health services, both within the Organisation and between the Organisation and its customers and stakeholders.
We are seeking a highly motivated and experienced Site Reliability Engineer (SRE) to join our High Performance Computing (HPC) & SRE team. The role will be critical in ensuring the stability, scalability and performance of our services, combining software engineering and systems engineering to build, improve and run reliable, scalable production systems.
The role will report to the Principal Specialist Engineer SRE and is part of the HPC/SRE/AI & research computing unit.
Key Responsibilities
* Architect, develop and manage multi‑cloud HPC platforms and on‑premise infrastructure
* Ensure services are highly available, scalable, resilient, stable, and perform optimally
* Manage performance, capability and capacity planning, using automation to ensure current and future workloads are supported
* Support UKHSA's AI requirements
* Respond to incidents, troubleshoot issues and restore services promptly
* Lead root cause analysis and post‑mortems, implementing lessons learned to prevent recurrence
* Prioritise operational service improvements to meet or increase SLOs and minimise downtime
* Ensure effective monitoring/alerting is in place to proactively identify issues using tools and dashboards, reducing response times and alert fatigue
* Design and implement monitoring, alerting and observability systems to detect issues before they impact users
* Develop automation, through IaC and scripting, to streamline repeatable tasks, reduce manual intervention and improve operational efficiency
* Write maintainable, clear, concise, and well‑tested code to support automation efforts and system tooling
* Optimise system performance by identifying bottlenecks with an engineering mindset
* Improve services through observability and identify ways to enhance observability practices
* Define and evangelise SRE principles across the organisation, educating stakeholders to adopt these practices
* Maintain accurate technical documentation, runbooks and post‑incident reports; provide training and mentorship to engineering teams on best practices and tools
* Work closely with engineering, DevOps and infrastructure teams to streamline deployment and operational workflows, and to promote a culture of shared responsibility for service reliability
Application Process
You will be required to complete an application form. You will be assessed on the 9 listed Essential Criteria, using the following:
* Application form (details must be entered in the "Employer/Activity history" section)
* 1000‑word Statement of Suitability & Technical Statements (must not exceed 1000 words, despite the overall word limit of 1500)
Longlisting & Shortlisting
After reviewing applications, candidates will be grouped as follows:
* Meets all essential criteria – only these candidates proceed to shortlisting
* Meets some essential criteria – these are not shortlisted
* Meets no essential criteria – these are not shortlisted
Interview Process
* Remote interview
* Success Profiles assessment focusing on Behaviours and Technical Skills
* Technical test and presentation (5‑minute) on either: "Design a highly available and scalable service" or "Automate a complex operational process"
* Behavioural questions: Changing and Improving; Delivering at Pace; Managing a Quality Service; Working Together
Additional Requirements & Policies
* Knowledge of SRE principles, incident management, system design, automation/coding, Linux & networking
* Police check: if you have spent more than 6 months abroad in the last 3 years, an International Police Check is required
* Artificial Intelligence: any use of AI in your application must be truthful and drawn from your own experience; plagiarism will lead to the application being withdrawn
* Security clearance: criminal record check, Baseline Personnel Security Standard (BPSS) and Counter Terrorism Check (CTC)
* Nationality requirements: open to UK nationals; citizens of the Republic of Ireland; Commonwealth citizens; nationals of the EU, Switzerland, Norway, Iceland or Liechtenstein, and family members, with settled or pre‑settled status; Turkish nationals and family members with an accrued right; and certain individuals with limited or indefinite leave to remain who had the right to apply for the EU Settlement Scheme before 31 December 2020
Salary & Benefits
* Salary (Grade 7): National £56,185 – £66,581; Outer London £58,340 – £68,574; Inner London £60,494 – £70,566
* Market Pay Supplement: £5,000 – £10,000
* UK Health Security Agency contributes £16,276 to the Civil Service Defined Benefit Pension scheme (28.97% employer contribution)
* Learning and development tailored to your role
* Flexible working options, including hybrid working at core HQs (Birmingham, Leeds, Liverpool, London Canary Wharf) and scientific campuses (Chilton, Colindale, Porton)
* Culture encouraging inclusion and diversity
* Effective support for the minimum 60% of contractual hours (≈3 days a week) at a UKHSA scientific campus for those based there
Work Location & Flexibility
Hybrid working is central to this role, with a minimum of 60% of working hours (averaged over a month) spent at a UKHSA scientific campus (Colindale, Porton or Chilton). For staff based at a campus, a Counter Terrorism Check is the minimum level of clearance required.
Future Considerations
The job description and person specification may be reviewed on an ongoing basis in accordance with the changing needs of the organisation.