Senior Recovery Lead and Head of Service Reliability
Brand: HSBC
Area of Interest: Technology
Location: Sheffield, GB, S1 4NB
Work style: Office Worker
Date: 21 May 2025
Join a digital-first bank that’s powered by people.
Our technology team builds innovative digital solutions rapidly and at scale to deliver the next generation of banking services for our customers around the world.
Service Management’s purpose is to protect the availability, integrity, and confidentiality of the IT services that underpin customers' and colleagues' experience of the HSBC brands. It is a multi-functional team comprising Change Management, Incident Management, Problem Management, Service Level Management, Outage Management, Service Recovery, and Service Insights and Reporting.
We are seeking a senior technology leader to take on the dual role of Senior Recovery Lead and Global Head of Service Reliability. This is a highly visible, high-impact position reporting to the Global Head of Service Management, with a mandate to transform how we recover from incidents and build long-term service resilience.
This individual will lead a global team of technical experts who act as escalation partners during major incidents, helping to reduce time to recover (TTR) through technical engagement, coordination, and engineering-driven solutions. Beyond recovery, this leader will also own the strategic and tactical roadmap for building reliable, self-healing systems in collaboration with Problem Management, SRE, and Platform teams.
Job Responsibilities:
1. Incident Recovery Leadership:
   * Lead a global, follow-the-sun team that acts as the technical escalation point during major incidents.
   * Partner with Incident Managers and Service Owners to accelerate diagnosis and resolution, reducing TTR.
   * Bring calm, coordination, and engineering clarity to high-pressure recovery efforts.
2. Systemic Cause Elimination: Collaborate with Problem Managers, SRE, and Platform Engineering teams to identify and eliminate systemic causes of incidents.
3. Remediation Plans: Own and drive long-term plans including automation, reliability engineering, and platform guardrails.
4. Follow-up Actions: Track and govern actions to ensure accountability and reduce incident recurrence.
Service Reliability Engineering Strategy:
1. Define and implement resilience strategies, including self-healing capabilities and automation.
2. Embed operational excellence into engineering workflows.
3. Influence system design with reliability in mind.
Incident Scenario Planning: Own the incident scenario framework, conduct resilience drills, and ensure readiness for complex failures.
Leadership & Culture:
1. Build and lead a high-performing team with technical expertise and a culture of ownership.
2. Promote a blameless, learning-focused environment.
3. Act as a trusted partner across functions.
Qualifications & Skills:
* Experience in Site Reliability Engineering, Infrastructure, DevOps, or Technical Operations.
* Leadership of global technical teams in high-scale environments.
* Expertise in incident recovery, automation, and systems design.
* Knowledge of problem management, root cause analysis, and resilience principles.
* Experience with resilience exercises and chaos engineering.
* Comfort in regulated environments and with Risk and Compliance.
* Excellent stakeholder management and communication skills.
* Technical depth across infrastructure, applications, and cloud-native architectures.
* Proven recovery leadership and strategic resilience thinking.
* Ability to drive cultural and engineering change.
* Adept at cross-functional collaboration.
This role is based in Sheffield.
We value diversity and inclusion and are committed to creating accessible workplaces. If you require accommodations during the recruitment process, please contact our Recruitment Helpdesk.