Overview
Job Title: Senior Recovery Lead and Global Head of Service Reliability
Location: Sheffield (Hybrid)
6-month Contract
Service Management (SM)
Service Management’s purpose is to protect the availability, integrity and confidentiality of the IT Services that underpin customers’ and colleagues’ experience of the brand. It is a multi-functional team comprising Change Management, Incident Management, Problem Management, Service Level Management, Outage Management, Service Recovery, and Service Insights and Reporting.
About the Role
We are seeking a senior technology leader to take on the dual role of Senior Recovery Lead and Global Head of Service Reliability. This is a highly visible, high-impact position reporting to the Global Head of Service Management, with a mandate to transform how we recover from incidents and build long-term service resilience.
This individual will lead a global team of technical experts who act as technical escalation partners during major incidents, helping reduce time to recover (TTR) through deep technical engagement, coordination, and engineering-driven solutions. Beyond recovery, this leader will also own the strategic and tactical roadmap for building reliable, self-healing systems through collaboration with Problem Management, SRE, and Platform teams.
Key Responsibilities
* Lead a global, follow-the-sun team of technical escalation partners for major incidents.
* Partner with Incident Managers and Service Owners to accelerate incident diagnosis and resolution, reducing TTR and restoring services quickly and safely.
* Bring calm, coordination, and engineering clarity to high-pressure recovery efforts.
* Collaborate with Problem Managers, Product SRE, and Platform Engineering teams to identify and eliminate systemic causes of major incidents.
* Own and drive long-term remediation plans, including automation, reliability engineering, and platform guardrails to reduce future risk.
* Track and govern follow-up actions to ensure completeness, accountability, and measurable reduction in incident recurrence.
Service Reliability Engineering Strategy
* Define and implement strategies for resilience engineering, including self-healing capabilities, automation of recovery workflows, and risk mitigation patterns.
* Advocate for operational excellence by embedding reliability standards, testing practices, and continuous improvement processes into engineering workflows.
* Partner with Architecture and Engineering leaders to influence system design with reliability in mind.
* Own the global incident scenario planning framework, ensuring that Technology is prepared to recover from widespread, complex failures.
* Design and run mass recovery simulations, chaos testing, and resilience drills to expose weaknesses and improve readiness.
* Work with regional and global risk teams to align with regulatory and operational resilience requirements.
Leadership, Influence & Culture
* Build, scale, and lead a high-performing global team with deep technical skills and a culture of urgency, ownership, and collaboration.
* Drive a blameless, learning-focused culture that emphasizes root cause thinking, accountability, and continuous improvement.
* Act as a trusted partner and thought leader across Engineering, Infrastructure, Risk, and Service Management functions.
Qualifications & Experience
* 12+ years in Technology, with proven experience in Site Reliability Engineering, Infrastructure, DevOps, or Technical Operations.
* Demonstrated experience leading global technical teams in complex, high-scale environments.
* Deep expertise in incident recovery, automation, systems design, and platform reliability.
* Strong working knowledge of problem management, root cause analysis frameworks, and resilience engineering principles.
* Experience designing and running resilience exercises, chaos engineering, or incident scenario testing at scale.
* Comfortable operating in regulated environments and partnering with Risk and Compliance functions.
* Excellent stakeholder management and communication skills, with the ability to lead through influence at senior levels.
Core Competencies
* Technical Depth – Ability to dive deep across infrastructure, applications, and cloud-native architectures.
* Recovery Leadership – Skilled in coordinating technical resources under pressure to resolve incidents rapidly.
* Reliability Thinking – Strategic mindset focused on system robustness, automation, and prevention.
* Change Agent – Drives cultural and engineering change to improve stability and accountability.
* Cross-Functional Collaboration – Adept at aligning goals and actions across engineering, operations, and risk domains.