Description
Own and evolve a Proactive Resilience product/capability that anticipates, prevents, and mitigates technology and service disruption. You’ll translate resilience outcomes (availability, recoverability, performance, operational readiness) into a clear product roadmap, measurable value, and repeatable adoption across platforms and teams.
Key responsibilities include:
Product strategy & roadmap
- Define product vision, target users and a prioritised roadmap aligned to business services.
- Maintain a clear backlog of resilience features.
Outcome-driven delivery
- Set OKRs/KPIs for proactive resilience.
- Run a Community of Practice to surface potential resilience improvements, then track and prioritise them via a backlog.
Resilience-by-design
- Embed resilience enhancements into SDLC and change processes (non-functional requirements, release readiness, operational acceptance).
- Champion practices such as chaos engineering, game days, fault injection, capacity and performance testing, and DR readiness.
Observability & insights
- Partner with monitoring/observability teams to improve telemetry, alert quality, and actionable dashboards.
- Use data to identify systemic risks, recurring failure modes, and “top offenders” across services.
Automation & operational excellence
- Prioritise automation for detection, triage, and remediation.
Stakeholder management
- Align engineering, operations, architecture, risk, and business stakeholders on resilience priorities.
- Communicate progress and risk clearly to senior leadership; manage dependencies and delivery risks.
Governance & controls
- Ensure the product supports relevant operational resilience expectations (e.g. impact tolerances, testing evidence, auditability).
- Maintain documentation, controls evidence, and reporting suitable for risk and assurance audiences.
Required experience & skills
Product ownership/management experience in platform, SRE, or operational resilience domains.
Strong understanding of:
- Operational Resilience
- SRE principles (SLO/SLI), incident/problem management, and service management.
- Resilience patterns (redundancy, graceful degradation).
- DR/BCP concepts (RTO/RPO), high availability, and dependency management.
Data-driven decision-making: ability to use incident, change, and telemetry data to prioritise.
Agile delivery expertise (Scrum/Kanban), backlog management, and stakeholder communication.
Desirable
Familiarity with resilience patterns and platform engineering.
Experience running game days/chaos experiments and translating findings into engineering work.
Financial services experience and comfort working with risk, compliance, and audit partners.
Skills
1. Product Ownership
2. Product Management
3. Operational Resilience
4. Technology
5. Disaster Recovery
6. Resilience
7. Proactive Resilience
8. Product Roadmapping
9. SRE Principles
10. SLO
11. SLI
12. Incident Management
13. Problem Management
14. Service Management
15. DR
16. BCP
17. RTO
18. RPO
19. Dependency Management
Job Title: Product Owner - Operational Resilience
Location: Sheffield, UK
Job Type: Contract
Trading as TEKsystems. Allegis Group Limited, Maxis 2, Western Road, Bracknell, RG12 1RT, United Kingdom. No. 2876353. Allegis Group Limited operates as an Employment Business and Employment Agency as set out in the Conduct of Employment Agencies and Employment Businesses Regulations 2003. TEKsystems is a company within the Allegis Group network of companies (collectively referred to as "Allegis Group"). Aerotek, Aston Carter, EASi, Talentis Solutions, TEKsystems, Stamford Consultants and The Stamford Group are Allegis Group brands.