Role Purpose
Lead Level 3 support for critical digital services, ensuring high availability, fast incident recovery, and long-term resilience. Drive root cause elimination, design supportable architectures, oversee major changes, and mentor support teams. Ensure alignment with DDaT, DevOps, and Home Office service expectations.
Key Outcomes & Responsibilities
* Major Incident Leadership: Act as technical lead for P1/P2 incidents, coordinating recovery and communication.
* Root Cause Ownership: Lead formal RCAs, define corrective actions, and ensure follow-through via sprints/releases.
* Change & Release Governance: Review technical change plans, lead high-risk deployments, and support hotfix releases.
* Availability & Performance: Improve reliability through proactive monitoring, self-healing automation, and architectural enhancements.
* Environment Strategy: Maintain stable non-production environments and collaborate with environment management teams.
* Service Performance: Drive SLA achievement, service reviews, metrics analysis, and proactive improvements.
* Shift Left & Knowledge Management: Develop high-quality runbooks, automate manual tasks, and train L1/L2 teams.
* Transition Support: Provide documentation, knowledge transfer (KT), and pairing during onboarding and offboarding of support teams.
* Technical Leadership: Mentor engineers and collaborate with product, DevOps, and development teams.
Essential Skills (Must Have)
* Deep expertise in distributed systems, Java, JavaScript, microservices, APIs, and cloud platforms.
* Strong debugging skills using logs, metrics, traces, and profiling tools.
* Experience with CI/CD tooling and release management.
* Strong scripting and automation capabilities.
* Ability to lead technical bridges under pressure.
Desirable Skills (Nice To Have)
* Advanced cloud knowledge (AWS professional level).
* Experience with container orchestration (Kubernetes, ECS, AKS).
* Knowledge of reliability engineering practices (SRE).
* Experience improving infrastructure via IaC.
* Ability to contribute to architecture decisions.
Experience Profile
* 5 to 10 or more years in Level 3 support, DevOps engineering, or SRE roles.
* Significant experience managing critical systems with high availability requirements.
* Proven leadership in major incidents and change governance.
Ways of Working
* Operates within Agile product teams with DevOps principles.
* Leads service reviews, problem boards, and continual improvement cycles.
* Coaches and mentors engineering teams.
Location & Security
UK-based, with hybrid working as agreed with the client. SC (Security Check) eligibility is required.
Certification (Preferred)
* AWS or Azure professional-level certification
* SRE or DevOps Practitioner Certifications