The CoE Lead - Observability & Tools at JD Sports Fashion Plc is a critical, hands-on technical role focused on designing, building, and maintaining the company's Observability platform. The role ensures that our technology platforms operate efficiently and reliably, providing early insights for Engineering, Service Reliability, Service Delivery, and DevOps teams.
The CoE Lead will manage the contract with third-party providers responsible for the execution layer, ensuring adherence to service-level agreements (SLAs) and key performance indicators (KPIs). The position involves a 75% focus on the design of frameworks and a 25% focus on implementation and adoption.
· Job Title – Centre of Excellence Lead - Observability & Tooling
· Location – BL9 8RR
· Working rota – Monday to Friday
· Working hours – 40
What You'll Be Doing:
We are looking for an experienced CoE Lead to design, build, and maintain our Observability platform. The CoE Lead will work closely with DevOps, Engineering, Service Reliability, and Service Delivery teams to continuously improve our Observability capabilities.
This role is a technical, hands-on position with a 75% focus on framework design and 25% on implementation and adoption.
You will contribute to pipeline design, enabling observability from the first deployment in test environments and providing early insights for Engineering, Service Reliability, Service Delivery, and DevOps teams. The role involves building frameworks for intelligent alerts to help Service Delivery teams quickly triage incidents and enable automated runbooks. Additionally, you will identify and deploy tools to automate incident detection, notifications, triage, and resolution.
Key Responsibilities:
* Pipeline Approach: Adopt a pipeline approach to enable observability of services deployed across multiple environments, balancing monitoring, logging, and tracing based on service classification.
* Intelligent Alerts: Design and build intelligent alerts using pipelines, onboarding automated runbooks that are triggered with clear audit logs in service management tools like Jira Service Management.
* Dashboards: Create and maintain dashboards for proactive monitoring of services to help teams resolve incidents quickly.
* Monitoring Capability: Continuously improve monitoring capabilities to identify key alerts and thresholds for early warnings before services fail.
* Automation: Enrich intelligent alerts with fine-grained details of the underlying services causing issues, and extend them to trigger automated runbook execution with clear audit logs.
* Collaboration: Work closely with DevOps, Service Reliability, and Service Delivery teams to identify and deploy tools that automate incident detection, notifications, triage, and resolution.
What We're Looking For:
Skills:
* Leadership and Collaboration:
o Strong leadership skills with the ability to mentor, coach, and develop high-performing teams.
o Excellent communication and interpersonal skills, capable of building strong relationships with both technical and business stakeholders.
o Proven ability to collaborate effectively with cross-functional teams, including DevOps, Engineering, Service Reliability, and Service Delivery teams.
* Technical Expertise:
o In-depth knowledge of open-source and commercial observability tools (e.g., Prometheus, Grafana, New Relic).
o Expertise in cloud environments (e.g., AWS, Azure) and infrastructure as code (IaC) tools like Terraform.
* Monitoring and Observability:
o Experience in creating and maintaining dashboards for proactive monitoring of services.
o Ability to design and build intelligent alerts using pipelines, enabling early detection of issues and automated incident response.
o Knowledge of the latest technology trends in the monitoring landscape, such as OpenTelemetry.
* Contract Management:
o Experience in managing third-party provider contracts, including negotiating terms, monitoring performance, and ensuring adherence to SLAs and KPIs.
o Ability to integrate third-party providers seamlessly into the organisation's workflows, aligning with the overall strategic vision.
Experience:
* Professional Experience:
o 5-8 years of experience in technology service delivery and management, focusing on observability, monitoring, and tooling.
* Service Management:
o Practical experience in building and maintaining a Service Catalogue, assigning service level objectives (SLOs), and measuring service level indicators (SLIs).
o Experience in operating production services during peak trading periods without service degradation.
* Automation and Tooling:
o Knowledge of automation tools to simplify alert notifications and extend to automated runbook execution.
o Experience in implementing observability solutions for retail stores or similar environments.
o Proven experience in overseeing and managing Atlassian tools for effective tracking, collaboration, and service management.