Requirements
Must have:
- Deep expertise in designing, implementing, and configuring modern observability stacks, specifically Prometheus, Grafana, and associated tooling.
- Strong instrumentation strategy (exporters, service discovery, custom metrics).
- Advanced PromQL skills for complex querying and performance analysis.
- Experience building recording/alerting rules and optimizing metric ingestion.
- Knowledge of HA architectures, federation, sharding, and long-term storage (Thanos, Cortex, Mimir).
- Grafana dashboard and panel design focused on performance and operator clarity.
- Best-practice alert configuration and routing.
- Experience with synthetic monitoring (Grafana Synthetic Monitoring, Blackbox exporter).
- Log ingestion/analysis (Loki).
- Familiarity with Real User Monitoring tooling (e.g., Grafana Faro).
- Strong API and automation skills for dashboard provisioning, alert management, and data ingestion.
- Experience integrating the Grafana/Prometheus ecosystem with logging, tracing, and event platforms (Loki, Tempo, OpenTelemetry).
Responsibilities:
- Drive the uplift, resilience, and effectiveness of our monitoring ecosystem.
- Partner with engineering teams to deliver world-class insights through metrics, dashboards, alerts, and automation.
- Influence standards, modernise tooling, and enhance visibility across complex distributed systems.
- Collaborate with Application Stewards and SREs to validate critical assets for monitoring verification and uplift.
- Analyse Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability for critical services.
- Identify and implement improvements across monitoring configurations, alert quality, data models, dashboards, KPIs, SLIs, and SLOs.
- Review roles and responsibilities across observability functions and recommend enhancements aligned to Operational Resilience standards.
- Contribute to delivering automated, end-to-end business flow visibility, surfaced in Grafana through service maps, dependency visualisation, or topology integrations.
- Ensure alerting configurations are reliable, actionable, and noise-optimised, following Alertmanager best practices.
Company:
We are seeking a highly skilled Observability Engineering Lead to shape how we detect, diagnose, and prevent issues across our critical applications. This is a hands-on technical leadership position with a pivotal role in our team, working onsite two days a week. We are committed to fostering an inclusive environment as we strive for excellence in our monitoring practices.