Length: 6 months (potential extension)
We’re working with an established, well-known bank that’s investing heavily in observability, operational resilience, and monitoring maturity across its technology estate. They’re now looking for an Engineering Lead with deep expertise in Prometheus and Grafana to help uplift monitoring, alerting, and end-to-end service visibility across critical banking platforms.
What you’ll be doing:
* Collaborating with Application Stewards and SREs to define and validate critical assets in scope for monitoring
* Analysing Prometheus scrape coverage, exporter deployment, and Grafana dashboard availability
* Improving monitoring configuration, alert quality, dashboards, KPIs, SLIs, and SLOs
* Helping define clear roles and responsibilities around observability, aligned to operational resilience standards
* Delivering automated, end-to-end business flow visibility using Grafana (service maps, dependency visualisation, topology views)
* Ensuring alerting is reliable, actionable, and low-noise, following Alertmanager best practices
What you’ll need to have:
Strong hands‑on experience with Prometheus, including:
* Advanced PromQL for analysis and performance troubleshooting
* Recording rules, alerting rules, and metric optimisation
* High availability architectures, sharding, federation, and long-term storage (e.g. Thanos, Cortex, Mimir)
Deep experience with Grafana, including:
* High-quality dashboard and panel design
* Alerting configuration and routing best practices
* Synthetic monitoring (e.g. Blackbox Exporter / Grafana Synthetic Monitoring)
* Log ingestion and analysis (e.g. Loki)
* Real User Monitoring or web telemetry integrations (e.g. Grafana Faro)
* APIs and automation for dashboards, alerts, and data ingestion
* Integrating metrics, logs, and traces (e.g. Loki, Tempo, OpenTelemetry)
Nice to have:
* Experience using observability intelligence or anomaly detection
* Correlating metrics, logs, and traces for deep root cause analysis
* Providing predictive insights to reduce operational risk