Senior / Staff Site Reliability Engineer - Observability | London (Hybrid)
If you care deeply about building and operating world-class infrastructure for AI at scale, this one’s worth your time.
We’re working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large-scale GPU clusters, global telemetry systems, and distributed training environments used by leading research and enterprise teams.
They’re looking for a Senior or Staff SRE with deep experience in observability at massive scale - someone who’s tuned Prometheus / Mimir, Loki, or Tempo clusters past 100M series or 10 TB/day of logs, and who thrives in highly technical, fast-moving environments.
You’ll be working on:
* Designing and scaling observability for globally distributed GPU infrastructure
* Building automation that cuts operational toil and improves reliability
* Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems
If you’ve built or operated telemetry stacks for large-scale, GPU-heavy, or multi-tenant environments - and want to work on cutting-edge problems in a business growing faster than most can imagine - then this could be your next step.
Location: London (hybrid)
You: 7+ years of experience, expert in observability at scale, low ego, high ownership.
Comp: 150-200k + 1-2X salary in equity