Staff site reliability engineer - observability

Slough

Motive Group

Site reliability engineer

Posted: 10 November

Offer description

Senior / Staff Site Reliability Engineer - Observability | London (Hybrid)

If you care deeply about building and operating world-class infrastructure for AI at scale, this one’s worth your time.

We’re working with a company that builds the backbone powering some of the most demanding AI workloads on the planet. Think large-scale GPU clusters, global telemetry systems, and distributed training environments used by leading research and enterprise teams.

They’re looking for a Senior or Staff SRE with deep experience in observability at massive scale - someone who’s tuned Prometheus / Mimir, Loki, or Tempo clusters beyond 100M series or 10TB/day of logs, and who thrives in highly technical, fast-moving environments.

You’ll be working on:

* Designing and scaling observability for globally distributed GPU infrastructure
* Building automation that cuts operational toil and improves reliability
* Partnering with platform and infrastructure teams to deliver true visibility across complex AI systems

If you’ve built or operated telemetry stacks for large-scale, GPU-heavy, or multi-tenant environments - and want to work on cutting-edge problems in a business growing faster than most can imagine - then this could be your next step.

Location: London (hybrid)

You: 7+ years' experience, expert in observability at scale, low ego, high ownership.

Comp: 150-200k + 1-2X salary in equity

