Site Reliability Engineer Lead (SRE) – AWS | Observability | Incident Management
Robert Half International (an S&P 500 global staffing provider) is supporting a global consulting firm in sourcing an interim SRE Lead for a major financial services engagement. This role will focus on improving platform reliability, stabilising production environments, and embedding best-in-class SRE practices across a complex, high-availability estate.
Assignment Details
* £500–£550 per day via PAYE, plus an additional 12.07% holiday pay on top (employer’s NI and tax are deducted at source, unlike with umbrella companies, and there are no umbrella company admin fees)
* Initial 6-month contract
* Hybrid working – 2–3 days per week in the City of London
* Start date: c. 2–4 week turnaround, with an anticipated start (including onboarding paperwork) of w/c 01/05
Key Responsibilities
* Lead and improve incident management processes (detection, triage, escalation, resolution)
* Drive major incident response (P1/P0) and post-incident reviews (blameless postmortems)
* Define and implement SRE principles including SLIs, SLOs, SLAs and error budgets
* Build and enhance observability frameworks across metrics, logs and tracing
* Drive automation and reduction of manual toil across operational processes
* Implement runbooks, playbooks and operational readiness standards
* Work closely with engineering, platform and security teams to embed reliability into delivery
* Support the design of resilient, highly available systems (failover, DR, multi-region)
Key Skills & Experience
* Proven experience in SRE, DevOps or Platform Engineering roles within complex environments
* Strong hands-on experience with incident management and production support at scale
* Deep experience with observability tooling (e.g. Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
* Solid experience with AWS cloud environments (EKS/ECS, Lambda, API Gateway, etc.)
* Experience with CI/CD pipelines, automation and Infrastructure as Code (Terraform, Ansible, etc.)
* Strong understanding of system reliability, performance and resilience engineering principles
* Experience working in regulated or high-availability environments (financial services preferred)
Nice to Have
* Experience with chaos engineering or resilience testing
* Exposure to AIOps or intelligent automation frameworks
* Experience transitioning or improving outsourced / offshore support models
All candidates will be required to complete standard screening checks, including Right to Work, financial background checks and referencing covering the last 5 years.