Lead Site Reliability Engineer | Fully Remote | AWS, Kubernetes, Terraform | High-Scale SaaS | £90K
Company Overview
High-growth SaaS platform operating at national scale, powering critical services for thousands of customers daily. Engineering-led, cloud-native, and focused on delivering highly reliable, scalable distributed systems.
This is a senior leadership role driving SRE strategy, platform scalability and operational excellence. You’ll own reliability, performance and automation across multiple engineering teams, evolving the platform to handle rapid growth.
Key Responsibilities & Experience
* Define and scale SRE practices across product teams.
* Ideally from a technical / software engineering (SWE) background.
* Own system design for reliability, scalability and performance.
* Lead platform reliability, availability and incident management.
* Drive automation, IaC, observability and continuous improvement.
* Guide root cause analysis and implement resilience strategies.
* Mentor and technically lead SRE / Platform engineers.
* Support large-scale re-architecture, capacity planning and FinOps alignment.
Core Technical Environment
* Cloud: AWS (high-throughput systems: 1,000–6,000+ req/sec)
* IaC: Terraform, configuration management
* Containers: Kubernetes, Docker (ECS beneficial)
* Languages: Python, Go or similar
* Observability: Prometheus, Datadog or equivalents
* CI/CD: Modern automated pipelines
* Systems: Distributed systems, microservices, resilience engineering