Are you among the top 1% of Site Reliability Engineers in the UK?
Check below to see whether you have what this opportunity requires and, if so, apply as soon as possible.
Our client, an IT Service Management company, is building a world-class SRE team to support a mission-critical Java-based platform used by millions. If you’re a hands-on engineer with a background in Linux systems, deep AWS expertise, and a passion for incident response, reliability, and scale, we want to hear from you.
What You’ll Be Doing:
Own and evolve our incident management and on-call processes
Ensure uptime, scalability, and security across a massive infrastructure footprint
Work with EKS, EC2, Load Balancers, VPC, CDK, Terraform, and CloudFormation
Write and maintain YAML, Python scripts, and internal tooling
Define and track SLAs, SLOs, and SLIs to drive reliability
Collaborate with platform engineers and developers to support a Java-based product
Operate in a manual, tool-light environment while helping us scale and automate
What We’re Looking For:
7–12 years of experience, with 5+ years in SRE roles
Strong foundation in Linux and system administration
Proven experience in live incident troubleshooting and root cause analysis
Deep AWS knowledge – you can speak to how you’ve used services like EKS, EC2, and Load Balancers in production
Experience with monitoring, alerting, capacity planning, and security best practices
Comfortable working in large-scale environments with thousands of endpoints
Clear communicator who can document and share knowledge across teams
Able to work independently and thrive in a globally distributed team