Site Reliability Engineer
We're working with a global technology consultancy that designs, builds, and supports modern software platforms for enterprise customers worldwide. They partner closely with clients to deliver reliable, scalable, cloud-native solutions.
The Role
As an SRE, you'll play a key role in ensuring the availability, performance, and scalability of production systems, supporting customers across the EMEA region. You'll also help to build, mature, and enhance the SRE function. This is a hands‑on, technical role focused on reliability, automation, and operational excellence across a distributed, cloud-based platform.
Key Responsibilities
* Platform Reliability: Deploy, operate, and improve Kubernetes clusters across multiple cloud environments.
* Service Performance: Design and implement processes to enhance system reliability, availability, and scalability.
* CI/CD Enablement: Build and optimise CI/CD pipelines to support safe, repeatable deployments.
* Observability & Incidents: Own monitoring, alerting, and incident response to minimise downtime and speed recovery.
* Root Cause Analysis: Lead post‑incident reviews and implement long‑term preventative improvements.
* Automation: Reduce operational toil through automation and performance optimisation.
* On‑Call: Participate in weekday coverage and a once‑monthly weekend rota.
Collaboration & Stakeholder Engagement
* Work closely with engineering, infrastructure, and product teams to embed SRE best practices.
* Advocate for reliability, resilience, and operational excellence across teams.
* Collaborate with a globally distributed engineering function.
* Engage directly with customers to resolve incidents and improve user experience.
Skills & Experience
* 5+ years of proven experience as an SRE or in a similar role, supporting complex distributed systems.
* Strong Kubernetes experience (AKS, EKS, GKE, or similar).
* Hands‑on with observability tools such as Prometheus, Grafana, Kibana, Vector, or Superset.
* Experience with at least one major cloud platform: AWS, Azure, GCP, or Linode.
* SQL database experience (PostgreSQL beneficial but not essential).
* Proficiency in Python, Go, or Rust.
* Strong Linux expertise, including performance tuning and troubleshooting.
* Excellent communication skills, with the ability to work effectively with engineers and customers.
Please apply now if you meet the above criteria, or contact Andrew Harrison directly.