Job Title: Site Reliability Engineer
Location: Hybrid Remote – London EC2M
Contract (12 months)
Outside IR35
About the Role:
We are partnering with one of the top companies in the mobile industry to hire a Site Reliability Engineer (SRE). In this role, you will collaborate with cross-functional teams to drive the design, development, and delivery of high-performing, scalable, and reliable infrastructure and services. You’ll be responsible for building robust systems, automating operations, and enhancing observability and deployment pipelines for modern cloud-native applications.
Key Responsibilities:
* System Reliability & Performance: Maintain and scale critical services and infrastructure. Identify performance bottlenecks and work closely with product engineers to optimise applications.
* Kubernetes Operations: Administer, scale, and troubleshoot clusters in GKE, EKS, or other Kubernetes environments.
* Infrastructure as Code (IaC): Design and maintain scalable infrastructure using Terraform, and automate deployments across public, private, or hybrid clouds (mainly AWS).
* CI/CD Pipeline Enhancement: Build and improve robust CI/CD pipelines to support fast and safe deployment cycles.
* Observability & Monitoring: Implement code-based instrumentation and telemetry. Ensure systems are observable through logging, metrics, and alerting.
* Automation & Scripting: Write tooling and automation scripts in Python, Go, or Rust to reduce toil and manual intervention.
* Storage & Networking: Manage and optimise storage services such as Amazon S3 or Google Cloud Storage (GCS). Resolve complex networking issues in multi-cloud environments.
Essential Requirements:
* 5+ years of hands-on experience as a Site Reliability Engineer.
* Proven expertise in Kubernetes (GKE/EKS).
* Strong proficiency in Python, Go, or Rust.
* Solid experience with AWS and Infrastructure as Code using Terraform.
* Deep understanding of Linux internals, standard networking protocols, and distributed systems architecture.
* Hands-on experience with automation and performance optimisation.
* Strong knowledge of SRE principles and methodologies.
* Experience with observability tools and telemetry systems.
* Exposure to Google Cloud Platform (GCP).
* Familiarity with hybrid or multi-cloud architecture.
* Experience with service meshes or edge proxies (e.g., Envoy, Istio).
* Working knowledge of container security best practices.