About the Role
We are seeking a skilled Site Reliability Engineer to join our team. This role involves ensuring the reliability, scalability, and performance of our cloud infrastructure and applications.
The ideal candidate will have experience applying SRE principles and designing and implementing robust observability, monitoring, and logging solutions.
Responsibilities
* Ensure system reliability, performance, and scalability through monitoring and automation
* Design and implement effective logging, monitoring, and alerting strategies
* Proactively identify and resolve performance bottlenecks and infrastructure issues
* Automate infrastructure provisioning, configuration management, and deployments
* Implement high-availability and fault-tolerant solutions
* Work with DevOps engineers to streamline CI/CD pipelines and automate testing
What We're Looking For
* Experience with SRE principles, including incident management, error budgets, and service-level objectives (SLOs)
* Strong proficiency with observability and monitoring tools like Grafana, Prometheus, and Loki
* Experience with distributed tracing and telemetry tools like OpenTelemetry
* Understanding of cloud networking architecture and load balancing techniques
* Experience with container orchestration platforms like Kubernetes
* Proficiency in infrastructure as code (IaC) tools like Terraform or Ansible
Benefits
* 25 days' holiday, rising with each year of service to a maximum of 30
* Private Medical Insurance covering dental and optical
* Company pension scheme
* Life Assurance – 4x your annual salary
* 1 day paid volunteering per year
* Enhanced maternity/paternity offerings
* Employee Assistance Programme
* Cycle to work scheme
* On-site gym
Our Culture
We foster a collaborative and supportive culture that encourages continuous learning and improvement.
We are committed to promoting Diversity & Inclusion and Social Responsibility.
Our DE&I group and charitable initiatives contribute positively to our community.