Overview
bet365 Stoke-On-Trent, England, United Kingdom
Site Reliability Engineer position at bet365 in Stoke-On-Trent, United Kingdom. bet365 is a leading online gambling company with a global presence and a focus on reliability and innovation in software. This role emphasizes improving system reliability, observability, and incident resolution through engineering practices.
Responsibilities
As a Site Reliability Engineer, you will:
* Enhance system reliability, observability, and performance through an engineering-driven approach.
* Monitor the health, performance, and availability of critical systems and directly impact operational efficiency.
* Implement solutions that improve reliability, including instrumenting services with tools such as OpenTelemetry, improving logging practices, and developing maintainable features.
* Develop tools and automation for effective service management.
* Collaborate across multiple functions to integrate reliability and observability best practices into the software development lifecycle.
* Support governance standards set by central teams to ensure reliability principles are embedded in development.
* Contribute to ensuring systems meet user demands and enhance overall service performance.
* Participate in the company’s hybrid working-from-home policy where applicable.
Qualifications
* Excellent knowledge of Site Reliability Engineering principles, including the creation and management of SLIs and SLOs for reliability and customer satisfaction.
* Experience with modern observability tools and practices (e.g., Splunk, New Relic, Grafana, PagerDuty).
* Experience with modern software development techniques and lifecycles.
* Experience with Infrastructure as Code (IaC) automation and orchestration tools (e.g., Ansible, Terraform).
* Prior experience in a large-scale, 24/7 enterprise where uptime and stability are critical.
* Keen interest in industry trends, particularly Platform Engineering.
* Proficiency in shell scripting for automation and system management tasks.
Additional Information
* Contribute to code that enhances reliability and observability, including telemetry and tooling.
* Develop and maintain tools to improve operational efficiency and resilience.
* Use automation and orchestration platforms to reduce toil and manual activity.
* Build dashboards using telemetry data and technologies like Grafana, Splunk, and New Relic.
* Maintain and administer existing monitoring and analytics toolsets.
* Mentor colleagues in new technologies or practices.
* Participate in live incident resolution and post-mortem analyses with remediation strategies to prevent recurrence.
* Drive initiatives to enhance system reliability and observability and contribute to a culture of continuous improvement.
* Collaborate with central SRE and Observability teams to uphold reliability standards and assist teams in adherence.
* Work with IT Operations to support tooling that delivers business value.
By applying to bet365, you agree to share your Personal Data in accordance with our Recruitment Privacy Notice - https://www.bet365careers.com/privacy-policy
bet365 is committed to creating an inclusive environment where everyone can grow and develop. If you need adjustments or accommodations during the recruitment process, please reach out.