Role: SRE
Location: Hove, UK
Employment Type: Permanent or Contract
Work Arrangement: Hybrid (2 days per week in the office)
No. of Positions: 1
We are seeking an experienced Site Reliability Engineer (SRE) to drive the modernization of IT operations through the implementation of observability practices, automation, and reliability engineering principles. The role requires a strategic thinker with strong hands‑on expertise who can enhance system reliability, scalability, and operational efficiency while reducing manual operational tasks.
The successful candidate will work closely with engineering, architecture, and product teams to implement modern reliability practices, automate operational workflows, and establish robust monitoring and incident management frameworks.
Key Responsibilities
* Collaborate with engineering teams to modernize IT operations by improving observability, automation, and operational efficiency.
* Design and implement observability platforms to effectively monitor system health, performance, and reliability.
* Develop strategies for AI-driven alerting and proactive anomaly detection to reduce Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
* Establish and enforce SRE best practices, including Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
* Define and implement an AIOps roadmap to enhance operational intelligence and automation.
* Automate repetitive operational tasks (toil reduction) using scripting, orchestration tools, and automation frameworks.
* Implement self‑healing systems and automated incident response mechanisms to support autonomous operations.
* Collaborate with cross‑functional teams to ensure systems are scalable, resilient, and maintainable.
* Lead incident management, root cause analysis, and post‑incident improvement initiatives.
* Promote shift‑left reliability practices across engineering and product teams.
* Mentor team members and advocate for a culture of reliability, automation, and continuous improvement.
Required Skills & Experience
* Strong expertise in Site Reliability Engineering (SRE) principles and practices.
* Hands‑on experience implementing observability solutions, particularly with Dynatrace and Datadog.
* Strong scripting and automation experience using Python and Ansible.
* Experience working with cloud platforms such as AWS and Azure.
* Solid understanding of containerization and orchestration technologies, including Docker and Kubernetes.
* Experience working with cloud‑native distributed systems and microservices architectures.