Job Title:
Site Reliability Engineer
Location:
Hove, UK
Job Mode:
Hybrid
Job Type:
Fixed-Term Contract (FTC)
Primary Responsibilities
* Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.
* Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.
* Propose & drive strategies for AI‑driven alerting and proactive anomaly detection to reduce MTTD & MTTR.
* Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.
* Establish an AIOps roadmap for improving operational efficiency.
* Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML‑based solutions.
* Drive toil automation initiatives, including automated incident response and self‑healing, to achieve autonomous operations.
* Collaborate with cross‑functional teams to ensure systems are scalable, resilient, and maintainable.
* Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.
* Partner with engineering, architecture, and product teams to enable shift‑left engineering practices that ensure reliability.
* Mentor and guide teams on adopting SRE principles and tools.
* Advocate for a culture of reliability, automation, and continuous improvement across the organization.
Key Skills
* Strong expertise in implementing Site Reliability Engineering (SRE) principles.
* Advanced knowledge of establishing observability with Dynatrace & Datadog (primary skills).
* Proficiency in automation & scripting using Python & Ansible (primary skills).
* Strong experience with cloud platforms – AWS & Azure (primary skills).
* Solid understanding of containerization and orchestration tools like Docker and Kubernetes.
* Proficiency in cloud‑native distributed systems & microservices architecture.
* Exposure to AI/ML techniques for predictive analytics and automated problem resolution.
* Familiarity with CI/CD pipelines and with enabling automated release & deployment solutions.
* Nice to have: experience with chaos engineering tools like Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.
* Ability to manage and prioritize multiple projects in a fast‑paced environment.
* Strong interpersonal and communication skills to work effectively across teams.
* Excellent problem‑solving, analytical thinking, and adaptability.
* Strategic mindset balancing engineering excellence with business priorities.
Preferred Qualifications
* 12+ years of experience in IT operations, SRE, or DevOps roles.
* Proven track record as an SRE implementing observability and automation solutions in large‑scale environments.
* Certifications in cloud platforms, observability tools & other SRE‑related areas.