Site reliability engineer

Hove

Infoplus Technologies UK Limited

Posted: 18 February

Offer description

Site Reliability Engineer
1-year fixed-term contract (FTC)
Hove, UK (Hybrid)

Description:

The SRE will play a pivotal role in driving the modernization of IT operations by implementing observability practices and automating toil. The position requires a deep understanding of Site Reliability Engineering (SRE) principles, modern observability tools, and automation techniques to ensure scalability, reliability, and efficiency in IT systems. It calls for a strategic thinker with hands-on expertise who can lead modernization efforts while fostering a culture of reliability and innovation.

Primary Responsibilities:

• Work closely with the Product Engineering team to implement strategies for modernizing IT operations, enhancing observability, and reducing toil.

• Architect and deploy observability platforms to monitor system health, performance, and reliability effectively.

• Propose & drive strategies for AI-driven alerting and proactive anomaly detection to reduce MTTD & MTTR.

• Develop and enforce SRE best practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.

• Establish an AIOps roadmap for improving operational efficiency.

• Lead efforts to automate repetitive tasks (toil) using scripting, orchestration tools, and AI/ML-based solutions.

• Drive toil-automation initiatives, including automated incident response and self-healing, to achieve autonomous operations.

• Collaborate with cross-functional teams to ensure systems are scalable, resilient, and maintainable.

• Drive incident management and root cause analysis processes through automation, ensuring continuous improvement to enable autonomous operations.

• Partner with engineering, architecture, and product teams to enable shift-left engineering practices that ensure reliability.

• Mentor and guide teams on adopting SRE principles and tools.

• Advocate for a culture of reliability, automation, and continuous improvement across the organization.

Key Skills:

• Strong expertise in implementing Site Reliability Engineering (SRE) principles.

• Advanced knowledge of establishing observability with Dynatrace and Datadog (primary skills).

• Proficiency in automation & scripting using Python & Ansible (primary skills).

• Strong experience with cloud platforms – AWS & Azure (primary skills).

• Solid understanding of containerization and orchestration tools like Docker and Kubernetes.

• Proficiency in cloud-native distributed systems and microservices architecture.

• Exposure to AI/ML techniques for predictive analytics and automated problem resolution.

• Familiarity with CI/CD pipelines and with enabling automated release and deployment solutions.

• Nice to have: experience with chaos engineering tools such as Gremlin or Chaos Monkey, and with implementing automation frameworks for resilience tracking.

• Ability to manage and prioritize multiple projects in a fast-paced environment.

• Strong interpersonal and communication skills to work effectively across teams.

• Excellent problem-solving, analytical thinking, and adaptability.

• Strategic mindset balancing engineering excellence with business priorities.

Preferred Qualifications:

• 12+ years of experience in IT operations, SRE, or DevOps roles.

• Proven track record as an SRE implementing observability and automation solutions in large-scale environments.

• Certifications in cloud platforms, observability tools, and other SRE-related areas.
