Lead site reliability engineer

Sunderland

Tombola job board

Site reliability engineer

Posted: 4 September

Offer description

Lead Site Reliability Engineer (SRE) - Sunderland (Hybrid)

tombola, Sunderland, England, United Kingdom

Overview

At tombola, we pride ourselves on building our own exceptional games and platforms in-house. That means keeping everything running flawlessly is paramount. We're seeking a Lead Site Reliability Engineer (SRE) to join us and help ensure our critical systems and services are always reliable, available, and performing at their best.

Responsibilities

* Team Leadership and Development: Lead, manage, and develop direct reports through effective 1-to-1s, objective setting (OKRs), and performance management. Make team goals clear and ensure they align with our broader business objectives. Collaborate with other teams and departments to achieve shared success. Partner with our People Partner for tech to build robust team management practices.
* System Reliability and Availability: Monitor and maintain the availability and reliability of critical systems and services so that they meet uptime SLAs. Incident management: respond quickly to incidents, investigate root causes, and ensure effective postmortems and continuous improvement processes. Failure detection and response: proactively identify potential failures or performance bottlenecks before they impact users, and respond effectively to failures and outages.
* Monitoring and Alerting: Implement monitoring systems (e.g., Dynatrace) for application performance, infrastructure health, and system metrics. Alerting: create and manage alerting systems to notify us about issues or potential risks in a timely manner, minimizing impact on our players. Metrics collection: define and track key metrics (e.g., uptime, latency, request rates) to measure system health and performance.
* Incident Response: Resolve incidents quickly to minimize downtime and restore service as fast as possible. Post-incident analysis: after resolving incidents, perform root cause analysis (RCA), including post-incident reviews, and document findings to prevent similar issues in the future.
* Automation and Efficiency: Automate manual tasks to boost efficiency, reduce human errors, and accelerate delivery. Infrastructure automation: utilise Terraform, Git, and TeamCity to automate infrastructure provisioning and configuration management. Deployment pipelines: help develop and maintain automated deployment pipelines (CI/CD) to streamline releases and reduce manual intervention.
* Capacity Planning and Scaling: Plan for scalability to meet demand, both horizontally and vertically. Optimize resource usage (CPU, memory, storage). Forecasting and capacity planning: analyze usage trends and plan for future capacity needs.
* Performance Optimization: Carry out performance tuning, load testing, and stress testing to ensure systems can handle high traffic and performance demands without failure.
* Infrastructure Management: Cloud infrastructure management: manage AWS resources with proper scaling, cost optimization, resilience and environment consistency. Disaster recovery planning: develop and maintain disaster recovery strategies to ensure quick recovery in case of catastrophic failures.
* Collaboration with Development Teams: Work with development teams to ensure features are designed with reliability, scalability, and operational needs in mind. Service ownership: promote a culture of service ownership where developers are involved in operational aspects. Performance and reliability feedback: provide feedback to developers about production issues and recommend improvements.
* Security and Compliance: Apply security best practices, support compliance and auditing, and maintain access control and monitoring for production environments.
* Documentation: Create and maintain detailed documentation for infrastructure components, incident response procedures, and runbooks.
* Continuous Improvement: Iterate on improvements and seek out new technologies to enhance reliability, performance, and efficiency.

Seniority and Employment

* Seniority level: Mid-Senior level
* Employment type: Full-time
* Job function: Engineering and Information Technology
* Industries: IT Services and IT Consulting
