Lead site reliability engineer

Sunderland

tombola

Site reliability engineer

€100,000 - €125,000 a year

Posted: 13 June

Offer description

Lead Site Reliability Engineer (Lead SRE)

Ready to keep things running smoothly? Join our tombola team!

At tombola, we pride ourselves on building our own exceptional games and platforms in-house. That means keeping everything running flawlessly is paramount! We're seeking a Lead Site Reliability Engineer (SRE) to join us and help ensure our critical systems and services are always reliable, available, and performing at their best.

What will you be doing?

As an SRE, you'll be instrumental in implementing automation, monitoring, and incident response strategies to minimize downtime and optimize our operations. You'll collaborate closely with our development, infrastructure, and security teams, balancing exciting new feature delivery with rock-solid system stability.

Key Accountabilities and Responsibilities:

Team Leadership and Development

* Providing leadership, management, and development for direct reports through effective 1-to-1s, objective setting (OKRs), and performance management.
* Making team goals clear and ensuring they align with our broader business objectives.
* Collaborating with other teams and departments to achieve shared success.
* Partnering with our People Partner for tech to build robust team management practices.

System Reliability and Availability

* Ensure system uptime: Monitor and maintain the availability and reliability of critical systems and services, meeting all uptime SLAs (Service Level Agreements).
* Incident management: Quickly respond to incidents, investigate root causes, and ensure effective postmortems and continuous improvement processes are in place.
* Failure detection and response: Proactively identify potential failures or performance bottlenecks before they impact users, and respond to failures and outages effectively.

Monitoring and Alerting

* Implement monitoring systems: Set up and maintain robust monitoring systems (e.g., Dynatrace) for application performance, infrastructure health, and system metrics.
* Alerting: Create and manage alerting systems to notify us about issues or potential risks in a timely manner, minimizing impact on our players.
* Metrics collection: Define and track key metrics (e.g., uptime, latency, request rates) to measure system health and performance.

Incident Response

* Incident resolution: Work quickly to resolve incidents, minimize downtime, and restore service as fast as possible.
* Post-incident analysis: After resolving incidents, perform root cause analysis (RCA), including a post-incident review, and document findings to prevent similar issues in the future.

Automation and Efficiency

* Automate manual tasks: Automate repetitive operational tasks to boost efficiency, reduce human errors, and accelerate delivery.
* Infrastructure automation: Utilise Terraform, Git, and TeamCity to automate infrastructure provisioning and configuration management.
* Deployment pipelines: Help develop and maintain automated deployment pipelines (e.g., CI/CD) to streamline releases and reduce manual intervention.

Capacity Planning and Scaling

* Plan for scalability: Ensure our systems can scale efficiently to meet demand, both horizontally (adding more servers) and vertically (increasing server resources).
* Optimize resource usage: Monitor and optimize resource usage (CPU, memory, storage) to ensure the infrastructure handles increasing loads without waste.
* Forecasting and capacity planning: Analyze usage trends and plan for future capacity needs to ensure systems remain reliable and perform under varying workloads.

Performance Optimization

* Performance tuning: Continuously analyze system performance and make optimizations (e.g., tuning databases, improving response times, optimizing caching).
* Load testing and stress testing: Conduct load and stress tests to ensure systems can handle high traffic and performance demands without failure.

Infrastructure Management

* Cloud infrastructure management: Manage AWS cloud resources and ensure proper scaling, cost optimization, resilience and environment consistency.
* Disaster recovery planning: Develop and maintain disaster recovery strategies to ensure quick recovery in case of catastrophic failures.

Collaboration with Development Teams

* Work with development teams: Collaborate with development teams to ensure that new features and applications are designed with reliability, scalability, and operational needs in mind.
* Service ownership: Promote a culture of service ownership where developers are involved in operational aspects, ensuring they are accountable for the operational success of their services.
* Performance and reliability feedback: Provide feedback to developers about performance and reliability issues encountered in production and recommend improvements.

Security and Compliance

* Security best practices: Implement security practices within the infrastructure and application lifecycle to minimize security vulnerabilities.
* Compliance and auditing: Ensure systems are compliant with industry standards, regulations, and internal security policies.
* Access control and monitoring: Ensure proper access control mechanisms are in place to protect production environments.

Documentation

* Documentation of processes: Create and maintain detailed documentation for all infrastructure components, incident response procedures, and runbooks to ensure efficient operations.

Continuous Improvement

* Iterative improvements: Continuously evaluate and improve system reliability, performance, and efficiency, seeking new technologies or approaches to enhance operational effectiveness.

* Seniority level: Mid-Senior level
* Employment type: Full-time
* Job function: Engineering and Information Technology
* Industries: IT Services and IT Consulting
