Site reliability engineer

Braehead

FBI & TMT

Posted: 21h ago

Offer description

Role Summary

As a Site Reliability Engineer (SRE) for our Data Platform, you will be the guardian of our mission-critical data infrastructure. You will bridge the gap between software engineering and systems operations to ensure our cloud-native environment, built on AWS, Snowflake, and Databricks, is scalable, resilient, and highly available. Your mission is to treat operations as an engineering problem, using automation to eliminate toil and driving a 'reliability-first' culture across our data ecosystem.

Key Responsibilities

Infrastructure as Code (IaC): Design and maintain automated provisioning and configuration management for AWS and data platform components using Terraform or CDK.

Resiliency & Disaster Recovery: Lead the strategy for high availability. You will design and execute DR drills, failure-mode testing, and recovery validation to ensure data integrity during outages.

Reliability Engineering: Define and monitor SLIs, SLOs, and SLAs. You will manage error budgets to balance the velocity of data engineering with the stability of the platform.

Observability: Implement comprehensive monitoring, logging, and tracing (using tools like CloudWatch, Datadog, or Grafana) to provide deep visibility into Snowflake and Databricks workloads.

Incident Management & RCA: Lead the response to platform incidents. You won't just fix the problem; you will perform deep-dive Root Cause Analysis (RCA) to ensure the same issue never happens twice.

Toil Reduction: Identify manual operational tasks and automate them out of existence, improving the developer experience for our data scientists and analysts.

