An established technology-driven organisation is seeking an experienced Site Reliability Engineer (SRE) in Glasgow to strengthen and scale its cloud-native data platform, which uses AWS, Snowflake, and Databricks. This position offers the opportunity to drive automation, resilience, and operational excellence across critical data services.
Key Responsibilities:
* Automate infrastructure provisioning and platform operations using Infrastructure as Code and CI/CD tools.
* Lead and execute reliability initiatives including disaster recovery planning, failure testing, and resilience validation.
* Define and manage service-level indicators, objectives, and agreements (SLIs/SLOs/SLAs) to drive measurable improvements in reliability.
* Build observability solutions to monitor AWS, Snowflake, and Databricks workloads.
* Collaborate with engineering teams to embed reliability best practices throughout platform development.
* Analyse incidents and proactively address root causes to improve availability and performance.
* Provide operational support, drive incident resolution, and implement automated fixes for recurring issues.
Requirements:
* Strong knowledge of SRE principles and practical experience defining SLAs, SLOs, and error budgets.
* Demonstrated AWS expertise (e.g., EC2, S3, IAM, VPC, CloudWatch) in production environments.
* Experience with observability tools, monitoring, and alerting practices.
* Proficiency in automation, Infrastructure as Code (Terraform, CloudFormation, or CDK), and scripting (Python/Bash).
* Exposure to Snowflake and/or Databricks data platforms.
* Background in disaster recovery (DR), chaos engineering, CI/CD pipelines, GitOps, or supporting large-scale data environments.