Role Summary
As a Site Reliability Engineer (SRE) for our Data Platform, you will be the guardian of our mission-critical data infrastructure. You will bridge the gap between software engineering and systems operations to ensure our cloud-native environment, built on AWS, Snowflake, and Databricks, is scalable, resilient, and highly available. Your mission is to treat operations as an engineering problem, using automation to eliminate toil and driving a 'reliability-first' culture across our data ecosystem.
Key Responsibilities
Infrastructure as Code (IaC): Design and maintain automated provisioning and configuration management for AWS and data platform components using Terraform or CDK.
Resiliency & Disaster Recovery: Lead the strategy for high availability. You will design and execute DR drills, failure-mode testing, and recovery validation to ensure data integrity during outages.
Reliability Engineering: Define and monitor SLIs, SLOs, and SLAs. You will manage error budgets to balance the velocity of data engineering with the stability of the platform.
Observability: Implement comprehensive monitoring, logging, and tracing (using tools like CloudWatch, Datadog, or Grafana) to provide deep visibility into Snowflake and Databricks workloads.
Incident Management & RCA: Lead the response to platform incidents. You won't just fix the problem; you will perform deep-dive Root Cause Analysis (RCA) to ensure the same issue never happens twice.
Toil Reduction: Identify manual operational tasks and automate them out of existence, improving the developer experience for our data scientists and analysts.
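For readers less familiar with the error-budget idea mentioned under Reliability Engineering, the arithmetic can be sketched roughly as follows. This is an illustration only: the 99.9% availability target, the 30-day window, and the function names are assumptions made for the example, not the platform's actual SLOs or tooling.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the SLO has been breached for this window.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Assumed example: 99.9% target, 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))
    print(round(budget_remaining(0.999, 21.6), 2))
```

In this sketch, a remaining fraction close to zero would be a signal to slow down risky changes, while a healthy budget leaves room for faster data engineering releases.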