Role Summary
As a Site Reliability Engineer (SRE) for our Data Platform, you will be the guardian of our mission-critical data infrastructure. You will bridge the gap between software engineering and systems operations to ensure that our cloud-native environment (built on AWS, Snowflake, and Databricks) is scalable, resilient, and highly available. Your mission is to treat operations as an engineering problem, using automation to eliminate toil and driving a 'reliability-first' culture across our data ecosystem.
Key Responsibilities
1. Infrastructure as Code (IaC): Design and maintain automated provisioning and configuration management for AWS and data platform components using Terraform or CDK.
2. Resiliency & Disaster Recovery: Lead the strategy for high availability. You will design and execute DR drills, failure-mode testing, and recovery validation to ensure data integrity during outages.
3. Reliability Engineering: Define and monitor SLIs, SLOs, and SLAs. You will manage error budgets to balance the velocity of data engineering with the stability of the platform.
4. Observability: Implement comprehensive monitoring, logging, and tracing (using tools like CloudWatch, Datadog, or Grafana) to provide deep visibility...