Job Description
Job Title: AWS Site Reliability Engineer (Data Platform)
Role Summary
We are looking for an AWS Site Reliability Engineer (SRE) to support and scale a cloud-native data platform built on AWS, Snowflake, and Databricks. The role focuses on improving reliability through automation, disaster recovery (DR) testing, resiliency engineering, observability, and proactive SLI/SLO/SLA management.
Key Responsibilities
1. Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using infrastructure as code (IaC) and CI/CD pipelines.
2. Lead resiliency and disaster recovery planning, including regular DR drills, failure testing, and recovery validation across AWS and data platform components.
3. Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; use error budgets to guide reliability improvements.
4. Build and operate robust observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
5. Partner with data engineering and platform teams to embed reliability-by-design into architecture and delivery practices.
6. Perform root cause analysis (RCA) and drive continuous improvement