Site reliability engineering (SRE) is critical to keeping cloud-based products and services scalable and automated. In this role, you will embed SRE best practices, improve platform resilience, and troubleshoot service issues using an engineering-first approach.
This position requires hands-on expertise with Google Cloud Platform (GCP), including security and networking, as well as strong experience managing Kubernetes clusters in a production environment.
You should be proficient in Python, Java, Golang, Bash, or PowerShell for scripting and automation, and have a solid understanding of SLOs, SLIs, and SLAs, along with experience in monitoring, alerting, and logging.
The ideal candidate will also be familiar with Infrastructure as Code (Terraform) and CI/CD pipelines (Jenkins, Azure DevOps, etc.), and have experience with observability tools such as Dynatrace and Google Cloud's operations suite (formerly Stackdriver), including Cloud Monitoring and Cloud Logging.