Job Description
We are looking for a highly skilled Site Reliability Engineer (SRE) to own and evolve our enterprise observability and reliability platforms.
This role is responsible for ensuring availability, performance, scalability, and reliability of large-scale, cloud-native applications running on Kubernetes and OpenShift.
The SRE will partner closely with application and platform teams to embed reliability engineering, SLO-driven operations, and automation-first practices.
Key Responsibilities
* Reliability Engineering & SRE Practices: Define, implement, and continuously improve SLIs, SLOs, and error budgets for enterprise applications.
* Drive reliability-focused decision-making using error budgets, MTTD, MTTR, and service health metrics.
* Proactively identify reliability risks and performance bottlenecks and drive remediation.
* Lead incident response, post-incident reviews (blameless postmortems), and reliability improvements.
* Observability Platform Ownership: Own and operate open-source–based observability platforms covering metrics, logging, and distributed tracing.
* Enhance, optimize, and migrate observability solutions to improve scalability, resilience, and cost efficiency.
* Maintain and tune Prometheus and other TSDBs, including cardinality management and resource optimization.
* Operate distributed tracing tooling such as OpenTelemetry, Jaeger, and Zipkin, including tuning sampling strategies and troubleshooting traces across microservices.
* Kubernetes & OpenShift Reliability: Support and enable application teams to migrate workloads to newer OpenShift/Kubernetes versions.
* Deploy, manage, and troubleshoot stateful and stateless workloads on Kubernetes platforms.
* Improve platform reliability through automation, self-healing, and standardized deployment patterns.
* Partner with developers to implement application instrumentation and reliability best practices.
* Logging, Alerting & Incident Response: Operate enterprise logging platforms such as ELK Stack and Grafana Loki, including Elasticsearch cluster management and index lifecycle management.
* Design and maintain actionable alerting aligned to SLOs and business impact.
* Integrate alerting platforms with PagerDuty, Microsoft Teams, and other incident management tools.
* Reduce alert fatigue by implementing alert hygiene and signal-to-noise optimization.
* Dashboards & Service Visibility: Deploy and administer visualization tools such as Grafana and Kibana.
* Create standardized, reusable dashboards for service health, reliability, and capacity planning.
* Implement and manage RBAC across observability platforms.
* Infrastructure, Security & Automation: Troubleshoot observability infrastructure issues across Linux VMs and Kubernetes pods.
* Secure observability and platform endpoints using TLS, reverse proxies, and authentication mechanisms (MFA, LDAPS, OAuth).
* Build and maintain CI/CD pipelines for observability and reliability tooling.
* Extend pipelines to support multiple environments and regions with consistency and repeatability.
* Reliability Culture & Enablement: Champion an SRE and observability-first culture across engineering teams.
* Coach teams on golden signals, service health modeling, and reliability trade-offs.
* Enable teams to move from reactive monitoring to proactive reliability engineering.
Required Skills & Experience
* Core Technical Skills: Strong hands-on experience with Prometheus and Grafana; Elasticsearch and Kibana (cluster operations, ILM, tuning); OpenTelemetry, Jaeger, and Zipkin; Kubernetes and OpenShift; Linux OS troubleshooting; and CI/CD pipelines and automation.
* Solid understanding of SRE principles, including SLIs, SLOs, error budgets, and incident management.
* Experience supporting production, highly available, distributed systems.
* Working Hours: Monday to Friday, 9:00 AM – 6:00 PM. Occasional weekend support may be required for critical deployments or incidents; compensatory time off will be provided.