Principal Engineer - Reliability Engineering
About the Role
We are seeking a seasoned Principal Engineer to lead the design, development, and evolution of our Observability Platform, ensuring it supports our rapidly scaling systems and engineering teams. The role involves leveraging Machine Learning (ML) and Artificial Intelligence (AI) to deliver advanced insights that proactively improve system health and reduce Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR). The ideal candidate will have deep expertise in observability, monitoring, distributed systems, and strategic platform development.
Key Responsibilities
1. Platform Leadership: Architect, design, and implement a scalable Observability Platform supporting metrics, logs, traces, and events; integrate ML/AI solutions for anomaly detection and predictive insights; develop platform capabilities to ensure system reliability and performance; establish standards and best practices.
2. Strategic Initiatives: Collaborate with engineering teams to define observability strategies; identify and integrate the latest observability technologies including AI-based analytics; promote a platform-first approach; implement real-time AI/ML-powered insights to enhance detection and resolution times.
3. Operational Excellence: Ensure high availability, performance, and security of the platform; optimize data collection, processing, and storage; define SLAs, SLOs, and SLIs; leverage AI/ML models to improve MTTD and MTTR through predictive analysis and automated responses.
4. Mentorship and Collaboration: Act as a technical leader and mentor within engineering teams; work with stakeholders across SRE, infrastructure, and application teams; advocate for observability as a key operational enabler.
Qualifications
* Proven experience in building and scaling cloud-native observability platforms.
* Deep understanding of the observability pillars (metrics, logs, traces) and tools such as Prometheus, Grafana, OpenTelemetry, Jaeger, Kibana, and the Elastic Stack.
* Hands-on experience integrating ML/AI models for insights, anomaly detection, and predictive analysis.
* Strong expertise in designing scalable distributed systems for high-throughput data processing.
* Proficiency in programming languages such as Go, Python, or Java, and in Infrastructure-as-Code tools such as Terraform or Pulumi.
* Experience with cloud platforms (AWS, GCP, Azure) and managing observability costs.
* Leadership skills in guiding cross-functional engineering teams.
Preferred Qualifications
* Experience applying AI/ML for proactive alerting and predictive scaling.
* Knowledge of service mesh technologies (e.g., Istio, Linkerd).
* Contributions to open-source observability or ML/AI projects.
* Proficiency with containerization (Docker, Kubernetes).
* Understanding of statistical analysis and data mining techniques.
Additional Information
Join a dynamic, innovative environment at Just Eat Takeaway.com, a leading global online food delivery platform committed to diversity, inclusion, and employee growth. Learn more about our culture and opportunities on our career site.