We are seeking a highly experienced Kubernetes DevOps Engineer to design, build, scale, and operate our IKP infrastructure platform. This role is responsible for ensuring system reliability, scalability, availability, and performance across production environments.
The ideal candidate will bring 7–10 years of experience in cloud-native infrastructure, deep Kubernetes expertise, and hands-on experience operating workloads in Google Cloud Platform (GCP) environments. You will play a key role in automation, incident management, reliability engineering, and continuous platform improvement.
Key Responsibilities
* Own and ensure the reliability, availability, scalability, and performance of the Kubernetes-based infrastructure platform.
* Design, implement, and maintain production-grade Kubernetes clusters (preferably in GCP).
* Collaborate with engineering teams to diagnose, troubleshoot, and resolve infrastructure and application issues.
* Lead incident response, perform root cause analysis (RCA), and implement preventive reliability improvements.
* Develop and maintain infrastructure as code (IaC) solutions to support scalable and repeatable deployments.
* Automate operational processes to reduce manual intervention and improve system resilience.
* Implement and manage observability frameworks, including monitoring, logging, and alerting solutions.
* Support CI/CD integrations and platform enhancements.
* Participate in on-call rotations, including weekend support as required.
Required Experience & Qualifications
* 7–10 years of experience in infrastructure engineering, DevOps, or Site Reliability Engineering roles.
* Strong hands-on experience with Kubernetes architecture, administration, and operations in production environments.
* Proven experience working in Google Cloud Platform (GCP), including services such as:
  * Compute Engine
  * IAM
  * Cloud Monitoring and Logging
* Strong understanding of containerization technologies (Docker) and orchestration principles.
* Experience implementing and managing Infrastructure as Code (Terraform, Deployment Manager, or similar tools).
* Proficiency in observability tools such as Prometheus, Grafana, Cloud Monitoring (formerly Stackdriver), or equivalent.
* Experience with CI/CD pipelines and automation frameworks.
* Solid troubleshooting and performance tuning expertise in distributed systems.
* Strong analytical mindset with a proactive approach to reliability and risk mitigation.
* Ability to work collaboratively in cross-functional teams.
Preferred Qualifications
* Experience with service meshes (Istio, Linkerd, or similar).
* Experience implementing SLOs, SLIs, and error budgets.
* Knowledge of security best practices in cloud-native environments.
* Relevant certifications such as:
  * Google Professional Cloud DevOps Engineer