OpenShift Platform Lead
Job Information
Job Title: OpenShift Platform Lead - Virtualization Services
Job Summary
We are seeking an experienced OpenShift Platform Lead to own and manage our OpenShift-based virtualization platform, which delivers enterprise VM hosting services. This role is responsible for the complete lifecycle management of the platform, including design, architecture, business-as-usual (BAU) operations, patching, upgrades, incident response, and ongoing stability improvements.
You will lead the platform implementation, work closely with SRE and operations teams, and enable seamless VM migration from legacy infrastructure. This is a hands-on technical leadership role that requires deep OpenShift expertise and the ability to balance operational excellence with strategic platform evolution.
Key Responsibilities
Platform Leadership & Strategy
* Own the technical strategy and roadmap for the OpenShift Virtualization platform
* Define platform architecture, design patterns, and technical standards
* Lead platform lifecycle management, including minor-version (4.y) and patch (4.y.z) updates and Red Hat Enterprise Linux CoreOS (RHCOS) updates
* Drive platform stability improvements and performance optimization initiatives
* Establish platform governance, compliance, and security policies
* Build relationships with Red Hat support and leverage Technical Account Management (TAM)
Lifecycle & Operations Management
* Manage the complete platform lifecycle, from installation through upgrades to decommissioning
* Plan and execute OpenShift platform upgrades (4.x releases) with zero/minimal downtime
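* Coordinate regular (monthly/quarterly) Red Hat CoreOS (RHCOS) patching cycles, which are delivered through OpenShift cluster updates and rolled out by the Machine Config Operator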
* Oversee OpenShift Virtualization operator upgrades and feature enablement
* Maintain platform health through proactive monitoring and capacity planning
* Ensure the platform meets defined SLAs and availability targets (99.9%+)
Incident & Event Management
* Lead Major Incident response for platform-level issues (Sev 1/2)
* Perform root cause analysis (RCA) and implement preventive measures
* Collaborate with SRE team on incident postmortems and improvement plans
* Manage platform-related events including maintenance windows
* Coordinate emergency changes and rollback procedures
* Participate in on-call rotation for critical platform escalations
Change Implementation & Release Management
* Review and approve platform changes through the Change Advisory Board (CAB)
* Plan and execute complex platform changes with risk assessment
* Implement infrastructure-as-code (IaC) practices using Ansible and Terraform
* Drive GitOps adoption for platform configuration management
* Coordinate release windows for platform updates with business stakeholders
* Ensure change documentation and runbook accuracy
VM Migration & Workload Onboarding
* Lead VM migration strategy from VMware/legacy platforms to OpenShift Virtualization
* Design VM migration runbooks and automation workflows
* Create and maintain VM templates, golden images, and standardized configurations
* Enable application teams to provision VMs through self-service
* Troubleshoot VM performance, networking, and storage issues
* Optimize VM placement, resource allocation, and cluster balancing
Platform Stability & Performance
* Define and monitor key performance indicators (KPIs) for platform health
* Implement chaos engineering practices to validate platform resilience
* Tune OpenShift control plane and worker node performance
* Optimize storage performance (ODF/Ceph) for VM workloads
* Configure network policies and OVN-Kubernetes for optimal VM networking
* Drive continuous improvement initiatives based on operational metrics
Required Qualifications
Must-Have Skills & Experience
Experience Requirements:
* 8-12 years of overall IT infrastructure experience
* 5+ years of hands-on experience with Red Hat OpenShift Container Platform (4.x)
* 3+ years of experience with OpenShift Virtualization (KubeVirt) or similar VM hosting platforms
* 3+ years of experience in platform/infrastructure leadership roles
* 2+ years of experience with Red Hat Enterprise Linux (RHEL 7/8/9) and Red Hat CoreOS (RHCOS)
Technical Skills:
* Expert-level OpenShift administration (oc CLI, Web Console, API)
* Advanced OpenShift Virtualization knowledge (VMs, DataVolumes, CDI, live migration)
* Advanced Red Hat CoreOS and Machine Config Operator (MCO) experience
* Advanced Linux administration and troubleshooting (RHEL-based)
* Advanced storage management (ODF/Ceph, Storage Classes, PV/PVC, CSI drivers)
* Advanced networking (OVN-Kubernetes, Multus, Network Policies, SDN concepts)
* Advanced automation skills (Ansible, Bash scripting, Python)
* Intermediate Kubernetes concepts (Operators, Custom Resources, Pod lifecycle)
* Intermediate Infrastructure-as-Code (Terraform, GitOps tools such as Argo CD/Flux)
* Intermediate observability platforms (Prometheus, Grafana, Alertmanager)
Platform Operations:
* Proven experience managing platform lifecycle (installation, upgrades, patching)
* Strong incident management and major incident response experience
* Experience with change management processes and release coordination
* Demonstrated ability to perform root cause analysis and implement preventive measures
* Experience with capacity planning and performance tuning
* Track record of driving platform stability improvements
Certifications Required (one or more):
* Red Hat Certified Engineer (RHCE)
* Red Hat Certified Specialist in OpenShift Administration
* OR equivalent demonstrable experience
Desirable Skills & Experience
Highly Desirable:
* Red Hat Certified Architect (RHCA) certification
* Red Hat Certified Specialist in OpenShift Virtualization
* Experience with Red Hat Advanced Cluster Management (RHACM)
* Experience with Red Hat Advanced Cluster Security (RHACS/StackRox)
* GitOps and CI/CD expertise (Argo CD, Flux, Tekton)
* Chaos engineering experience (Litmus, Chaos Mesh)
* Experience with OpenShift on multiple infrastructures (bare metal, VMware, AWS, Azure)
Nice to Have:
* Certified Kubernetes Administrator (CKA) or Certified Kubernetes Security Specialist (CKS)
* Experience with multi-tenancy and namespace isolation strategies
* Knowledge of compliance frameworks (PCI-DSS, HIPAA, SOC 2, ISO 27001)
* Experience with backup solutions (Kasten K10, Veeam, Commvault)
* Programming skills in Go, Python, or Java
* Experience with hybrid/multi-cloud architectures
* ITIL 4 Foundation certification
Key Success Metrics
* Platform availability: 99.9%+ uptime
* Successful upgrade completion rate: 100% with zero unplanned rollbacks
* Incident management: 98% first-time fix rate, with MTTR measured against agreed SLA targets
Work Environment
* Some evening and weekend work required for maintenance windows
* Availability 24x7 during major incidents