We're looking for a Splunk & OpenShift Observability Engineer to design, deploy, and optimise enterprise-grade monitoring across hybrid Kubernetes and OpenShift environments.
This is a high-impact role where you'll shape observability strategy, enhance service intelligence, and ensure platform reliability at scale while balancing performance, cost efficiency, and security governance.
You'll work at the intersection of platform engineering, observability, and service intelligence, helping to transform raw telemetry into actionable insight. You'll also have the opportunity to influence reliability strategy, improve operational maturity, and deliver measurable value across a modern cloud-native estate.
What You'll Be Doing
Design, deploy, and operate Splunk Enterprise and ITSI across hybrid Kubernetes/OpenShift platforms
Onboard and normalise data at scale (HTTP Event Collector (HEC), Universal Forwarder, Deployment Server), aligning to the Splunk Common Information Model (CIM)
Build and optimise ITSI service models: service trees, KPIs, adaptive thresholds, notable event aggregation policies (NEAPs), glass tables, deep dives, and health scoring
Deliver OpenShift-focused executive and operational dashboards covering cluster/API/etcd health; node readiness and resource pressure; pod restart trends and noisy-neighbour detection; network and storage error visibility; and capacity, quota, and burst analysis
Optimise search and platform performance (workload rules, data model acceleration (DMA), summary indexing, scheduling hygiene, concurrency tuning)
Implement intelligent alerting and automated routing into ITSM and ChatOps platforms, including enrichment, suppression windows, and maintenance scheduling
Govern data ingestion and security controls (RBAC, retention, PII handling, TLS, token governance, index and role mapping)
Integrate telemetry pipelines and adjacent systems, including OpenTelemetry, Prometheus, Fluentd/Fluent Bit/Vector, Kafka, CMDB, and AIOps/ML solutions
Drive SLO/KPI alignment, golden signal monitoring, rollout/rollback health validation, and executive reporting
What You'll Bring
Deep expertise in Splunk Enterprise (SPL mastery, CIM alignment, saved searches, macros, KV stores, index/retention/RBAC design, performance tuning)
Strong experience with Splunk ITSI (service trees, KPIs, adaptive/time-based thresholds, NEAP tuning, Service Analyzer configuration)
Proven OpenShift/Kubernetes observability experience across control-plane metrics, events, logs, workload correlation, and capacity management
Hands-on experience with telemetry pipelines (OpenTelemetry/OTLP, Prometheus exporters, Fluentd/Fluent Bit/Vector, Kafka with TLS, HEC/UF/DS onboarding)
Strong understanding of reliability engineering principles (golden signals, SLO design, namespace/application KPI mapping)
Experience optimising performance and licensing costs using workload rules, DMA, and summary indexing
Solid security and compliance knowledge (TLS/mTLS, certificate/token hygiene, PII controls, auditability, role/index mapping)
Automation and integration expertise across ITSM, ChatOps, webhooks, CMDB enrichment, and AIOps tooling