DevOps Engineer – Observability & Cloud Platform
Cloud, Automation & Observability
Role Overview: Responsible for designing, automating and operating secure multi-cloud infrastructure, CI/CD pipelines and observability platforms across Azure and AWS. The role works closely with the development, security and operations teams to improve delivery speed, platform reliability, cost efficiency and production visibility, with a strong focus on the Grafana observability stack.
KEY RESPONSIBILITIES
Cloud & Platform Engineering
•Design, build and maintain scalable, secure infrastructure across Azure and AWS, including networking, storage, identity, monitoring and managed platform services.
•Operate Docker and Kubernetes platforms, including AKS/EKS, cluster configuration, ingress, secrets/configuration, upgrades, deployment patterns and operational hardening.
•Implement cloud identity and access controls using Entra ID, AWS IAM, RBAC, least-privilege access and policy enforcement.
CI/CD, GitOps & Infrastructure as Code
•Design and maintain automated CI/CD pipelines using GitLab CI, Azure DevOps, Jenkins or GitHub Actions for build, test, approval and deployment across DEV/QA/PROD.
•Implement Infrastructure as Code with Terraform, including reusable modules, state management, environment promotion, peer review and auditable change control.
•Apply GitOps deployment patterns using Argo CD where appropriate, standardising promotion, drift detection, rollback and deployment visibility.
Observability & Grafana Stack
•Design and operate monitoring, logging, tracing, alerting and service-health reporting across infrastructure, Kubernetes, applications, APIs, queues and business-critical workflows.
•Build actionable dashboards and define telemetry standards using Grafana, Prometheus, Loki, Tempo, Mimir, Alloy, OpenTelemetry, Azure Monitor, Log Analytics, Application Insights and Datadog where applicable.
•Optimise PromQL, LogQL and TraceQL queries, improve alert quality, reduce noise and ensure alerts are based on real service impact.
Reliability, Security & Cost Management
•Participate in incident response, root-cause analysis and post-incident reviews using logs, metrics, traces, deployment history and system behaviour.
•Create runbooks, automated checks, remediation scripts and preventative controls; define and track SLIs and SLOs covering availability, latency and saturation.
•Embed security into delivery through Key Vault/secrets management, TLS/certificates, network security, vulnerability remediation, secure SDLC practices and compliance evidence; maintain cloud cost visibility.
REQUIRED SKILLS & EXPERIENCE
•Multi-cloud experience across Azure and AWS, including compute, networking, storage, identity, monitoring, security and managed services.
•CI/CD delivery using GitLab CI, Azure DevOps, Jenkins, GitHub Actions or similar, with Git-based workflows, release strategy and environment promotion.
•Terraform experience covering reusable modules, remote state, review/approval workflows, environment promotion and version-controlled infrastructure.
•Docker and Kubernetes experience, preferably AKS/EKS, including deployments, troubleshooting, ingress, secrets, upgrades and operational support.
•Experience with event management and incident management processes, including event correlation, alert triage, escalation workflows, incident logging, root-cause analysis, post-incident reviews and continuous improvement of operational response processes.
•Strong observability experience with several of: Grafana, Prometheus, Loki, Tempo, Mimir, Alloy, OpenTelemetry, Azure Monitor and Application Insights.
•Working knowledge of PromQL, LogQL, dashboard design, alert rules, telemetry pipelines, distributed tracing, structured logging and incident diagnostics.
•Scripting and automation using PowerShell, Bash, Python or similar.
•Security fundamentals including RBAC/IAM, Key Vault, certificates, network security, secrets hygiene, vulnerability remediation and least-privilege access.
DESIRABLE EXPERIENCE
•Understanding of distributed systems and microservice architectures, including service-to-service communication, APIs, asynchronous messaging, event-driven patterns, scalability and operational troubleshooting.
•Programming capability in at least one modern language, such as C#, with the ability to read, understand, debug and contribute to application code where required.
•Familiarity with software deployment strategies would be advantageous.
WHAT SUCCESS LOOKS LIKE
•Deployment processes are repeatable, automated, low-risk and supported by approvals, rollback strategies and consistent environments.
•Infrastructure changes are version-controlled, auditable and managed through Terraform with minimal manual drift.
•Reliability improves through better monitoring, stronger alert quality, effective runbooks and shorter MTTR.
•Engineering teams have clear dashboards and telemetry to understand service health, investigate issues and make informed decisions.
•Security posture improves through least-privilege access, secrets hygiene, policy compliance, certificate management and vulnerability remediation.