Overview
Platform Engineer: Cloud & Infrastructure Automation and Observation
We are seeking a Platform Engineer to join our technology team and play a central role in managing and automating our hybrid cloud and on-premises infrastructure. Working closely with the Technology Director, Development and IT & Systems teams, you will help drive automation, reliability and operational excellence across the full technology estate. Our infrastructure operates across a hybrid model spanning multiple cloud providers and on-premises environments, supporting a fast-growing, high-volume e-commerce operation. You will champion Infrastructure as Code, build robust CI/CD and deployment pipelines, establish comprehensive observability, and drive the cultural shift towards modern DevOps practices across the engineering organisation.
Responsibilities
* Infrastructure Automation & Management
o Infrastructure as Code: Define, provision and manage cloud and on-premises infrastructure using IaC tools (CloudFormation, Terraform, Ansible or similar), eliminating manual configuration and ensuring repeatable, version-controlled environments
o Hybrid Cloud Management: Manage and optimise infrastructure across multiple cloud providers and on-premises environments, ensuring consistent governance, security and cost efficiency across the entire estate
o On-Premises & Local Infrastructure: Work alongside the IT & Systems team to manage local server infrastructure including Windows Server environments (Domain Controllers, Hyper-V, application and file servers), Linux systems and network security appliances; use IaC tools such as Terraform and Packer to automate the provisioning of local virtual machines and container clusters, ensuring local environments match production standards
o Infrastructure Lifecycle Management: Oversee server maintenance, security patching, storage provisioning and networking equipment management across both cloud and local infrastructure, ensuring consistent standards regardless of where workloads run
* CI/CD, Deployments & Release Engineering
o CI/CD Pipeline Development: Design, build and maintain continuous integration and deployment pipelines using GitHub Actions, Cloud Build and related tooling, enabling rapid, reliable releases across all environments
o Controlled Rollouts & Deployment Strategies: Implement blue-green deployments, canary releases and rolling updates for application, database and infrastructure changes, minimising disruption and enabling safe rollback
o Database Deployments: Manage and automate database schema migrations and deployments, ensuring zero-downtime releases through controlled rollout strategies
o Runtime Mitigation: Utilise tooling to patch or isolate vulnerable containers in production without interrupting service, enabling rapid response to security findings
o Build Reliability: Monitor pipeline health and implement automated alerting for build failures, ensuring the team addresses delivery blockers immediately
* Observability, Monitoring & Alerting
o Full-Stack Observability: Architect and maintain a comprehensive observability strategy across all systems, consolidating and extending existing monitoring infrastructure (Zabbix, CloudWatch) with modern tooling such as Grafana, Loki, New Relic or Datadog to ensure proactive alerting and full visibility
o Automated Incident Management: Set up integrations between monitoring tools and Jira Service Management to automatically generate incident tickets when production systems fail or breach performance thresholds, and automate ticket triage, prioritisation and escalation
o Pipeline & Build Alerting Workflows: Configure automation to raise Jira tasks or bugs when critical deployment pipelines fail, ensuring delivery blockers are tracked and resolved promptly
o Visibility & Reporting: Build dashboards and automated reporting for incident tracking, post-mortem outcomes and system health, providing transparency to engineering leadership
* Security & Vulnerability Management
o Cloud Security Posture: Maintain and enhance security tooling including GuardDuty, Security Hub, Macie and Inspector; manage secrets, IAM policies and network segmentation to ensure compliance with PCI-DSS and data protection requirements
o DevSecOps Integration: Integrate application security scanning tools such as Snyk into CI/CD pipelines, shifting security left and embedding vulnerability detection into the development workflow
* Reliability, Cost & Performance
o Reliability Engineering: Implement SLIs, SLOs and error budgets; design and conduct game days and disaster recovery exercises; lead incident response and blameless post-mortems to continuously improve system resilience
o Capacity Planning: Proactively manage capacity across non-autoscaling and autoscaling architectures, ensuring readiness for peak trading events (Black Friday, Cyber Monday, seasonal promotions) through load testing and performance benchmarking
o Cost Management: Monitor and optimise spend across cloud providers and local infrastructure, implementing tagging strategies, right-sizing recommendations and reserved/spot instance policies; work with IT & Systems to manage hardware lifecycle costs, storage provisioning and networking equipment budgets
* AI-Driven Operations & Proactive Optimisation
o AIOps & Intelligent Monitoring: Leverage AI-driven tools for anomaly detection, predictive alerting and proactive system optimisation, reducing mean time to detection and resolution
o AI-Enhanced CI/CD: Explore and implement AI-assisted pipeline optimisation, intelligent test selection and automated code quality analysis to accelerate delivery
o Resource & Cost Optimisation: Use AI-powered recommendations for infrastructure right-sizing, workload scheduling and cost forecasting across the hybrid estate
o Compliance & Data Hygiene: Apply AI tooling to automate compliance checks, configuration drift detection and data hygiene across environments
* Collaboration & Documentation
o Cross-Team Collaboration: Liaise closely with Development, IT & Systems and Data teams to ensure system uptime, support deployment workflows, unblock developer productivity and align infrastructure decisions with business objectives
o Documentation & Knowledge Sharing: Maintain comprehensive runbooks, architecture documentation and disaster recovery plans; champion DevOps best practices and mentor team members on infrastructure tooling and processes
Required Skills & Experience
* Core Technical Skills
o 5+ years' commercial experience in a DevOps, Site Reliability or Infrastructure Engineering role within a hybrid cloud and on-premises environment
o Extensive hands-on experience with AWS services including EC2, RDS, ECS, ElastiCache, S3, CloudFront, Route 53, Lambda, IAM, VPC networking and WAF
o Strong understanding of cloud billing models, cost allocation and optimisation strategies
o Proficiency with AWS CloudFormation for infrastructure provisioning; experience with Terraform or Pulumi is a plus
o Experience with container orchestration using ECS and/or Kubernetes
o Extensive experience with Docker, including containerisation, image management and multi-stage builds
o Experience with Cloudflare or similar CDN, edge security and DNS management services
o Proficiency in Linux administration (Ubuntu, CentOS/RHEL) with solid understanding of networking fundamentals: DNS, load balancing, VPNs, firewalls, subnets, NAT and VPC peering
* Languages & Scripting
o Python: Essential. Used extensively for writing custom security and automation tooling, interacting with cloud APIs (Boto3), log analysis and DevSecOps workflows
o Bash/Shell: Essential. Required for writing Docker container entrypoint scripts, automating Linux server tasks and managing CI/CD runner environments
* CI/CD & Deployment
o Experience building and maintaining CI/CD pipelines with GitHub Actions, Jenkins, GitLab CI or similar
o Proven experience implementing blue-green deployments, canary releases and controlled rollout strategies for application, database and infrastructure changes
o Version control best practices with Git, including branching strategies and code review workflows
* Security & Compliance
o Experience with cloud security tooling including GuardDuty, Security Hub, Macie and Inspector
o Familiarity with application security scanning tools such as Snyk or equivalent
o Working knowledge of IAM best practices, secrets management, network segmentation and encryption at rest and in transit
o Awareness of PCI-DSS requirements in an e-commerce context
* Observability & Reliability
o Proven experience implementing monitoring, logging and alerting using tools such as New Relic, Grafana, Loki, CloudWatch, Prometheus or Datadog
o Understanding of SRE principles: SLIs, SLOs, error budgets, incident management and blameless post-mortems
o Experience with log aggregation and analysis (ELK/OpenSearch, CloudWatch Logs)
o Experience integrating monitoring and CI/CD tooling with service management platforms (e.g. Jira Service Management) for automated ticket creation and incident workflows
* Desirable Skills
o Experience with systems programming languages such as Go
o Experience with Microsoft Azure, ideally including Virtual Machines, SQL Server and Active Directory
o Experience with Google Cloud Platform, ideally including Cloud Functions, Cloud Scheduler, Pub/Sub and Cloud Build
o Familiarity with Windows Server administration (Domain Controllers, Hyper-V, Group Policy); candidates without this experience should be comfortable learning it as part of the role
o Familiarity with PowerShell for Windows automation
o Experience with database administration across MySQL, MariaDB, PostgreSQL and SQL Server, including replication and failover strategies
o Exposure to Elasticsearch/OpenSearch cluster management
o Experience managing Redis/ElastiCache clusters for caching and session management
o Experience with AI-driven operations tooling (AIOps, ChatOps, intelligent monitoring)
o Experience with high-volume e-commerce environments and peak traffic management
o Mathematics or Computer Science degree (or equivalent experience)