Responsibilities
* Use Terraform/CloudFormation to deploy reusable modules for all AWS components (MSK, Flink, EMR, S3, DynamoDB, Aurora, EC2, Neptune, SageMaker, networking, IAM).
* Implement automated CI/CD pipelines using GitLab/Jenkins for IaC, services, data pipelines, ML models, and inference deployments.
* Enforce deployment safety through approvals, unit/integration tests, automated rollbacks, and compliance checks.
* Implement multi-AZ VPC architectures with private subnets, NAT, routing, endpoint policies, and secure on-prem connectivity (Direct Connect/VPN).
* Engineer governed multi-layer data storage using S3 (raw/curated), DynamoDB (metadata/state), Aurora/RDS (relational stores), Redshift/Iceberg (analytical stores).
* Apply advanced AWS security practices including IAM least-privilege, KMS encryption, secrets management, CloudTrail governance, Config rules, GuardDuty/Security Hub integration.
* Build end-to-end observability across pipelines using CloudWatch metrics/logs/alarms, OpenTelemetry for distributed tracing (where applicable), and data quality and pipeline SLIs/SLOs.
* Develop runbooks, playbooks, synthetic tests, chaos testing, failover simulations, and incident triage workflows.
* Operate Amazon Neptune graph clusters for network topology representation.
* Promote software releases to pre-production and production environments and ensure a smooth transition to ASG.
* Define standards for documentation, change management and quality gates that reduce MTTR and improve platform reliability.
Qualifications
* Full understanding of AWS resilience patterns, AWS Backup, and incident response with break-glass approaches.
* Full detailed knowledge of Lambda (runtimes, packaging, storage, concurrency, retries and dead-letter queues).
* Extensive hands-on experience with AWS CLI and working knowledge of Python.
* Strong AWS experience, particularly with compute, database, and streaming services such as MSK, Flink, S3, DynamoDB, Aurora/RDS, Redshift, and Lambda.
* Sound understanding of VPCs, route tables, IGW, NAT, ACLs, Transit Gateway, DNS, and VPC endpoints.
* Experience designing observability for serverless systems (logs/metrics/traces) and implementing distributed tracing and dashboards using open standards and AWS tooling like CloudWatch.
* Understanding of how IAM roles and policies work.
* Ability to perform deep dives using CloudWatch and X-Ray for troubleshooting (covering logs, metrics, alerts, traces, filters, agents, etc.).
Desirable Skills
* Familiarity with event-driven architectures (Step Functions, EventBridge, queues), and shift-left quality practices (automated testing, policy-as-code, guardrails).
* Comfortable leading incident deep-dives and contributing to permanent fixes.
* Relevant AWS certifications.
* Pragmatic engineer who automates wherever possible to reduce toil.
* Strong analytical thinking and structured problem-solving skills.
Benefits
* 10% on-target annual bonus.
* Access to an online private GP 24/7 for you and your immediate family.
* Market-leading paid carer's leave with up to 2 weeks off.
* Equalised maternity, paternity, and adoption leave - 18 weeks full pay and 8 weeks half pay.
* Discounted EE and BT products, including mobile and broadband.
* Market-leading pension scheme - 5% from you and 10% from us.
* Holiday purchase scheme.