What you’ll be doing
1. Lead a team of technical engineers to manage the full AI/ML application lifecycle across test/preprod/prod environments, ensuring repeatable, reliable releases.
2. Implement and mature an MLOps framework covering code/data/model versioning, automated testing, release governance, rollback strategies and environment promotion controls.
3. Own production readiness for AI/ML workloads: SLOs, runbooks, operational dashboards, support processes, incident response and post‑incident RCA improvements.
4. Design and operate CI/CD for ML solutions using patterns such as SageMaker model registry, controlled approvals and secure promotion of model artefacts through environments.
5. Develop a deep understanding of the underlying use case and the data used to develop and train the models.
6. Implement model monitoring (e.g. data quality, model quality, bias drift, feature attribution drift) and alerting that drives automated responses such as retraining triggers and controlled redeployments.
7. Put in place drift detection, evaluation routines, and model performance reporting; partner with data science to define thresholds, baselines and acceptance criteria.
8. Establish operational controls for agentic systems, such as policy boundaries, auditing of tool usage, quality evaluation and performance monitoring, aligned to enterprise requirements.
9. Support production operations of generative AI applications using Amazon Bedrock and Amazon Bedrock AgentCore capabilities to deploy and operate agents securely at scale, with strong governance.
10. Design and implement end‑to‑end observability for serverless services (e.g., Lambda, Step Functions, EventBridge, APIs), including structured logs, metrics, distributed traces, dashboards, alerting and correlation across workflows.
11. Monitor agent behaviour, token usage/cost trends, latency, workflow health and security access patterns; drive continuous improvement and cost optimisation with FinOps-aligned reporting.
12. Define standards for documentation, change management and quality gates that reduce MTTR and improve platform reliability.
The skills you’ll need
Cloud Deployment, Cloud Strategy, IT Service Delivery, Cloud Security, Cloud Architecture/Design, Cloud Migration, Virtualisation, Agile Methodologies, Cloud Operations, Continuous Integration/Continuous Deployment Automation & Orchestration, Cloud Storage, Software Development Lifecycle, Project/Programme Management, Talent Management, Decision Making, Growth Mindset, Performance Management, Inclusive Leadership
Our leadership standards
Looking in:
Leading inclusively and safely
I inspire and build trust through self-awareness, honesty and integrity.
Owning outcomes
I take the right decisions that benefit the broader organisation.
Looking out:
Delivering for the customer
I execute brilliantly on clear priorities that add value to our customers and the wider business.
Commercially savvy
I demonstrate strong commercial focus, bringing an external perspective to decision-making.
Looking to the future:
Growth mindset
I experiment and identify opportunities for growth for both myself and the organisation.
Building for the future
I build diverse future-ready teams where all individuals can be at their best.