Role Title: Platform / SRE Engineer
Location: Sheffield (3 days a week onsite is mandatory)
Duration: Until 30/11/2026
Rate: £525 per day via umbrella company
Role Description:
Own deployment, observability, reliability, cost control, and production operations for the AI helpdesk platform.
Key responsibilities
* Build and manage CI/CD pipelines, infrastructure, and runtime environments for AI services.
* Deploy and operate model-serving, orchestration, and application workloads.
* Implement monitoring, tracing, alerting, logging, and operational dashboards.
* Manage scaling, release processes, rollback mechanisms, and production support.
* Optimize inference cost, latency, uptime, and system reliability.
* Create runbooks, incident response processes, and operational standards.
Required skills
* Strong experience in DevOps and SRE.
* Experience with Docker, Kubernetes, cloud platforms, and infrastructure as code.
* Experience with monitoring and observability tools.
* Familiarity with CI/CD, release automation, secrets management, and production support.
* Understanding of LLM deployment patterns and API-based model integration.
* Experience with cloud platforms, particularly AWS.
* Experience with Jira, Confluence, and ServiceNow.
Preferred
* Experience supporting AI/ML workloads in production.
* Experience with GPU workloads, autoscaling, and cost optimization.