
Graduate Engineer: AI Tooling and Site Reliability

Cardiff
Critical Cloud Limited
Engineer
€37,500 a year
Posted: 17h ago
The role

We're building an internal AI platform from scratch: the tooling that will define how Critical Cloud operates as we scale across Europe. This isn't a rotation or a shadow programme. From week one you'll be shipping real tooling and operating real production environments for real customers. The two tracks exist because they make each other better. That's the design.

About the Role

From week one, you'll contribute to both tracks: shipping AI tooling that helps us run cloud operations better, and operating real production infrastructure for real customers. Two disciplines, one engineer, no silos.

Critical Cloud is the world's first "Powered by Datadog" accredited MSP, a Datadog-native cloud MSP built for European tech‑led SMBs. We're building an internal AI platform (the Critical Cloud Platform) to automate and augment how we operate customer environments. This role sits at the centre of that programme.

Half your time will be engineering AI‑assisted tooling: LLM integrations, agents, and automation workflows that reduce toil and improve our operational quality. The other half will be hands‑on SRE work: monitoring, incident support, infrastructure‑as‑code, and customer‑facing operations. Each half makes you better at the other.

What You’ll Do

AI Tooling Track

  • Build and iterate on AI‑assisted automation workflows using LLM APIs (Claude, OpenAI) integrated with cloud and observability tooling
  • Develop tooling for automated infrastructure discovery, customer onboarding, and operational runbook generation
  • Contribute to the Critical Cloud Platform: our internal AI governance framework and agent operating model
  • Design and implement MCP (Model Context Protocol) integrations connecting AI agents to Datadog, AWS, and Azure APIs
  • Write evaluation harnesses and regression tests to keep AI tool output reliable and auditable
  • Document AI system behaviour against our constitutional operating framework and ISO 27001 controls

Site Reliability Track

  • Monitor and triage alerts across customer AWS and Azure environments using Datadog as the primary observability platform
  • Support incident response workflows and contribute to postmortem documentation alongside the SRE team
  • Support Datadog onboarding for new customers: instrumentation, dashboards, monitors, and SLO configuration
  • Write and maintain Terraform modules for infrastructure provisioning and change management
  • Produce and maintain operational runbooks, escalation guides, and change records to ISO 27001 standards
  • Contribute SRE context back into AI tooling: you'll know what's worth automating because you've done it manually

Requirements

  • A degree in Computer Science, Software Engineering, or a related technical field (2:1 or above)
  • Solid Python: comfortable writing scripts, working with APIs, and handling structured data
  • Familiarity with cloud fundamentals (AWS or Azure), ideally through coursework, personal projects, or placement
  • Experience consuming REST APIs or LLM APIs, whether through a project, dissertation, or side work
  • Clear written communication: you'll be writing docs and talking to customers

Nice to Have

  • Hands‑on LLM work: prompt engineering, tool use, agent frameworks, or evaluation pipelines
  • Terraform or any IaC tooling (even tutorials count)
  • Datadog experience, even a free tier account you've played with
  • Kubernetes or containerised workload exposure
  • Any cloud or AI certification (AWS, Azure, Google, or Datadog)
  • A GitHub profile with something worth showing us

AI & Automation

  • Claude / Anthropic API – Primary LLM platform
  • Datadog – Core observability platform
  • AWS – Primary cloud, multi‑account
  • Azure – Secondary cloud workloads
  • Terraform – Infrastructure as code
  • GitHub Actions – CI/CD pipelines

  • Start
  • Year 1–2
  • Year 2–3: Engineer II – Specialise or Broaden
  • Year 3+: Senior / Lead – Platform or SRE

Benefits

  • 25 days holiday plus bank holidays, and a paid day off to take during your birthday month
  • Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total
  • Enhanced maternity pay: 26 weeks at your full basic salary
  • Enhanced paternity pay: 2 weeks at your full basic salary
  • Datadog, AWS, Azure, and AI tooling certifications paid for by the company: a contractual obligation, not a discretionary budget
  • Flexible working requests from your first day of employment: a statutory right, supported in full
  • Company‑provided laptop and peripherals, set up before you start

Who Thrives Here

The ideal candidate doesn't have to choose between writing code and running infrastructure. They're curious about both and understand that the two inform each other. You'll build AI tooling that automates real operational problems precisely because you've experienced those problems hands‑on in the SRE track.

We operate to ISO 27001. Everything we build, including AI systems, has to be explainable, auditable, and consistent with our governance framework. If you care about building AI tools that are reliable, not just impressive demos, you'll fit right in.

This is an early career role, but we don't run it like one. You'll have genuine ownership, direct access to founders, and the chance to shape a platform that will define how Critical Cloud operates at scale.

How We Work

Own the Problem

When something breaks in a customer environment, you take it through to resolution and document it properly. Not "I raised a ticket." Not "I told the senior." You own it.

Stay Curious

The AI tooling track exists because engineers asked "what if we automated that?" This role rewards people who look at repetitive manual work and immediately start thinking about whether they could build their way out of it.

The worst automation is the one nobody trusts because it's too complicated. Build for the on‑call engineer picking it up at 3am without context. A runbook anyone can follow is worth more than one only you understand.

Be Resourceful

You’ll hit problems on both tracks where the answer isn't in a tutorial. The engineers who thrive here figure things out, with what they have, in the time they have, to the standard required.
