AI Agent Reliability Engineer - Chaps

London

Craft Docs Limited, Inc.

Reliability engineer

Posted: 30 September

Offer description

About Craft & Chaps

At Craft, we rethink productivity from first principles. Our products disappear into the background so people can do their life's work fast, joyfully, and without friction.

Chaps is our new AI-first product, focused on turning a constellation of large-language-model agents into a seamless personal productivity assistant.

About the role

Our AI Product team is looking for an engineer who obsesses over making multi-agent systems robust, observable, and continuously improving. You'll build the test harnesses, evaluation pipelines, and monitoring layers that keep dozens of collaborating agents on-task, on-budget, and on-time.

In practice, that means:
* Designing automated evals that exercise complete agent workflows, catching regressions before they reach users.
* Instrumenting every prompt, tool-call, and model hop with rich telemetry so we can trace root causes in minutes, not days.
* Creating feedback loops that turn logs, user ratings, and synthetic tests into better prompts and safer behaviors.
* Future-proofing agentic systems by allowing quality to evolve with LLM intelligence.

You will partner with product, research, and infra to ship an AI assistant users can trust: no surprises, no downtime.

What we're looking for

You must have:
* Hands-on experience with LLM evaluation frameworks (e.g., OpenAI Evals, LangSmith, LLM-Harness) and a track record of turning eval results into product-ready gating.
* Observability chops: you've wired up tracing/metrics for distributed systems (OpenTelemetry, Prometheus, Grafana) and know how to set SLOs that actually matter.
* Prompt-engineering fluency (few-shot, function calling, RAG orchestration) and an instinct for spotting ambiguity or jailbreak vectors.
* Production-grade Python/TypeScript skills and comfort shipping through CI/CD (GitHub Actions, Terraform, Docker/K8s).
* A bias for experimentation: you automate A/B tests, cost-latency trade-off studies, and rollback safeguards as part of the dev cycle.

It would be great if you have:
* Experience scaling multi-agent planners or tool-using agents in real products.
* Familiarity with vector databases, semantic diff tooling, or RLHF/RLAIF pipelines.
* A knack for weaving human feedback (support tickets, thumbs-downs) into automated regression tests.

Our Culture

* Think differently. We value novel ideas over legacy playbooks, and we give you room to explore.
* People first. You instrument systems so users never feel the bumps; you collaborate so teammates never feel stuck.
* Pragmatic craftsmanship. We ship fast, but we measure twice: data accuracy, latency budgets, and reliability all matter.
* Clear communicators. You translate metrics into stories that product managers and designers understand, sparking better decisions.

Join us if you want to make AI that works: every request, every time.
