Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London
About the company
We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field.
We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.
Technical stack
* Managed Kubernetes (cloud-based)
* Custom autoscaling systems (Python / Go)
* Redis
* Distributed compute frameworks (e.g. Ray)
* Observability stack (OpenTelemetry-style)
* Infrastructure-as-code (Terraform, Helm)
* 50+ containerised evaluation environments
What you’ll do
Own the Kubernetes runtime for agent environments
* Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
* Operate and evolve a production Kubernetes platform supporting multi-hour or multi-day agent runs
Improve environment infrastructure for long-horizon training and evaluation
* Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
* Optimise GPU utilisation and scheduling for distributed workloads
* Design storage patterns for large datasets, model checkpoints, and ephemeral session state
* Improve environment bootstrap times and resource efficiency through image layering and caching strategies
Make observability excellent
* Implement metrics, logs, and traces that enable fast root-cause analysis
* Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
* Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues
Reliability engineering
* Design retry and backoff strategies for long-running agent sessions that may fail mid-execution
* Implement session recovery mechanisms such as checkpointing and idempotent operations
* Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
* Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
* Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
* Define and track SLOs for session creation latency, environment availability, and tool execution success rates
Security and sandboxing for tool-using agents
* Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
* Implement network policies to restrict outbound access from evaluation environments
* Design secrets management for API keys used by agent tools, including rotation and least-privilege access
* Build audit logging for tool invocations and filesystem access
* Implement rate limiting and circuit breakers for external API calls made by agents
Must-have experience
* Deep, hands-on production experience operating Kubernetes, including:
  * Resource requests and limits, affinity/taints, priorities, autoscaling, and preemption
  * Debugging networking, DNS, storage performance, and node health issues
* Strong distributed-systems fundamentals: idempotency, retries, failure domains, and incident response
* Practical observability experience with metrics, structured logging, and tracing
* Ability to build internal tools in Python and/or Go
* Infrastructure-as-code and automation experience (Helm, scripting, GitOps-style workflows; Terraform a plus)
* Experience using Redis for high-throughput, session-oriented workloads
Nice-to-have experience
* Experience with machine learning systems or language models
* Expertise in a specific infrastructure domain, such as:
  * ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
  * Building custom Kubernetes controllers, operators, or autoscalers
  * Sandboxing technologies for untrusted code execution
  * Distributed compute frameworks (e.g. Ray, Dask, Spark)
  * Deep expertise in container runtimes, Linux performance tuning, or networking
Compensation
* Competitive salary and meaningful equity
* Early-team impact with direct ownership and high leverage