Senior Infrastructure Engineer | Kubernetes | Docker | Terraform | Python | GPU | Onsite, London

Slough
Enigma
Infrastructure engineer
Posted: 29 April
Offer description


About the company

We are an early-stage AI company building infrastructure for long-horizon reinforcement learning: agents that operate for extended periods and execute tools within high-fidelity environments. The team has deep experience in large-scale AI systems and open-source ML, and the company is well funded by experienced operators and technical leaders in the field.


We build environment infrastructure to train and evaluate agents on frontier tasks such as automated research and scientific discovery. Our customers include leading AI research organisations and fast-growing, AI-native startups.


Technical stack

* Managed Kubernetes (cloud-based)
* Custom autoscaling systems (Python / Go)
* Redis
* Distributed compute frameworks (e.g. Ray)
* Observability stack (OpenTelemetry-style)
* Infrastructure-as-code (Terraform, Helm)
* 50+ containerised evaluation environments


What you’ll do

Own the Kubernetes runtime for agent environments

* Own scheduling, lifecycle management, stability, and operations for long-running, failure-prone workloads
* Operate and evolve a production Kubernetes platform supporting multi-hour or multi-day agent runs


Improve environment infrastructure for long-horizon training and evaluation

* Maintain a large suite of containerised evaluation environments (ML benchmarks, code execution, scientific tasks) with fast cold-start times
* Optimise GPU utilisation and scheduling for distributed workloads
* Design storage patterns for large datasets, model checkpoints, and ephemeral session state
* Improve environment bootstrap times and resource efficiency through image layering and caching strategies


Make observability excellent

* Implement metrics, logs, and traces that enable fast root-cause analysis
* Build dashboards and alerting tied to SLOs (e.g. rollout success rate, environment health, tool latency, queue time)
* Create debugging playbooks for common failure modes such as OOMs, memory leaks, performance regressions, and network or storage issues


Reliability engineering

* Design retry and backoff strategies for long-running agent sessions that may fail mid-execution
* Implement session recovery mechanisms such as checkpointing and idempotent operations
* Build graceful degradation paths for node failures, OOMs, and GPU errors without losing progress
* Create runbooks for common failure modes (e.g. sidecar health timeouts, stream lag, pod eviction cascades)
* Develop chaos-testing strategies for multi-hour runs (network partitions, node drains, API rate limits)
* Define and track SLOs for session creation latency, environment availability, and tool execution success rates


Security and sandboxing for tool-using agents

* Harden container isolation for untrusted code execution (e.g. sandboxed runtimes or microVM-based approaches)
* Implement network policies to restrict outbound access from evaluation environments
* Design secrets management for API keys used by agent tools, including rotation and least-privilege access
* Build audit logging for tool invocations and filesystem access
* Implement rate limiting and circuit breakers for external API calls made by agents


Must-have experience

* Deep, hands-on production experience operating Kubernetes, including:
  * Resource requests and limits, affinity/taints, priorities, autoscaling, and preemption
  * Debugging networking, DNS, storage performance, and node health issues
* Strong distributed-systems fundamentals: idempotency, retries, failure domains, and incident response
* Practical observability experience with metrics, structured logging, and tracing
* Ability to build internal tools in Python and/or Go
* Infrastructure-as-code and automation experience (Helm, scripting, GitOps-style workflows; Terraform a plus)
* Experience using Redis for high-throughput, session-oriented workloads


Nice-to-have experience

* Experience with machine learning systems or language models
* Expertise in a specific infrastructure domain, for example:
  * ML or reinforcement learning training infrastructure (checkpointing, distributed training, GPU scheduling)
  * Building custom Kubernetes controllers, operators, or autoscalers
  * Sandboxing technologies for untrusted code execution
  * Distributed compute frameworks (e.g. Ray, Dask, Spark)
  * Container runtimes, Linux performance tuning, or networking


Compensation

* Competitive salary and meaningful equity
* Early-team impact with direct ownership and high leverage

