This role is an entry point into the SRE team. You’ll work directly alongside senior engineers, learning how we operate production environments, instrument systems with Datadog, and respond to incidents. From the start you’ll contribute to real work (monitoring customer environments, writing runbooks, supporting infrastructure changes), taking on progressively more ownership as your confidence and knowledge grow.
Responsibilities
- Monitor customer AWS and Azure environments using Datadog, learning to triage alerts, separate signal from noise, and escalate with context.
- Support incident response workflows alongside senior engineers, contributing to post‑mortem documentation and remediation tracking.
- Assist with Datadog onboarding and instrumentation for new customers: agents, integrations, dashboards, monitors, and log pipelines.
- Support infrastructure‑as‑code work (Terraform) for provisioning and configuration changes across customer accounts, under senior review.
- Write and maintain runbooks and operational documentation that is clear, accurate, and usable by anyone on the team at 3 am.
- Participate in proactive reliability reviews (alert tuning, capacity checks, dependency mapping) with guidance from senior engineers.
- Contribute to internal tooling and AI‑assisted automation initiatives as part of the wider engineering team.
- Communicate directly with customers on day‑to‑day operational queries with a professional, calm, and clear style.
Qualifications – Must Have
- A degree in Computer Science, Software Engineering, or a related technical discipline, or equivalent demonstrable self‑taught fundamentals.
- Comfort with scripting in Bash, Python, or similar; you’ve automated something, even if small.
- Understanding of core observability concepts: what metrics, logs, and traces are and what they tell you.
- Awareness of cloud fundamentals; you know what EC2, S3, VPCs, and load balancers do, even without production experience.
- Clear written and verbal communication; you’ll be in customer‑facing situations from early on.
- Right to work in the UK without sponsorship.
Qualifications – Nice to Have
- Any hands‑on Datadog experience, whether from a trial, a personal project, or a university lab.
- Terraform or any infrastructure‑as‑code exposure.
- Docker or Kubernetes; even containerising a personal project counts.
- A cloud certification (AWS Cloud Practitioner, Azure Fundamentals, or equivalent).
- Experience in a customer‑facing environment, even outside tech.
- Any personal projects involving monitoring, automation, or infrastructure.
Benefits
- 25 days’ holiday plus bank holidays, and a paid day off to take during your birthday month.
- Holiday grows with tenure: +1 day per year after your second work anniversary, up to 28 days total.
- Enhanced maternity pay: 26 weeks at your full basic salary.
- Enhanced paternity pay: 2 weeks at your full basic salary.
- Datadog, AWS, and Azure certifications paid for by the company; this is contractual, not discretionary.
- AI tooling certifications are also funded; staying current is part of the role.
- Flexible working requests from your first day of employment: a statutory right that we support in full.
- Company‑provided laptop and peripherals, set up before you start.
- On‑call allowance (in addition to base salary): SREs join a shared rota, typically one week in five or six, with the frequency reducing as the team grows. On‑call is paid at £500 per week, which works out at roughly £4.3k to £5.2k a year on top of salary, depending on rota size.
- Base salary dependent on experience (DOE).
- Remote‑first. UK‑based, async‑friendly.
Tech Stack
- Datadog: core observability platform.
- AWS: primary cloud, multi‑account.
- Azure: secondary cloud workloads.
- Terraform: infrastructure as code.
- GitHub Actions: CI/CD pipelines.
- Python / Bash: automation and tooling.