Lead Site Reliability Engineer (Snowflake/Terraform/Linux)

Cardiff

Partnerize

Site reliability engineer

€80,000 a year

Posted: 8h ago

Offer description

Who We Are

At Partnerize, we're on a mission to transform the way businesses grow. We've built the leading partnership automation platform that empowers brands to discover, engage, and convert their audiences at scale. From affiliate marketing to influencer collaborations, we help our clients build and manage profitable partnerships that drive real results. We're a team of passionate problem‑solvers who are dedicated to helping our clients win in the ever‑evolving world of digital marketing.

Why Join Us

We're looking for passionate, talented people who want to be part of a winning team. At Partnerize, you'll find a culture of collaboration, innovation, and respect. We're guided by our core values, and we're committed to creating an environment where everyone can do their best work. We also offer a competitive salary, generous benefits, and a flexible work environment that allows you to thrive both personally and professionally. If you're ready to grow your career and make a difference, we'd love to hear from you.

Job Summary

This is an exciting time to join Partnerize. We are at a pivotal point in our technical evolution, looking to significantly expand our technical estate, scale the platform, and replace legacy systems with modern solutions. You will play a vital role in the ongoing operationalisation and management of our entire platform portfolio across our on‑prem datacentres and AWS: Partnerize, BrandVerity, Ascend, and our recent acquisition, Konnecto. While there will be a key focus on integrating and supporting Konnecto's advanced data and AI layers, a critical pillar of your mission will be spearheading the development of our enterprise on‑prem containerisation solution. This initiative is designed to shift our engineering culture towards a "you build it, you own it" model. By providing robust, automated container platforms, you will empower our Engineering teams to deploy quickly and independently, significantly reducing the bottleneck created by relying solely on the TechOps department.

We are looking for a Lead SRE who is both a deep technical expert and a capable mentor. In this role, your primary responsibility is ensuring our diverse, hybrid systems remain available, scalable, and secure. You will act as an authoritative Subject Matter Expert (SME), championing developer autonomy, driving IT systems security policies, and working closely with the security compliance team to protect our platforms from threats while driving continuous integration and delivery. This role reports to the SRE and Application Manager, providing them with technical guidance and recommendations, and acts as the technical lead for the SRE team.

The Team

You will be responsible for ensuring the continuous development and progression of team members. We are looking for a player/coach who can mentor, empower, and up‑skill talent. We have a mix of technical generalists, specialists, and junior engineers; you will help identify their strengths and constructively develop areas of weakness, guiding their technological career paths as we transition to a DevOps‑centric operating model.

The Operational Reality

You will operate in a fast‑paced, high‑velocity environment where your work directly and visibly shapes the company's architectural future. This requires a highly adaptable and pragmatic leader who can balance strategic project delivery with hybrid‑estate maintenance. By applying modern incident management frameworks to troubleshooting and ticket management, you are responsible for ensuring all issues across our estate are addressed decisively and efficiently.

As a Lead SRE, You Will:

* Strategic & Operational Management
o Developer Empowerment & Containerisation
+ Collaborate on the design, build, and rollout of a robust containerisation strategy (Kubernetes/Docker). Your goal is to assist in delivering a platform that enables Engineering teams to take full ownership of their code from build to deployment.
o Reliability & Error Budgets
+ Define Service Level Indicators (SLIs), set Service Level Objectives (SLOs), and manage Error Budgets to balance feature velocity with platform stability.
o Hybrid Platform Engineering & Konnecto
+ Build software and systems to manage platform infrastructure across on‑prem and AWS. Take the lead technical role in integrating and modernising Konnecto's architecture, ensuring its data ingestion and AI logic layers scale securely.
o FinOps / Cloud Cost Optimisation
+ Manage, monitor, and optimise cloud infrastructure spend across our hybrid environments, ensuring architectural decisions are both highly performant and cost‑effective.
o CI/CD Pipeline Responsibility
+ Own the continuous integration and delivery pipelines, and improve them continually to support rapid engineering velocity.
* People Leadership & Talent Development
o Mentorship
+ Deliver coaching sessions to the team and individuals, acting as a technical escalation point and fostering a culture of knowledge sharing.
o Workload Management
+ Scope the work coming into the SRE team, prioritise hybrid‑estate maintenance vs. project delivery, and delegate tasks to team members to ensure prompt resolution.
* Security & Architecture
o Design & Threat Modelling
+ Produce production‑grade application security designs. Perform design reviews and threat modelling of our services and products.
o Security Strategy
+ Drive improvements to Partnerize platforms' security through strategic planning, vulnerability assessments, and security testing.
* Incident Management & Toil Reduction
o Toil Reduction Champion
+ Drive automation, continually identifying manual, repetitive operational work and engineering it out of existence.
o Post‑Mortems & Escalation
+ Act as the ultimate escalation point for complex support incidents, participate in the On‑Call rotation, lead blameless post‑mortems, conduct Root Cause Analysis (RCA), and aggressively track metrics like Mean Time To Recovery (MTTR).
* General Duties
o Consulting & Planning
+ Participate in system design consulting, platform management, and capacity planning.
o Escalation Support
+ Act as the ultimate escalation point for complex support incidents and assignments while maintaining a high level of quality.
o On‑Call
+ Participate in the On‑Call Rotation.

Essential Knowledge, Skills and Experience

Core Competencies

* Technical Ability
o Highly proficient SME capable of reliably applying technical methods, leading cultural technical shifts (e.g., DevOps adoption), and supporting the development of new skills in colleagues.
* Problem Solving & Decision Making
o Capable of making decisions quickly and decisively, weighing options, and approaching problems methodically and innovatively.
* Communication & Influence
o Effectively communicates initiatives to all stakeholders and is capable of securing buy‑in for key transformational projects (like containerisation rollouts).

Technical Competencies

* Cloud, Hybrid & Containerisation
o Essential knowledge of hybrid architectures, managing both AWS and on‑premise environments. Extensive hands‑on experience designing and managing advanced containerisation environments using Docker, Kubernetes, and Argo Workflows to enable developer self‑service.
* Konnecto Tech Stack & Data Pipelines
o Proven experience managing modern storage layers and databases, specifically MongoDB and Snowflake. Experience supporting complex data ingestion layers, including clickdata streams, S3 raw/parsed ingestion, and Airflow ETL.
* Programming & Automation
o Experience in automation languages (Python or Bash).
o Deep understanding of GitHub and experience implementing or working alongside AI coding tools and practices.
o Knowledge of Infrastructure as Code (Terraform, Ansible).
* Security & Observability
o Experience with security in a DevOps environment. Experience managing observability stacks (e.g., Prometheus, Grafana, and Loki).
* Operations & Troubleshooting
o Exceptional Linux system administration skills. Highly proficient in troubleshooting, diagnosing, and independently solving issues using modern incident management frameworks.

Desirable Knowledge, Skills and Experience

* Innovation & Debt Management
o A keen interest in new technologies, specifically supporting development teams in the refactoring of technical debt.
* Legacy Databases
o Strong experience with relational databases (MySQL, PostgreSQL) and with Redis.
* Data Streaming
o Experience with data streaming and queuing technologies, specifically Apache Kafka and Druid.
* Web & Storage
o Knowledge of Nginx (or other web server technologies) and storage technologies like Gluster.

UK Benefits & Perks

* 25 days holiday in addition to bank holidays
* Enhanced Parental Leave: 6 months at full pay for the birth parent and 4 weeks at full pay for the non‑birth parent, after one year of employment
* 5 extra 'Partnerize Parental Days' each year
* Private Medical Insurance through Vitality
* Enhanced pension contributions
* Cycle to Work scheme
* Eye Care Vouchers
* Life Assurance
* Enhanced Wellness Program including access to EAP, Wellness Coaching & Wellness Fridays program
* Regular company events and activities

Our Commitment to Diversity & Inclusion

We are committed to attracting, developing, and advancing our outstanding team members, regardless of race, ethnic identity, sexual orientation, religion, age, gender, gender identity, physical abilities, or any other dimension of diversity. We strive to foster an environment where people can be their authentic selves, raise concerns and innovate, all without fear; where they are treated fairly and respectfully, have equal access to opportunities and resources and can contribute fully to the organisation’s success. Every individual in our business is expected to live this commitment without exception.
