My Belfast-based client is looking for a Site Reliability Engineer to lead reliability strategy for large-scale, production-critical systems. This is a senior, hands-on role where you’ll influence architecture, improve resilience, and drive reliability improvements across multiple engineering teams. You’ll thrive here if you enjoy operating complex production environments, shaping technical direction, and mentoring engineers while staying close to the technology.
As the Lead SRE, you’ll provide technical leadership for product reliability, observability, and operational excellence. You’ll define reliability standards, shape the roadmap, and deliver high-impact improvements across cloud-based platforms.
About the role:
* Lead product reliability strategy, defining and owning the Reliability Roadmap
* Design and implement SLIs, SLOs, and error budgets aligned to customer experience
* Drive observability and monitoring using metrics, logs, and distributed tracing
* Partner with product and engineering leads on reliability, performance, capacity, and DR testing
* Act as a senior escalation point during on-call and major incident response
* Lead post-incident reviews and prioritise both short-term fixes and long-term improvements
* Reduce operational toil through automation and shift-left practices
* Lead production reviews using SLO, incident, and reliability data
* Represent SRE in architecture and design decisions, prioritising resilience and scalability
* Mentor engineers and champion SRE best practices
* Support the migration of applications to Google Cloud Platform (GCP)
* Optimise capacity, performance, and cost without impacting reliability
* Build proofs of concept (POCs) that can be reused across teams
Requirements:
* 5+ years' hands-on experience in a Site Reliability Engineering role
* Strong knowledge of Linux-based systems and distributed system architectures
* Experience with cloud platforms, ideally Google Cloud Platform (GCP), including GCE and/or GKE
* Strong programming and automation skills (Python, Bash, Terraform, Ansible; Java a plus)
* Experience with CI/CD automation and modern delivery pipelines
* Deep familiarity with observability and monitoring tools (Prometheus, Grafana, Splunk, or similar)
* Solid understanding of networking fundamentals and messaging middleware
* Proven ability to influence technical direction and lead cross-team initiatives
* Excellent communication and stakeholder management skills
Highly Desirable:
* Experience building or supporting financial systems or trading platforms
* Exposure to ultra-low latency (ULL) environments
* Experience working in Agile teams
Apply now or email your CV to shane.doolin@realtime.jobs.
Applicants must be Belfast-based with full working rights for Northern Ireland.