Your Team
As part of the Cloud Operations team, you will play a vital role in supporting the People First SaaS platform, a modern, microservices-based HR and payroll solution built in Azure and delivered to hundreds of customers.
Your Impact
As a Senior Site Reliability Engineer, you will help ensure the reliability, scalability and automation of MHR’s People First platform through effective cloud operations, observability and continuous improvement. You will apply SRE principles to build resilient systems and strengthen operational excellence in Azure.
In this role you will:
* Build, deploy and maintain cloud environments through automated processes, ensuring consistent, reliable and scalable platform operations.
* Implement and optimise monitoring, alerting and diagnostics, using observability data to support SLIs/SLOs, reduce MTTR and improve service reliability.
* Collaborate with Platform and Development teams to ensure systems are designed for operability, resilience, performance and effective capacity management.
* Automate provisioning, scaling and configuration management using scripting and IaC tooling to minimise toil and improve repeatability.
* Contribute to incident response and root cause analysis, document and evolve operational standards, and participate in the on-call rota to support platform availability.
What you'll bring to the role and MHR
* Experience with DevOps and Continuous Delivery practices, and an understanding of how they apply to reliable service operation.
* Experience working with backend services built in Java, .NET or similar languages, and an understanding of how these architectures influence deployments, change management and rollback safety.
* Experience implementing logging, metrics and tracing for backend services using tools such as Dynatrace, Azure Monitor, Application Insights or Grafana to inform SLIs/SLOs and reduce MTTR.
* Strong understanding of IaC principles and experience delivering consistent, auditable environments using Terraform and ideally Bicep.
* Hands-on experience operating cloud-hosted SaaS platforms in Microsoft Azure with a focus on resilience, autoscaling, fault tolerance and operational readiness.
* Ability to automate workflows across build, deploy, configuration, drift correction and resilience tasks using PowerShell, Terraform or similar scripting and automation tooling.
* Experience supporting incident response, performing root cause analysis, contributing to post-incident reviews and implementing preventive measures across backend services.
* Experience designing or maintaining CI/CD pipelines for Java, .NET or similar codebases, including quality gates, test automation, performance checks and release observability.