Overview
LexisNexis Risk Solutions is seeking a Site Reliability Engineering Lead to join our global engineering team. This role will be key to refining DevOps practices, agile support and deployment processes to improve the reliability of our public cloud-based services. The successful candidate will work in a fast-paced environment with an on-call rotation for 24/7/365 system availability. Our teams are collaborative and forward-thinking, partnering with Development, QA, IT Operations, Customer Operations and Project Management to support critical applications and projects.
Responsibilities
* Lead SRE teams to build and maintain Infrastructure as Code (IaC), software services (PaaS and SaaS), security policies and continuous integration / deployment processes.
* Reduce technical debt, apply security hardening and optimise cloud-based environments.
* Maintain critical production services with a focus on uptime and reliability for tier 1 / mission critical 24/7 services.
* Collaborate with diverse teams including Development, QA, IT Operations, Customer Operations and Project Management.
* Lead FinOps processes for continuous review and ongoing cost optimisation.
Essential Skills and Attributes
People (40%)
* Lead SRE teams to ensure skill breadth by rotating staff across products/platforms.
* Share and collaborate with other SRE Managers/Leads/Cloud Centre of Excellence to adopt best practices.
* Mentor and develop direct reports; stay up to date with a fast-paced sector.
* Maintain systems and application documentation for technical and non-technical audiences.
* Set and regularly review goals and objectives, and manage direct reports.
Financial (5%)
* Maintain and forecast cloud OPEX spend.
* Ensure cloud spend is reasonable and optimised via FinOps processes.
Customer (20%)
* Take accountability for uptime and resiliency across a group of products.
* Ensure 24/7 technical support and SLAs for customers are met.
Technical (30%)
* Examine complex releases to ensure system resilience.
* Drive automation to maximise IaC and reduce traditional operational effort.
* Understand defensive, corrective and detective controls, and general application troubleshooting.
* Deliver resilient application stacks via IaC and DevOps practices.
* Monitor and support critical, high revenue business applications.
* Diagnose and resolve complex system and application issues.
* Experience hosting critical apps in public clouds (AWS and/or Azure) via services such as EC2, ECS, AKS or ACA.
* Experience with containerised workloads (Docker) and orchestration (Kubernetes).
Other (5%)
* Manage Customer Reliability Engineering activities driving application monitoring, metrics, incident reviews and long-term actions.
* Support BISO/InfoSec changes and work with security tooling (Qualys, Wiz, TruffleHog, GitHub Advanced Security, etc.).
This role offers the opportunity to work on challenging technologies and contribute to the evolution of our technology stack, including AWS, Azure, Docker, Kubernetes and Terraform, while delivering value to customers and society.