We are seeking a Site Reliability Engineer (SRE) to join our team in central London.
You will be at the forefront of improving the reliability, performance, and quality of services delivered to a key national security customer.
You will join a growing team that fosters a culture of continuous improvement and plays a pivotal role in transforming how systems are developed and supported.
This role combines operational support with software engineering, allowing you to design tools and applications that monitor and improve system health.
As part of a wider programme, you will be integral to supporting the customer's critical mission.
Key Responsibilities
* Support and maintain critical services, enhancing the availability, performance, and stability of core mission applications.
* Participate in the 24/7 on-call rota, supporting production systems outside business hours (on-call allowances and overtime pay apply).
* Automate manual operational work (e.g. incident tickets, on-call tasks) to reduce toil and improve efficiency.
* Collaborate with development teams, advising on best practices for system design and implementation.
* Design and deploy monitoring tools to provide intelligent insights into system health, customising tools where necessary.
* Understand the relationship between software and infrastructure, ensuring systems are scalable and resilient to failure.
* Participate in the wider DevOps/SRE community, sharing knowledge and best practices across the organisation.
Requirements
Experience and Skills
* Experience in, or enthusiasm for, software development using web technologies and object-oriented programming.
* Familiarity with database technologies such as Oracle, MongoDB, or PostgreSQL.
* Proficiency with Linux and Windows command lines (e.g. Bash, PowerShell).
* Experience monitoring large-scale systems with tools such as Grafana, Prometheus, the ELK stack, and Splunk.
* Knowledge of Agile methodologies and supporting tools such as the Atlassian suite.
* Strong troubleshooting skills across various levels of the application stack.
* Familiarity with ITIL processes.
* Experience with microservices architectures and containerisation and orchestration platforms such as Docker, Kubernetes, and OpenShift.
* A passion for learning new technologies and solving complex problems.
* Awareness of emerging trends and tools in site reliability engineering.
Benefits
This role offers a range of benefits, including a competitive salary, opportunities for professional growth and development, and a collaborative, dynamic work environment.
About the Role
We are looking for a skilled and motivated Site Reliability Engineer to join our team in central London. If you have experience in software development, system administration, and Agile methodologies, we encourage you to apply.