Join us as a Senior Site Reliability Engineer, where you'll spearhead the evolution of our digital landscape, driving innovation and excellence.
This role involves applying software engineering techniques, automation, and best practices in incident response to ensure the reliability, availability, and scalability of our systems, platforms, and technology.
Key skills and experience required include:
* Subject-matter expertise (SME) in Oracle Enterprise Manager (OEM), Oracle Internet Directory (OID), and Oracle Database performance tuning
* Deep understanding of LDAP protocols and directory services
* SQL Optimization
* Scripting skills (e.g., Python, Bash) and experience with configuration management tools (e.g., Ansible, Puppet, Chef)
* Expertise with monitoring systems (e.g., Prometheus, Grafana)
Additional valued skills:
* Experience with cloud platforms (AWS, Azure, Google Cloud)
* Knowledge of containerization and orchestration (Docker, Kubernetes)
* Ability to diagnose and resolve production incidents
* Strong interpersonal skills for cross-functional teamwork
This role is based at our Knutsford campus.
Purpose of the role:
To apply software engineering techniques, automation, and incident response best practices to maintain system reliability, availability, and scalability.
Accountabilities include:
* Ensuring system performance and scalability through monitoring, maintenance, and capacity planning
* Responding to system outages, analyzing issues, and implementing preventative measures
* Developing tools and scripts to automate operations, improve efficiency, and enhance resilience
* Monitoring and optimizing system performance and resources
* Collaborating with development teams to embed reliability and performance best practices into software development