Key Responsibilities:
* Design, implement, and maintain scalable, highly available infrastructure and services.
* Develop automation scripts and tools to improve system reliability and operational efficiency.
* Monitor and troubleshoot system performance, identifying and resolving issues to minimise downtime.
* Implement and maintain CI/CD pipelines to support efficient software delivery.
* Develop and enforce best practices for security, monitoring, and incident management.
* Collaborate with development teams to enhance application performance and stability.
* Create detailed documentation and conduct post-incident reviews to identify root causes and implement long-term solutions.
Essential Skills and Experience:
* Proven experience in Site Reliability Engineering, DevOps, or similar roles.
* Strong understanding of cloud platforms (AWS, Azure, or GCP) and containerisation technologies (Kubernetes, Docker).
* Proficiency in scripting and programming languages such as Python, Bash, or Go.
* Hands-on experience with monitoring and observability tools like Prometheus, Grafana, and the ELK stack.
* Familiarity with infrastructure-as-code tools like Terraform or Ansible.
* Solid understanding of networking concepts and system security best practices.
* Excellent problem-solving skills and a passion for automation and continuous improvement.
Desirable:
* Certifications in cloud platforms or DevOps tools.
* Experience with large-scale distributed systems.
This role offers the opportunity to work on mission-critical projects in a fast-paced and collaborative environment, driving innovation and reliability in our technology ecosystem.
Rullion celebrates and supports diversity and is committed to ensuring equal opportunities for both employees and applicants.