Job Overview:
We are seeking a dedicated and experienced Site Reliability Engineer (SRE) to join our dynamic team. The SRE will be responsible for ensuring the reliability, performance, and availability of our critical systems and services. This role requires a blend of software engineering and operations skills to build and run large-scale, distributed, fault-tolerant systems.

Key Responsibilities:

System Reliability and Performance:
Design, implement, and maintain scalable and reliable infrastructure.
Monitor system performance, detect issues, and ensure maximum uptime.
Develop and implement strategies for disaster recovery and data backup.

Automation and Tooling:
Automate repetitive tasks to improve efficiency and reduce human error.
Build and maintain tools for deployment, monitoring, and operations.
Create and maintain CI/CD pipelines to streamline application delivery.

Incident Management:
Respond to and resolve incidents, minimizing impact on customers.
Conduct post-incident reviews to identify root causes and prevent recurrence.
Develop and maintain incident response protocols and playbooks.

Collaboration and Communication:
Work closely with development teams to integrate reliability into the software development lifecycle.
Communicate effectively with stakeholders about system status and health.
Provide guidance and mentorship to junior team members.

Security and Compliance:
Ensure systems comply with security standards and best practices.
Implement and maintain security measures, including patch management and vulnerability assessments.
Assist in audits and compliance initiatives as required.

What you will need:
Bachelor's degree in Computer Science, Engineering, or a related field.
4 years of hands-on experience in a Site Reliability Engineering or DevOps role.
Strong experience in maintaining cloud platforms (e.g., AWS, Azure).
Proficiency in programming and scripting languages (e.g., Python, Go, Bash).
Experience with infrastructure automation and container orchestration tools (e.g., Docker, Kubernetes, Terraform, Ansible, Helm).
Familiarity with continuous integration and deployment tools (e.g., GitLab CI, Argo Workflows, Argo CD).
Experience in managing distributed systems like Kafka.
Experience with monitoring/logging solutions (e.g., Datadog, ELK, Prometheus).
Good understanding of concepts related to computer architecture, data structures, and programming practices.
Solid understanding of networking, databases, and security principles.
Strong debugging and troubleshooting skills.

Our Values:
We work together
We believe in people
We won’t accept the ‘way it’s always been done’
We listen to learn
We’re trying to do the right thing

Benefits:
Competitive salary
Company Pension & Life Assurance Schemes
On-site parking
Hybrid Working
Subsidised Gym Membership
Wellness programmes

EQUAL EMPLOYMENT OPPORTUNITY STATEMENT
Individuals seeking employment at Camlin are considered without regard to race, colour, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, gender identity, or sexual orientation.