Insight Global is looking for an Operations Site Reliability Engineer to provide global operational support for the customer-facing SaaS products of a leading infrastructure software product company. You will be part of a team of engineers that demonstrates superb technical competency, operates mission-critical infrastructure and ensures the highest levels of availability (24x7x365), performance and security. This SRE will be part of the critical operations function responsible for the monitoring, availability and performance of production services. They will drive automation to reduce failures and manual tasks, thereby improving overall application performance and availability. As well as responding to stakeholder requests within agreed timescales or SLOs, they will support maintenance activities, critical systems and the planning of releases for production applications. They will also coordinate and communicate with impacted stakeholders, in line with the incident management process, through to restoration. Finally, they will provide mentorship where necessary to upstream teams (both internal and vendor) to reduce escalations and continually improve the overall customer experience. This is an opportunity to join a rapidly expanding organisation that also offers a highly competitive salary, bonus and equity package.
Must haves:
• A degree in Systems Engineering, Computer Science or related fields.
• Extensive professional experience working in a large cloud operations setting.
• Strong knowledge of and experience in administering Linux systems, with hands-on experience across multiple Linux distributions.
• Deep operational experience with Amazon Web Services or Google Cloud Platform.
• Experience of using an automation platform to automate repetitive actions and reduce manual effort.
• Experienced and confident in at least one scripting language such as Perl, Ruby, Bash (shell) or Python.
• Familiarity with deployment tools such as Ansible Tower and Jenkins.
• Experience in carrying out large deployments to global infrastructure.
• Experience of system/application administration in distributed, customer-facing, high-availability and large-scale environments.
Plusses:
• Proficient with orchestration/configuration tools such as Ansible and Terraform (if the candidate has a cloud background).
• Strong working knowledge of networking, including packet tracing and latency and throughput analysis, to pinpoint and resolve application issues.
• Thorough knowledge of HTTP(S), SMTP, TLS/SSL, DNS, LDAP, Kubernetes and Docker containers.
• A strong team player with a focus on delivery and the ability to grasp new technologies and adapt to changes in methodology.
• Extensive troubleshooting and problem-solving skills with respect to application technologies.
• Ability to communicate effectively at all levels up to senior management.
• Experience of tuning and optimising monitoring systems.
• Ability to remain calm and work well under pressure.
• A keen interest and desire to work within the security arena.