About Colt Data Centre Services
Colt Data Centre Services (DCS) has over 20 years’ experience in designing, building and operating energy-efficient, reliable data centres – hosting significant financial, media, corporate and cloud wholesale providers across the world.
Our customers are at the heart of everything we do. We endeavour to take a customer-led approach across our operations, striving to serve our customers with a seamless experience no matter what facility or region they are in.
Finding the right solutions for our customers starts with finding the right people for Colt DCS. We believe in creating a healthy learning environment in which our employees can flourish.
Our vision: to be the most customer-centric data centre provider.
About the Role
Reporting to the Senior Problem Manager and working predominantly remotely, you will be responsible for identifying, analysing, and supporting the resolution of DCS service issues to minimise business impact and prevent future incidents.
In doing so, you'll work closely with cross-functional teams to investigate root causes, implement corrective actions, and drive continuous improvement in DCS service delivery. This may also include working with or presenting to DCS customers and/or suppliers.
In practice, this means that you'll manage the lifecycle of infrastructure Problem tickets across our DCS estate. These will include Mechanical & Electrical (M&E), IT and Networking infrastructure across all of our sites in Europe, India and Japan.
You'll also work proactively to analyse issue data and use it to deliver comprehensive recommendations that help prevent these issues from recurring where possible.
About you
To be successful, you'll need strong problem management and analysis expertise, supported by knowledge and experience of at least one Root Cause Analysis (RCA) method.
As you'll be investigating, analysing and reporting on a range of infrastructure-related issues, you'll need strong hands-on knowledge of Mechanical & Electrical (M&E) systems and data centre infrastructure. Building Management System (BMS) experience would be advantageous.
All of this will be underpinned by your strong communication skills and the ability to help technical and non-technical audiences understand issues and how best to resolve them.
What we offer
We offer skill development, learning pathways and accreditation to help our people perform at their best, regardless of role and location.
In addition to competitive salaries and incentive plans, we offer a range of benefits and local rewards packages. Colt DCS recognises the importance of a work-life balance.
These are just some of the reasons why Colt DCS is recognised as Great Place to Work Certified UK.
The Role in More Detail
Key Responsibilities
• Risk Identification/Problem Identification and Analysis:
o Investigate and analyse incidents to determine root causes (RCA) and patterns.
o Proactively identify recurring issues through trend analysis, with a specific focus on Building Management System (BMS) data analysis.
o Carry out risk assessments of non-standard Changes.
• Resolution and Mitigation:
o Ensure timely resolution and closure of problem tickets.
o Identify permanent solutions for recurring incidents.
o Ensure timely escalation of critical issues and coordinate resolution efforts across teams.
• Documentation and Reporting:
o Maintain comprehensive documentation of problems, their resolutions, and preventive measures.
o Prepare detailed RCA reports and graphs to support the technical explanation of a Problem.
o Keep documentation for problem management accurate and current.
• Continuous Improvement:
o Prepare presentations on Problem management metrics, trends, and improvements.
o Proactively identify opportunities to enhance service reliability and efficiency.
o Participate in problem management initiatives and projects to drive continuous improvement.
o Deliver internal technical training in areas of expertise.
• Collaboration and Communication:
o Collaborate with stakeholders to gather information and perform thorough problem diagnostics.
o Join and document problem review meetings, providing regular updates.
o Actively participate in Monthly Executive Reviews and other key meetings, leading them if required.
o Work with cross-functional teams to implement corrective actions and drive improvement.
o Support and oversee the onboarding process for new sites, representing Service Operations, with particular focus on ensuring documentation is available and accurate and “snags” are captured.
Requirements
• Hands-on knowledge of Mechanical & Electrical (M&E) systems and Data Centre infrastructure.
• Knowledge of IT and Networking concepts.
• Proven experience in problem management or incident management within a complex DC environment.
• Knowledge and experience of at least one common RCA method (5-Whys, Fishbone Diagram, Fault Tree Analysis, Pareto Analysis).
• Strong analytical skills with the ability to analyse data, identify trends, and make data-driven decisions.
• Excellent documentation and reporting skills in English.
• ITIL certification or knowledge of ITIL framework practices.
• Experience with monitoring systems and the ability to analyse monitoring data to detect and pinpoint issues.
Note that this role requires travel to site locations when needed, in addition to a small number of trips to our offices each year for face-to-face team meetings.