HCL Technologies is a $13 bn technology services company that takes tremendous pride in helping customers through their digital transformation journey. Our sharp focus on the Mode 1-2-3 strategy has helped us become the fastest-growing large tech services company globally.
Incident Manager
Mandatory Skills - Knowledge of Jira, APM and automation tools.
Working very closely with the Internal IT, Infrastructure and tribe DevOps teams to provide:
o Incident ticket management, escalations and stakeholder communications during incidents, primarily in support of the Marco Polo production application.
o Proficient, experienced management of critical/major incidents in a tier 1 payments service.
o Support for DevOps technical input and triage, for instance log reviews and incident tracing to identify issues, and calling in the right support teams when an issue cannot be resolved directly.
o Work with IT teams to identify solutions and support excellent service quality in the non-functional area, with the aim of first-class resilience, reliability and stability of the service.
o Review/assess monitoring and alerting implementations with the tribe SMEs, to ensure the operational state of the service is visible at all times.
o Build, review and improve service level reporting and quality metrics (through incident metrics/reviews).
o Work with the tribes to coordinate and lead Service Recovery activities where required.
o Partner with tribe leads/service owners and development teams to ensure that solutions adhere to non-functional standards, ensuring reliability, resilience and scalability.
o Conduct and lead post-incident reviews, and work with teams to ensure problem records are logged, tracked and resolved in a timely manner (including root cause analysis and resolution).
o Build/review/improve service dashboards that show the current status of the whole Marco Polo production environment, and report on recent performance.
o Build the ability to understand and evaluate the ongoing impact of releases, to ensure there are no downward trends in reliability and performance.
o Work with tribes to support the build and improvement of end-to-end business-flow synthetic monitoring and automated regression tests, allowing early visibility of production problems.
o Work with Atlassian Jira Service Management/Opsgenie/StatusPage to enhance and improve ticket handling, alerting automation and dashboards that provide management information (MI).