Responsibilities
* Collaborate closely with the existing SRE teams to build and enhance tooling and automation solutions, enabling faster resolution of issues impacting SLOs and preventing incidents when possible.
* Engage with customers to understand their supportability challenges and SLO attainment concerns, developing sustainable strategies to address recurring issues.
* Serve as the primary technical contact for interfacing with large enterprise customers, managing service escalations, and driving issues toward resolution.
* Design and implement changes to service telemetry where the data needed to support automation is not already available.
* Improve customer experience through proactive alerting based on utilization, trends, and resource health.
* Analyze data to provide operational insights to the Design and Product teams, aiding in the development of features with supportability in mind.
Qualifications
* Extensive technical experience in software engineering, network engineering, or systems administration.
* Operational experience in enhancing service reliability, availability, and performance.
* Ability to navigate ambiguity in a fast-paced environment.
* Strong problem-solving skills, effective communication, and curiosity.
* Expertise in analyzing, troubleshooting, and automating root cause analysis for incidents in large-scale distributed systems.
* Willingness to travel regularly to customer sites in the South West of the UK.
Preferred Qualifications
* Knowledge of HPC systems.
* Experience influencing product architecture and roadmaps to prioritize supportability.
Other Requirements
Must meet Microsoft, customer, and/or government security screening requirements, including passing the Microsoft Cloud Background Check and UK security standards. Microsoft is an equal opportunity employer and provides accommodations for applicants with disabilities.