AI ML Lead Site Reliability Engineer, Glasgow
Location: Glasgow, United Kingdom
Job Category: Other
EU work permit required: Yes
Job Reference: 52b06e958328
Posted: 23.05.2025
Expiry Date: 07.07.2025
Job Description:
Take on a critical role in shaping the future of a globally recognized firm, with direct and significant impact in a field built for top achievers in site reliability.
As an AI ML Lead Site Reliability Engineer at JPMorgan Chase within the AIML Data Platform Team, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues they face. You will lead and conduct resiliency design reviews, break complex problems down into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Responsibilities include:
* Demonstrating and championing site reliability culture and practices, exerting technical influence throughout your team
* Leading initiatives to improve the reliability and stability of applications and platforms using data-driven analytics
* Collaborating to identify service level indicators and establish service level objectives and error budgets with stakeholders
* Exhibiting high technical expertise and proactively solving technology-related bottlenecks
* Acting as the main contact during major incidents to identify and resolve issues promptly
* Partnering with product engineering teams to ensure reliability and performance of AI/ML systems
* Developing tools and orchestration for observability, security, automation, and FinOps
* Providing strategic technology leadership and defining standards for reliability and automation frameworks
* Building cross-functional relationships to deliver solutions and resolve user problems
* Debugging and solving production issues, identifying root causes, and remediating them
* Participating in on-call rotations, incident management, and escalation workflows
Required qualifications include:
* Formal training or certification in site reliability engineering concepts and applicable experience
* Deep proficiency in reliability, scalability, performance, security, and enterprise system architecture
* Fluency in programming languages and frameworks such as Python, Java (Spring Boot), or .NET
* Deep knowledge of software applications and technical processes, with emerging expertise in specific disciplines
* Proficiency in observability tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk
* Experience with CI/CD and infrastructure automation tools such as Jenkins, GitLab, and Terraform
* Experience with containerization and orchestration tools such as Docker, Kubernetes, and ECS
* Expertise in SRE principles, reliability, scalability, and performance of applications and infrastructure
* Proficiency in Python programming and Infrastructure as Code tools like Terraform
* Experience designing distributed systems and cloud-native architectures in AWS
* Self-motivated with a strong sense of ownership and urgency
Preferred qualifications include prior experience in AI, ML, or data engineering, and expertise in Kubernetes, automation frameworks, and observability/telemetry tools.