Job Description
Take on a critical role in shaping the future of a globally recognized firm, with direct and significant impact in an environment built for top performers in site reliability.
As an AI/ML Lead Site Reliability Engineer at JPMorgan Chase within the AIML Data Platform Team, you will hold a leadership role, demonstrate strong knowledge across multiple technical domains, and advise others on technical and business issues. You will lead resiliency design reviews, break down complex problems for other engineers, act as a technical lead for medium- to large-sized products, and mentor team members.
Job responsibilities
* Demonstrate and champion site reliability culture and practices, exerting technical influence across your team.
* Lead initiatives to improve the reliability and stability of applications and platforms using data-driven analytics to enhance service levels.
* Collaborate with team members to identify service level indicators, define service level objectives, and establish error budgets with stakeholders.
* Maintain high technical expertise in one or more domains, proactively resolving technology bottlenecks.
* Serve as the main contact during major incidents, quickly identifying and resolving issues to prevent financial losses.
* Partner with product engineering teams to ensure AI/ML systems are reliable and high-performing.
* Develop observability, security, automation, and FinOps tools and orchestration solutions.
* Provide strategic technology leadership by defining standards and architectures for reliability and automation frameworks.
* Build strong cross-functional relationships to deliver effective solutions.
* Debug and resolve issues in production, identify root causes, and implement remediation.
* Participate in on-call rotations, incident management, and escalation workflows.
Required qualifications, capabilities, and skills
* Formal training or certification in site reliability engineering concepts with practical experience.
* Deep proficiency in reliability, scalability, performance, security, and enterprise system architecture, with the ability to implement best practices.
* Proficiency in at least one programming language or framework such as Python, Java (Spring Boot), or .NET.
* Deep knowledge of software applications and technical processes, with emerging expertise in specific technical disciplines.
* Experience with observability tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk, covering monitoring, SLO alerting, and telemetry collection.
* Proficiency with CI/CD and infrastructure automation tools such as Jenkins, GitLab, and Terraform.
* Experience with containerization and orchestration tools such as Docker, Kubernetes, and ECS.
* Expertise in SRE principles, application and infrastructure reliability, scalability, and performance.
* Skill in programming with Python and in using Infrastructure as Code tools such as Terraform.
* Experience designing distributed systems and cloud-native architectures in AWS.
* Self-motivated with a strong sense of ownership, urgency, and drive.
Preferred qualifications, capabilities, and skills
* Experience in AI, ML, or data engineering.
* Expertise in Kubernetes and container orchestration.
* Experience developing automation frameworks or AIOps solutions.
* Experience building observability and telemetry tools.
About Us
J.P. Morgan is a global leader in financial services, providing strategic advice and products to prominent clients worldwide. We value diversity and inclusion, and are committed to equal opportunity employment.
About the Team
Our corporate functions support areas from finance and risk to human resources and marketing, ensuring our company's success and long-term partnerships with clients.