Requirements
* Bachelor’s Degree or equivalent experience in Computer Science, Engineering, or a related field
* 5+ years of hands‑on technical experience in SRE, Platform Engineering, Infrastructure, or related roles
* Strong experience with AWS (or Azure), including services such as EKS, ECS, EC2, networking, IAM, and managed services
* Solid understanding of cloud security principles and experience collaborating with security teams
* Strong background in Linux systems administration
* Proven experience designing and operating observability platforms, including monitoring, logging, and alerting
* Hands‑on experience with Datadog for metrics, logs, APM, and alerting
* Strong understanding of SRE principles, including SLOs, error budgets, incident management, and reliability engineering
* Experience working closely with architecture and engineering teams on system design and delivery
* Experience with cloud cost optimization strategies and tooling
* (Desirable) Experience supporting multi‑cloud or hybrid environments
* (Desirable) Exposure to Infrastructure as Code (e.g., Terraform, CloudFormation)
* (Desirable) Experience in large‑scale, complex, or regulated environments
* (Desirable) Knowledge of vector databases and RAG architectures for building internal SRE knowledge assistants
* (Desirable) Knowledge of Generative AI and LLM platforms (e.g., Claude, Amazon Bedrock)
* Strong technical authority with the ability to influence design and operational decisions
* Highly collaborative, comfortable working across architecture, engineering, security, and operations teams
* Calm and methodical under pressure, especially during incidents and critical issues
* Pragmatic problem‑solver who balances reliability, security, cost, and delivery speed
* Clear communicator, able to explain complex technical concepts to diverse audiences
What the job involves
* We are evolving our Site Reliability Engineering capabilities to strengthen reliability, observability, security, and operational excellence across our Risk Intelligence division
* As a Senior SRE, you will be a senior hands‑on technical person who helps shape the foundations of reliability across both new and existing platforms
* You will collaborate with Architecture, Engineering, Security, and Platform teams to ensure reliability is built into systems from day one
* While this is not a people‑management role, you will work closely with global teams and may occasionally be called upon for major incidents or critical issues
* This position requires a highly proactive, hard‑working expert with strong leadership presence and ownership of platform reliability outcomes
* We are looking for a person who is passionate about reliability engineering and who brings a continuous improvement approach to everything they do!
* Lead the establishment of SRE foundations for new projects: building environments, monitoring, and alerting, and ensuring operational readiness from day one
* Define, implement, and champion observability standards, tooling, and guidelines across metrics, logs, traces, and SLIs/SLOs
* Design and evolve monitoring and alerting solutions that improve visibility, reduce toil, and strengthen system health
* Continuously drive reliability improvements across our environments through incident reduction, performance tuning, and building resilient patterns
* Partner with Security teams to ensure our platforms meet compliance, security, and risk‑management expectations
* Influence architectural and design decisions through data‑driven cloud cost optimization and efficiency initiatives
* Be a technical leader and mentor, supporting engineers, shaping engineering standards, and fostering a culture of learning and development