Principal Machine Learning Engineer (SageMaker, MLOps, Model Governance & Explainability)
We are seeking a Principal Machine Learning Engineer to provide technical leadership across the full lifecycle of machine learning systems powering a new matching platform. This role is accountable for defining ML architecture, establishing engineering standards, driving MLOps maturity, and ensuring that our models are scalable, secure, explainable, and governed to enterprise‑grade standards.
You will contribute to the strategic direction of our ML platform—spanning data pipelines, model development, deployment automation, inference runtime design, telemetry, drift detection, and cross‑account productionisation. You will mentor engineers, influence product and architectural decisions, and ensure that our ML systems operate reliably at scale, underpinned by a robust governance and compliance framework.
This is a highly hands‑on, highly technical, principal‑level role that combines architectural vision with deep practical expertise in ML engineering and AWS-native MLOps.
Key Responsibilities
Technical Leadership & Architecture
* Define the end‑to‑end ML architecture for the matching platform, including data pipelines, model training workflows, inference runtimes, and telemetry ecosystems.
* Lead adoption of best‑in‑class MLOps patterns, platform tooling, and AWS SageMaker capabilities across training, processing, registry, monitoring, and deployment.
* Partner with platform, security, and data engineering teams to implement scalable, lakehouse‑oriented feature architectures and enterprise‑grade ML governance.
* Champion engineering standards for model quality, documentation, observability, and platform resilience.
Feature Engineering & Data Architecture
* Architect highly scalable, production‑ready feature pipelines within lakehouse environments.
* Set the technical direction for fallback and resilience strategies.
* Establish and enforce data‑quality guardrails, validation schemas, and monitoring frameworks.
* Drive adoption and standards for enterprise feature stores.
Model Development & Technical Excellence
* Lead the design of ranking, scoring, and similarity models tailored to the matching platform’s requirements.
* Define model calibration, scoring logic, confidence thresholds, and optimisation strategies.
* Mentor teams on advanced ML techniques using frameworks such as PyTorch, TensorFlow, and XGBoost.
* Review and approve technical designs for complex modelling workflows.
Explainability & Regulatory‑Grade Reasoning
* Establish explainability standards across the ML stack, using SHAP or equivalent frameworks.
* Define patterns to generate regulator‑ready reason codes, aligned with compliance requirements.
* Ensure explainability artefacts are accurate, robust, and traceable across model versions.
ML Deployment & Automation (MLOps)
* Architect automated training, deployment, and retraining pipelines using AWS SageMaker.
* Set standards for model registry usage, automated approvals, and rollback orchestration.
* Drive infrastructure‑as‑code and CI/CD maturity for ML systems across multiple environments.
* Lead design of enterprise‑wide weight‑update patterns and lineage‑aware deployment strategies.
Inference Runtime & Cross‑Account Productionisation
* Architect low‑latency, high‑throughput inference services that meet strict matching platform SLAs.
* Lead the design of secure cross‑account IAM patterns for model consumption.
* Own end‑to‑end telemetry design, including scoring metrics, latency, error analytics, and SLOs.
* Partner with platform teams to optimise cost, scale, and reliability of inference endpoints.
Monitoring, Drift Detection & Observability
* Define observability standards for feature drift, concept drift, performance degradation, and data integrity.
* Lead the creation of dashboards, benchmarks, and automated alerting across the ML ecosystem.
* Ensure telemetry pipelines adhere to privacy, data minimisation, and compliance policies.
* Drive adoption of proactive failover, shadow‑mode testing, and continuous validation patterns.
Security, Compliance & ML Governance
* Set and enforce ML‑specific security standards including data minimisation, encryption, and PII handling.
* Oversee creation of Model Cards, lineage artefacts, and compliance documentation.
* Ensure ML systems meet governance standards for auditability, reproducibility, versioning, and traceability.
* Collaborate with InfoSec and Risk teams to define ML governance frameworks and secure cross‑environment workflows.
Testing, Validation & Performance Engineering
* Lead validation strategies using golden datasets, behavioural tests, and benchmark suites.
* Architect performance testing for latency‑sensitive inference paths and model hot paths.
* Establish standards for A/B testing, shadow deployments, canary rollouts, and controlled experiments.
Essential
Principal‑Level Skills & Experience
* Proven track record architecting and delivering production ML systems at scale in enterprise environments.
* Deep expertise with AWS SageMaker (training, processing, pipelines, endpoints, registry) and complementary AWS services.
* Expert‑level proficiency in Python and ML frameworks (e.g. PyTorch, TensorFlow, XGBoost).
* Strong thought leadership in MLOps automation, CI/CD for ML, and model lifecycle management.
* Advanced experience designing explainability systems, reason codes, and governance artefacts.
* Expertise in low‑latency inference architectures and real‑time model serving.
* Strong grounding in drift detection, telemetry pipelines, observability patterns, and model QA.
* Experience shaping ML security practices, including cross‑account IAM, data minimisation, and PII‑safe design.
* Ability to influence architecture, mentor senior engineers, and set long‑term technical direction.
Nice to Have
* Experience building or leading feature store adoption.
* Background in ranking, search relevance, entity matching, or similarity modelling.
* Experience designing or governing multi‑account AWS ML platforms.
* Knowledge of distributed training, GPU/accelerator optimisation, and scaling strategies.
* Bachelor’s in a STEM subject (e.g. mathematics, physics, engineering, computer science, or adjacent degrees).
* A Master’s or PhD in a STEM subject, or equivalent experience, is desirable but not essential.
Career Stage
Manager
Equal Opportunity Statement
We are proud to be an equal opportunities employer. This means that we do not discriminate on the basis of anyone’s race, religion, colour, national origin, gender, sexual orientation, gender identity, gender expression, age, marital status, veteran status, pregnancy or disability, or any other basis protected under applicable law. Conforming with applicable law, we can reasonably accommodate applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs.