AI/ML - Senior Platform Engineer – team.blue
Company: team.blue is an ecosystem of 60+ successful brands across 22 European countries, serving 3.5 million SMB customers. We provide hosting, SaaS, compliance, marketing tools, and team collaboration products as a one‑stop partner for online businesses and entrepreneurs.
Position Overview: We are looking for an experienced Senior AI/ML Platform Engineer to design, build, and maintain our machine learning and AI infrastructure platform. This role is essential for enabling our data science and AI teams to deploy, scale, and manage ML models efficiently across multi‑GPU environments, with a focus on LLM deployment and management.
Key Responsibilities
* Design and implement scalable ML/AI platforms supporting model deployment across multi‑GPU nodes.
* Build and maintain infrastructure for LLM inference serving, optimizing for latency and throughput.
* Develop automated deployment pipelines for machine learning models using containerization and orchestration.
* Create self‑service tools and APIs that enable data scientists to deploy models independently.
* Manage and optimize GPU cluster resources, ensuring efficient utilization and cost management.
* Implement monitoring, logging, and alerting systems for ML workloads and model performance.
* Design disaster recovery and backup strategies for critical ML infrastructure.
* Maintain high availability and reliability standards for production ML services.
* Build CI/CD pipelines tailored for ML model deployment and updates.
* Automate infrastructure provisioning using Infrastructure as Code (IaC) principles.
* Implement model versioning, rollback capabilities, and A/B testing frameworks.
* Develop automated scaling solutions for varying inference workloads.
* Work closely with data science teams to understand requirements and optimize deployment workflows.
* Provide technical guidance on best practices for model deployment and infrastructure usage.
* Collaborate with security teams to implement secure ML model serving practices.
* Document platform capabilities, procedures, and troubleshooting guides.
Profile
Professional Experience
* 4+ years of experience in platform engineering, DevOps, or infrastructure roles.
* 2+ years of experience specifically with ML/AI infrastructure or platforms.
Technical Skills
* Cloud Platforms: 4+ years of experience with AWS, Azure, or GCP, particularly GPU‑enabled services.
* Containerization: Proficiency with Docker and Kubernetes, including GPU scheduling and resource management.
* Infrastructure as Code: Experience with Terraform, CloudFormation, or similar tools.
* Programming: Strong skills in Python and at least one additional language (Go, Java, or Rust).
* ML Frameworks: Familiarity with PyTorch, TensorFlow, and model serving frameworks (TorchServe, TensorFlow Serving, etc.).
Platform & Operations Experience
* Experience building and maintaining production ML platforms or similar infrastructure (Kubeflow, MLflow, SageMaker, etc.).
* Knowledge of GPU computing, CUDA, and multi‑GPU distributed computing.
* Understanding of ML model lifecycle management and MLOps practices.
* Experience with monitoring tools (Prometheus, Grafana, ELK stack).
* Experience with streaming data processing (Kafka, Kinesis, Pulsar).
* Familiarity with service mesh technologies and API gateways.
AI/ML Knowledge
* Understanding of large language models (LLMs) and inference optimization techniques.
* Knowledge of model quantization, pruning, and other optimization methods.
* Experience with distributed training and inference across multiple GPUs/nodes.
* Familiarity with vector databases and embedding storage solutions.
Right to Work
Applicants must be eligible to work in the country where the role is based. We do not offer relocation packages or visa sponsorship.
ESG & Inclusion
At team.blue, our commitment to caring for the environment and each other is at the heart of everything we do. Our latest impact report showcases our ESG efforts and sustainability goals. Everyone is welcome; diversity & inclusion are at our core. We value respect, openness, and collaboration, and we do not tolerate intolerance.
Job Details
* Seniority level: Mid‑Senior level
* Employment type: Full‑time
* Location: Worcester, England, United Kingdom (note: some posts mention Birmingham – select the relevant location)