AI Infrastructure Engineer / MLOps Engineer
Join Lenovo’s AI Technology Center (LATC) – a global AI Center of Excellence – to help shape AI at a truly global scale. We’re building the next wave of AI core technologies and platforms, and we need a highly skilled AI Infrastructure Engineer / AI Operations Engineer to design, build, and maintain the infrastructure and tools necessary for efficient AI model development, deployment, and operation.
Responsibilities:
* AI Infrastructure Design and Implementation: Design, build, and maintain scalable and efficient AI infrastructure, including compute resources, storage solutions, and networking configurations.
* AI Model Deployment and Management: Develop and implement processes for deploying, monitoring, and managing AI models in production environments.
* Automation and Tooling: Create and maintain automation scripts and tools for AI model training, testing, evaluation, and deployment in a continuous integration / continuous delivery (CI/CD) pipeline.
* Collaboration and Support: Work closely with data scientists, engineers, and other stakeholders to ensure smooth operation of AI systems and provide support as needed.
* Performance Optimization: Continuously monitor and optimize AI infrastructure and models for performance, scalability, utilization, and reliability.
* Security and Compliance: Ensure AI infrastructure and models comply with relevant security and regulatory requirements.
Qualifications:
* Bachelor’s or Master’s degree in Computer Engineering, Electrical Engineering, Computer Science, or a related field.
* 8+ years of experience in software engineering, DevOps, or a related field.
* Strong background in computer systems, distributed systems, and cloud computing.
* Proficient in Linux system administration, including package management, user/group management, file system navigation, shell scripting (bash), and system configuration (systemd, networking).
* Proficiency in programming languages such as Python, Java, or C++.
* Experience with AI-specific infrastructure and tools (e.g., NVIDIA GPUs and CUDA).
* Experience setting up multi-node distributed GPU clusters using Slurm, Kubernetes, or related software stacks.
* Experience with managing high-performance computing (HPC) clusters, including job scheduling, resource allocation, and cluster maintenance.
* Familiarity with configuring job scheduling tools (e.g., Slurm).
* Experience with AI infrastructure, model deployment, and management.
* Excellent problem‑solving and analytical skills.
* Strong communication and collaboration skills.
* Ability to work in a fast‑paced, dynamic environment.
Bonus Points:
* Familiarity with AI and machine learning frameworks (PyTorch).
* Familiarity with cloud platforms (AWS, GCP, Azure).
* Experience with containerization (Docker) and orchestration (Kubernetes).
* Experience with monitoring and logging tools (Prometheus, Grafana).
What we offer:
* Opportunities for career advancement and personal development.
* Access to a diverse range of training programs.
* Performance‑based rewards that celebrate your achievements.
* Flexibility with a hybrid work model (3:2) that blends home and office life.
* Electric car salary sacrifice scheme.
* Life insurance.
Location: Edinburgh, Scotland – candidates must be based there, as the role requires working from the office at least three days per week (3:2 hybrid policy).
Seniority level: Mid‑Senior level
Employment type: Full‑time
Job function: Information Technology
Industry: IT Services and IT Consulting