We are hiring a Senior ML Infrastructure Engineer to build, enable and operate the core platform that powers Chemify’s machine learning and scientific AI computing workloads. This role sits at the intersection of distributed systems engineering, machine learning infrastructure, scientific computing, and platform engineering.
You will build and maintain the operational backbone of the ML platform, ensuring that pipelines run reliably across Kubernetes clusters, on‑premise GPU infrastructure, and serverless compute environments. The systems you build will support ML engineers and computational chemists running workloads that range from large‑scale model training to molecular simulation.
Key Responsibilities
* ML Pipeline Orchestration: Implement routing logic that dispatches workloads to the appropriate compute backends; maintain workflow reliability, including retries, dependency management, and failure recovery.
* Linux Administration: Administer and support Linux servers, including security and scaling.
* Kubernetes Platform Operations: Operate clusters for ML training, inference, and batch workloads; maintain container build pipelines and GitOps deployment workflows; optimise cluster scheduling, autoscaling, and GPU utilisation.
* HPC / GPU Compute Integration: Integrate orchestration systems with HPC job schedulers; maintain execution paths for workloads running on GPU clusters; ensure artifacts and results from HPC jobs are captured and versioned.
* Model & Experiment Lifecycle: Operate model registry and experiment tracking platforms; ensure training runs are reproducible and linked to code and datasets; support promotion of models from staging to production.
* Data Versioning & Pipeline Traceability: Implement dataset versioning and lineage tracking across ML pipelines; ensure predictions are traceable to model versions and datasets; maintain reproducible ML training pipelines.
* Platform Tooling & Developer Experience: Develop platform CLI tools and pipeline templates; maintain base container images used for ML workloads; improve developer workflows for ML engineers and scientists.
* Observability, Security & Governance: Implement monitoring, logging, and alerting across orchestration systems; maintain infrastructure as code for platform resources; ensure workloads are traceable to source code, container images, and execution environments.
What You’ll Bring
* Degree in Science, Engineering, or a related field (or equivalent practical experience).
* Experience operating workflow orchestration platforms.
* Experience with containerisation and CI/CD pipelines.
* Experience with cloud infrastructure such as AWS or GCP.
* Experience operating distributed systems in production.
* Experience in cybersecurity and in operating within regulated environments.
Beneficial Skills
* Argo Workflows or Kubernetes workflow engines.
* SLURM or other HPC job schedulers.
* ML experiment tracking tools such as Weights & Biases or MLflow.
* Data versioning or lakehouse technologies such as LakeFS, Iceberg, or Delta Lake.
* Scientific computing environments.
* Internal developer platform or CLI tooling experience.