Overview
Our client is a well-funded life sciences startup in stealth mode. They are seeking a Principal Data Engineer to lead the data and infrastructure systems that power a foundation model aimed at transforming drug development.
Responsibilities
* Lead data and infrastructure systems powering foundation model initiatives in drug development.
* Own data workflows end-to-end, from extraction and transformation to clean Parquet outputs for machine learning teams.
* Collaborate closely with wet lab teams and develop a practical understanding of assays and protocol development.
* Set up cloud data infrastructure from scratch, including compute, storage, networking, and access controls.
* Build reliable, repeatable pipelines with testing, version control, and clear documentation.
* Maintain data quality, lineage, and monitoring; implement sound data modeling practices.
Qualifications (Requirements)
* Principal-level data engineering experience in life sciences is essential.
* Experience owning data workflows end-to-end, from extraction to machine-learning-ready Parquet outputs.
* Hands-on familiarity with genomics data, including raw FASTQ files and Illumina sequencer outputs.
* Experience with metabolomics data, particularly untargeted mass spectrometry.
* Strong collaboration with wet lab teams and practical understanding of assays and protocol development.
* Experience building cloud data infrastructure from scratch (compute, storage, networking, access controls).
* Strong Python and SQL skills; proficient in data modeling, data quality, lineage, and monitoring.
* Ability to design and maintain reliable pipelines with testing and documentation.
Preferences
* Experience building data lakes or lakehouses and automating batch workflows (e.g., Airflow).
* Familiarity with NGS pipelines (quality control, alignment/assembly, variant calling) and mass spectrometry data analysis.
* Use of Infrastructure as Code (Terraform), containerization (Docker), and CI/CD for deploying data systems.
* Prior 0-to-1 startup experience and close collaboration with ML and biology teams.
Why Join
* Design and build cloud infrastructure and data pipelines powering distributed ML training and scalable biological data workflows—without legacy constraints.
* Work with first-of-their-kind, multi-modal datasets to support foundation model training at AlphaFold scale; this is a builder role with deep technical ownership.
* Join as a founding member of the engineering team with significant equity and end-to-end system ownership.
* See your work directly enable drug discoveries that will impact millions, collaborating with world-leading scientists in microbiome research and machine learning.
Location: London - 3 days onsite
Salary: £80,000 - £120,000 plus equity