Direct message the job poster from Arrayo
We are looking for a Data Engineer to join our cross-disciplinary team and help shape the future of scientific data management in drug discovery. In this role, you’ll play a central part in developing scalable, metadata-rich data products that adhere to FAIR principles, working hand-in-hand with platform, scientific, and AI/ML teams.
Your contributions will accelerate cheminformatics research by building and maintaining curated datasets and robust data pipelines. These pipelines support analytics, modeling, and deep learning within platforms such as AWS, Azure Databricks, and Domino Data Lab.
Key Responsibilities
Develop, deploy, and maintain scalable, cloud-native data pipelines for molecular modeling and related domains.
Operationalize architectural blueprints using modern orchestration frameworks and cloud services (AWS/Azure).
Partner with scientists, cheminformaticians, and data scientists to understand domain-specific requirements and deliver efficient, reusable data solutions.
Process and integrate molecular property datasets, embedding rich metadata to maximize downstream value for AI/ML applications.
Establish data quality, lineage, and governance standards in line with FAIR principles, ensuring reproducibility, traceability, and compliance.
Enable interactive dataset exploration through tools like Spotfire.
Shape schema design, enrich metadata, and develop APIs for reliable and flexible data access.
Optimize storage and compute performance across data lakes and warehouses (e.g., Delta Lake, Parquet, Redshift).
Document data contracts, pipeline logic, and operational best practices to ensure long-term sustainability and effective collaboration.
Required Qualifications
Demonstrated experience as a data engineer in the biopharmaceutical or life sciences industry, particularly supporting drug discovery or translational research.
Hands-on work with molecular structure data, computed properties, simulation outputs, or imaging datasets.
Proficiency in Python (including Pandas or PySpark) and SQL, with exposure to ETL/orchestration tools such as Airflow or dbt.
Strong knowledge of cloud-native services on AWS (e.g., S3, Glue, Lambda, Athena) and Azure (Data Factory, Data Lake).
Track record of collaborating with scientific teams and translating research needs into scalable data solutions.
Preferred Qualifications
Experience with cheminformatics libraries (e.g., RDKit, Open Babel, CDK).
Familiarity with scientific data standards, ontologies, and best practices for metadata capture.
Understanding of data science workflows in computational chemistry, bioinformatics, or AI/ML-driven research.
Orchestration & ETL: Apache Airflow, Prefect
Scientific Libraries (Preferred): RDKit, Open Babel, CDK
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Engineering, Research, and Information Technology
Industries
Biotechnology Research, Pharmaceutical Manufacturing, and IT Services and IT Consulting