LabID-PROV: Tracking and Sharing Data Provenance with RO-Crate in Lab Integrated Data

Science cluster

LS RI - Life Sciences

Summary

FAIR data management and the ability to track (meta)data from sample collection to the final computed (or derived) datasets have become crucial as the size and diversity of data generated across scientific disciplines continue to grow. To ensure reproducibility, however, comprehensive provenance metadata must also be available. Lab Integrated Data (LabID) is a web-based integrated platform designed to help individual scientists, research groups and core facilities actively manage, annotate and share their experiments, assays, samples and datasets in compliance with FAIR principles. While processed data can already be stored in LabID and connected to its primary data, associated assays and original samples, accurate modelling of both workflows (WFs) and WF runs is currently lacking. The LabID-PROV project aims to extend the LabID data model to include these concepts, offering a unified application to manage derived data provenance independently of the analysis procedure and platform, and providing a concrete solution to ensure the traceability of derived data.

Research domains:
Life Sciences
Partner(s):
European Molecular Biology Laboratory - EMBL (coordinator)
Project team member(s):
Prof. Eileen Furlong, Charles Girardot (EMBL), Jelle Scholtalbers (LabRise Consulting)

Challenge

Open Science Service, Main RI concerned, Cross-domain/Cross-RI

Managing the increasing volume and variety of (meta)data - from sample collection to derived datasets - poses significant challenges in scientific research. While primary data is regularly shared in repositories, the sharing of final derived datasets remains inadequate. This is often due to the lack of comprehensive provenance metadata that ensures reproducibility and trustworthiness. Furthermore, the diverse languages, tools, and computing environments used by various stakeholders complicate the tracking of workflows and their corresponding metadata, limiting the reusability of shared data and hindering Open Science efforts.

Solution

LabID-PROV proposes to enhance the LabID platform by allowing accurate modelling of both WFs and WF runs. To this end, the project will use several resources in the LS RI Science Cluster: WorkflowHub, a resource for WF indexing, discovery and re-use; RO-Crate, which can be used to package and share research objects, including WFs and WF runs, using the WF and WF Run RO-Crate profiles; Galaxy, a web-based analysis platform; and Zenodo, an open platform for preserving and sharing research output. The main goal is to streamline the import into LabID of datasets (and their metadata) described using the Workflow Run RO-Crate profiles. The project will also implement use cases based on both omics and imaging data, reflecting real-world scenarios, to demonstrate how the new LabID capabilities integrate into the existing Open Science landscape. These use cases will be used to generate online LabID tutorials demonstrating best practices in WF development and the FAIR dissemination of derived data together with their provenance. Finally, wider dissemination of this work will be achieved by registering the training in the Training eSupport System and through a future LabID-PROV workshop.

Scientific Impact

The LabID-PROV project aims to bridge critical gaps in data traceability to facilitate FAIR sharing of derived data together with their provenance metadata, through the integration of LS RI Science Cluster tools and standards. LabID-PROV will not only enhance the reproducibility of scientific findings but also encourage researchers across various disciplines to adopt robust data management practices. The project will ultimately support a wider community of scientists, enabling them to efficiently share valuable research outputs and contribute to a more sustainable Open Science ecosystem.


Keywords
LabID platform, RO-Crate, data provenance, data traceability, data reproducibility
Project start date:
Project duration:
24 months

Principal investigator

Prof. Eileen Furlong
EMBL
BIO

Eileen Furlong is Head of the Genome Biology Department and Senior Scientist at EMBL. She studied biochemistry at University College Dublin, where she obtained her Ph.D., and moved to developmental biology for her postdoc at Stanford University. She became a group leader at EMBL in 2002 and has been head of the department since 2009. She is a recipient of two ERC Advanced Grants, an elected member of EMBO, Academia Europaea and the Leopoldina, and a Fellow of the Royal Society (FRS). Her work uncovered different mechanisms of genome regulation, including how enhancers function and regulate developmental programmes.

QUOTE
"Metadata is what makes data trustworthy and reusable. Science is rapidly becoming the science of data, opening the door to new capabilities like AlphaFold. Computing data and models is resource-intensive; our results should be shared with rich annotations so that everyone can build on them."