Science cluster
Summary
Rucio is an open-source data management solution used by nuclear, particle and astrophysics experiments, including ATLAS and CMS at CERN's LHC, AMS, DUNE, Belle II, ICARUS, LIGO/VIRGO, CTA, MAGIC and Rubin LSST, and is a key component of the ESCAPE Science Cluster. It manages exabytes of scientific (raw) data and transfers them worldwide so that data sets reach end-users efficiently. However, existing workflows often require duplicating data to third-party systems to make it openly accessible. The project "Streamlining open data policies in Rucio data management platform" aims to enhance Rucio by embedding open data policies directly into its core, enabling seamless data sharing without duplication and supporting interdisciplinary research.
Challenge
Open Science project, Open Science Service, Citizen science, Cross-domain/Cross-RI
The growing need for interdisciplinary research, particularly in multi-messenger science, has highlighted the importance of sharing scientific data between institutions that perform complementary research (infrared, gravitational waves, etc.). However, institutions often rely on archaic or error-prone methods for data sharing. Moreover, because Rucio does not recognise open data as a main data type, experiments must copy this data to other systems to make it FAIR, incurring extra storage costs for the copy. The main objective of this project is therefore to introduce native support for managing open data in Rucio, with definable embargo and public access policies, so that no copy is needed to make data open.
Solution
The project proposes a paradigm shift: instead of duplicating data, Rucio will link open data directly at the source, incorporating FAIR principles natively into its architecture. By managing embargoes and public access policies within Rucio itself, scientific data can remain within its custodial storage while being accessible as open data. This approach not only prevents unnecessary duplication but also ensures long-term preservation using Rucio’s robust replication mechanisms.
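As a rough illustration of how embargo and public access policies could decide whether data is served as open data from its custodial storage, the following Python sketch models a policy check. It is not Rucio's API: the `OpenDataPolicy` record and the `is_publicly_accessible` function are hypothetical names introduced here, and the rule that an embargo simply expires at a fixed date is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class OpenDataPolicy:
    """Hypothetical policy attached to a dataset (not a Rucio type)."""
    public: bool                               # dataset has been declared open
    embargo_until: Optional[datetime] = None   # None means no embargo


def is_publicly_accessible(policy: OpenDataPolicy,
                           now: Optional[datetime] = None) -> bool:
    """Return True if the dataset may be served as open data at time `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    if not policy.public:
        # Never declared open: stays restricted regardless of any embargo.
        return False
    if policy.embargo_until is not None and now < policy.embargo_until:
        # Declared open, but the embargo period has not ended yet.
        return False
    return True
```

Under these assumptions, a dataset becomes open as soon as its embargo date passes, with no copy made: only the answer of the policy check changes, while the data stays where it is.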
Scientific Impact
The integration of open data support within Rucio will benefit a wide range of research infrastructures, from particle physics to astronomy and beyond. The project will reduce costs, improve resource efficiency, and lower the environmental footprint by minimising data duplication. Moreover, it will foster collaboration across institutions and disciplines, empowering the broader scientific community, including citizen scientists, to access and exploit data more easily, further advancing scientific progress.
Principal investigator
Hugo Gonzalez is a Storage Engineer working at CERN's Storage and Data Management. Hugo has more than 10 years of experience with diverse software-defined storage platforms.
His role includes operating and managing vast distributed storage systems that store over an exabyte of data.
He coordinates the Rucio project activities inside the CERN IT department and is a core member of the Rucio project.
Prior to joining CERN, he worked as a Software Engineer for ESA's HUMSAT project.
He is an enthusiast of open-source technologies, and his preferred programming language is Go.