The BBMRI-ERIC team of researchers behind the Common Provenance Model have seen their work come to fruition with the launch of “ISO/TS 23494-1:2023 – Biotechnology — Provenance information model for biological material and data — Part 1: Design concepts and general requirements”. This marks a significant step forward in realising a provenance information model for biological material and data and requirements to support data interoperability and serialisation.
The main input for the standard's development was BBMRI's experience that the life cycle of a biological sample, its derived data and other research objects such as AI models is often distributed across various organisations.
Combined with reports of reproducibility and traceability issues in the life sciences, this experience led the team to identify an urgent need for a standard for trustworthy, privacy-preserving, machine-actionable provenance information that documents the whole life cycle of a research object.
For samples and sample-derived data, the life cycle starts with sample acquisition, continues with sample processing, sample analysis and derived data generation, and ends with data analysis. The research objects in scope may also include directly acquired data, such as clinical, measurement or observational data.
These issues drove the agenda of ISO TC 276 Working Group 5 to create a new provenance information standard for biological material and data. The proposal highlighted the distributed nature of the required provenance, which may be generated and provided by heterogeneous organisations involved in handling biological material and derived data. These organisations include hospitals, analytical laboratories, biobanks, computational research centres, device providers, pharma suppliers and drug developers, among others.
The resulting provenance information standard aims to standardise provenance information in such complex, multi-organisational and heterogeneous environments. It addresses reproducibility issues in the life sciences by supporting the traceability of data precursors. In addition, the standard takes into account the handling of sensitive provenance information, which often relates to patients’ diagnoses, to donors of biological material or to the transportation of pathogen samples, all of which are common BBMRI use cases.
Another BBMRI experience that became a strong motivator is that an effective assessment of the quality (that is, fitness for purpose) of the resulting datasets must take into account the quality control of all data precursors: acceptable quality in only part of the sample life cycle is not enough (the so-called garbage in, garbage out situation). All processes must be documented as part of the quality assurance mechanisms, and it must be clear who is responsible for each process.
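The idea that a dataset is only as good as its weakest precursor can be illustrated with a minimal Python sketch. The `Step` structure, its field names and the example life cycle are hypothetical and chosen purely for illustration; they are not taken from the standard or from the Common Provenance Model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Step:
    """One documented process in a life cycle (hypothetical structure)."""
    name: str
    responsible: str   # organisation accountable for the process
    qc_passed: bool    # outcome of this step's own quality control
    precursors: list[Step] = field(default_factory=list)


def failing_steps(step: Step) -> list[str]:
    """Names of all steps in the lineage of `step` (itself included) that failed QC."""
    failed: list[str] = []
    seen: set[int] = set()
    stack = [step]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue  # a precursor shared by several steps is checked once
        seen.add(id(current))
        if not current.qc_passed:
            failed.append(current.name)
        stack.extend(current.precursors)
    return sorted(failed)


def fit_for_purpose(step: Step) -> bool:
    """True only if every step in the lineage passed its quality control."""
    return not failing_steps(step)


# Illustrative life cycle: acquisition -> processing -> analysis -> data analysis
acquisition = Step("sample acquisition", "hospital", qc_passed=True)
processing = Step("sample processing", "biobank", qc_passed=False,
                  precursors=[acquisition])
analysis = Step("sample analysis", "laboratory", qc_passed=True,
                precursors=[processing])
dataset = Step("data analysis", "research centre", qc_passed=True,
               precursors=[analysis])
```

In this sketch the final dataset passes its own quality control, yet `fit_for_purpose(dataset)` is false because an earlier processing step failed, and `failing_steps(dataset)` points to the step, and therefore the responsible organisation, that needs attention.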
BBMRI is uniquely positioned to address quality aspects of samples, as it is often involved in various parts of the sample life cycle – BBMRI helps biobanks and other organisations comply with quality standards from a patient via a sample to data.
In this context, the provenance standard series will complement quality-related, domain-specific standards that prescribe what information shall be documented by prescribing how that information should be recorded in a machine-actionable way. The published standard is only the first in the series and provides general requirements for provenance information management. How domain-specific information is recorded in provenance is left to other parts of the series.
Over the last decade, we have witnessed reproducibility issues in the life sciences. There is strong evidence that these issues are caused by poor documentation of samples and of the techniques applied to them to generate data.
To enable reproducibility and effective quality assessment of related research objects, the traceability of all required information must be ensured first. However, provenance information models and the required tools and services are not enough on their own to address the issues: proper provenance information management is also required, both to support the long-term sustainability of generated provenance and to preserve important properties of provenance, such as trustworthiness and reliability.
The published ISO/TS 23494-1 standard will help address the reproducibility issues by setting out general provenance information management requirements. As a result, the standard will enable certification of organisations that manage provenance according to current best practices. On the one hand, this may improve their competitive position in the market; on the other, it will enable regulators to require certification, for instance for organisations that are to receive public funding.
The provenance information can then be used for programmatic discovery and querying of fit-for-purpose research objects (samples, data or models): ideally, the requirements on the research objects can be specified as constraints on the provenance information. Depending on whether parts of the provenance are also exposed to discovery services, the provenance can either be used directly in discovery or be queried as part of the access negotiation process.
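As a rough illustration of requirements expressed as constraints on provenance, the Python sketch below filters a small in-memory catalogue. The catalogue entries, the attribute names (`storage_temp_c`, `cold_ischemia_min`) and the helper functions are assumptions made for this example only; a real discovery service would query provenance in whatever serialisation and vocabulary the standard series defines.

```python
from typing import Any, Callable

Provenance = dict[str, Any]
Constraint = Callable[[Provenance], bool]

# Hypothetical catalogue: each research object exposes part of its provenance.
CATALOGUE = [
    {"id": "sample-001",
     "provenance": {"storage_temp_c": -80, "cold_ischemia_min": 12}},
    {"id": "sample-002",
     "provenance": {"storage_temp_c": -20, "cold_ischemia_min": 10}},
    {"id": "sample-003",
     "provenance": {"storage_temp_c": -80}},  # ischemia time not documented
]


def at_most(key: str, limit: float) -> Constraint:
    """Constraint: the attribute must be documented and not exceed `limit`."""
    def check(prov: Provenance) -> bool:
        value = prov.get(key)
        return value is not None and value <= limit
    return check


def discover(catalogue: list[dict[str, Any]],
             constraints: list[Constraint]) -> list[str]:
    """Return ids of objects whose provenance satisfies every constraint."""
    return [
        obj["id"]
        for obj in catalogue
        if all(constraint(obj["provenance"]) for constraint in constraints)
    ]


# A researcher's requirements, stated as constraints on provenance.
requirements = [
    at_most("storage_temp_c", -70),
    at_most("cold_ischemia_min", 30),
]
matches = discover(CATALOGUE, requirements)
```

Here only `sample-001` matches: `sample-002` was stored too warm, and `sample-003` is rejected because the required attribute is not documented at all. Treating missing provenance as a failed constraint is a design choice of this sketch, in keeping with the point that undocumented processes cannot support a fitness-for-purpose decision.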
Petr Holub and Jörg Geiger, later joined by Gianluigi Zanetti, initiated the development of the standard. Petr and Jörg lead the project for the whole 23494 standard series, while the development of the published ISO/TS 23494-1 standard was led by Petr Holub, Jörg Geiger and Rudolf Wittner.
The core development team of the standard includes experts from multiple European organisations, namely the University and the University Hospital of Würzburg; the Medical University of Graz; CRS4 – Center for Advanced Studies, Research and Development in Sardinia; and Masaryk University.
The ISO standardisation process is also based on collaboration with industry experts who contribute to the standard development through their participation in the ISO Technical Committees.
The standard has been developed under the auspices of Technical Committee 276 ‘Biotechnology’. Collaboration with academic and research communities was established mainly through EOSC Life, as the Common Provenance Model, an open conceptual foundation for the standard series, received development support through this EC-funded project.
You can hear Petr and Rudolf talking about the provenance model in this episode of the BBMRI podcast.