In this context, ‘provenance’ does not refer to the ownership and origin history of the objects described in the data, but to the origin and history of the data and metadata themselves. Provenance information specifies the purpose behind the generation of the data, the research questions that were addressed, and the source of the data material. It also explains how the data were modified, in what context they can be reused, and how reliable they are. Well-documented data provenance makes the history and context of a dataset transparent and traceable, which promotes trust, reproducibility, and appropriate reuse.
Therefore, include the following aspects in your metadata or the published accompanying documentation of your project:
For dynamic data, which can be continuously updated, it is desirable to keep earlier versions available and addressable via a PID to ensure the citability of the dataset's content. Also, publish any software created in the project context under a licence that is as open as possible, for example via GitHub.
Böker, E.: Warum dokumentieren?, in: Forschungsdaten.info, 2024/04/25
Adhering to precise and consistent naming conventions (commonly recognised patterns for naming data) makes it much easier for future generations of researchers to find, access, and understand digital objects and datasets. The German Network of Educational Research Data provides guidance on naming and organising files.
Consult the best-practice guidelines for your research discipline or field to find the most suitable naming convention. Adopt it at the very beginning of your project and apply it consistently throughout.
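A naming convention is easiest to follow consistently when it can be checked automatically. The following is a minimal sketch, assuming a purely hypothetical pattern of the form `<project>_<YYYY-MM-DD>_<description>_v<NN>.<ext>`; the pattern, the function names, and the example file names are illustrative only and should be replaced by whatever your discipline's guidelines recommend.

```python
import re
from typing import Dict, List, Optional

# Hypothetical convention (an assumption, not a recommendation from any guideline):
#   <project>_<YYYY-MM-DD>_<description>_v<NN>.<ext>
#   e.g. "coins_2024-04-25_site-survey_v02.csv"
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename: str) -> Optional[Dict[str, str]]:
    """Return the components of a file name, or None if it breaks the convention."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


def nonconforming(filenames: List[str]) -> List[str]:
    """List the file names that do not follow the convention."""
    return [name for name in filenames if parse_name(name) is None]
```

Running `nonconforming` over a project folder's file names at regular intervals helps catch deviations early. Note that the sketch only checks the shape of the date, not whether it is a valid calendar date.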
Provide best practices for creating and applying specific naming conventions.
To clarify what can and cannot be expected in a dataset, the data should be systematically documented. This transparency promotes trust and therefore the reuse of the data.
Incorporate detailed provenance information into your research data and publish it alongside the data.
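One possible way to publish provenance information alongside the data is a small machine-readable file stored next to the dataset. The sketch below is only an illustration: the class `ProvenanceRecord` and its field names are assumptions that mirror the aspects of provenance named in this section (purpose, research questions, source, modifications, reuse context, reliability). They do not follow any particular metadata standard, so check which schema your repository expects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record; field names are illustrative assumptions."""
    purpose: str                     # why the data were generated
    research_questions: List[str]    # questions the data were meant to address
    source: str                      # where the data material comes from
    modifications: List[str] = field(default_factory=list)  # changes made to the data
    reuse_context: str = ""          # contexts in which reuse is appropriate
    reliability: str = ""            # known limitations, gaps, or quality notes


def to_json(record: ProvenanceRecord) -> str:
    """Serialise the record so it can be stored next to the dataset."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)
```

The resulting JSON text can be written to a file in the dataset folder and deposited together with the data, so that the provenance travels with every copy.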
If your repository asks you to provide a meaningful description of your dataset with metadata, carefully and thoroughly complete the provided form. Do this even for sections that are marked as non-mandatory or optional.
Provide documentation templates that include descriptions of the layout, structure, and content of datasets, facilitating the creation of the accompanying documentation mentioned at the beginning of this section. Additionally, templates can be provided for methodology, a list of abbreviations, descriptions of data gaps, the database structure, etc. These help data contributors systematically document the data provenance.
The collected research data should be identical to the research data made accessible later. To ensure data reliability, checks for data integrity should be carried out.
Implement a version control method. Built-in features of the software you already use are often sufficient for this. Documenting every change in a revised version of a dataset is essential for the authenticity of each dataset.
To determine whether a file has been altered, track its provenance as part of version control, that is, the origin of the data and all changes made over time, and compare every copy with the original. Data integrity can be checked with a validation value, such as a checksum, or by comparing two files directly. A mechanism for handling different versions should also be in place, such as adding a version component to the identifier as a search parameter.
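The checksum comparison described above can be sketched with Python's standard `hashlib` module. This is a minimal example, assuming SHA-256 as the checksum algorithm (the text does not prescribe one); the function names are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def sha256_of(path: PathLike, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path: PathLike, recorded_checksum: str) -> bool:
    """Check a file against a checksum recorded when the data were collected."""
    return sha256_of(path) == recorded_checksum.strip().lower()


def same_content(first: PathLike, second: PathLike) -> bool:
    """Compare a copy with the original by comparing their checksums."""
    return sha256_of(first) == sha256_of(second)
```

In practice, the checksum would be recorded when the data are first collected and stored with the documentation, so that any later copy can be verified with `is_unaltered`.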
DOI versioning procedure at Zenodo