R1.2 (Meta)data are associated with detailed provenance

In this context, ‘provenance’ does not refer to the ownership and origin history of the objects described in the data, but rather to the history of the data and metadata themselves. Provenance information specifies the purpose behind the generation of the data, the research questions that were addressed, and the source of the data material. It also explains how the data were modified, in what context they can be reused, and how reliable they are. Well-documented data provenance makes the history and context of a dataset transparent and traceable, which promotes trust, reproducibility, and appropriate reuse.

Therefore, include the following aspects in your metadata or the published accompanying documentation of your project:

  • Names of all individuals involved in the data processing, including their roles and contributions, insofar as this complies with data protection regulations
  • Description of the motivation that led to the creation of the data (project goals and priorities)
  • Description of the methods applied
  • Description of the workflow that led to the data generation: Were the data machine-generated or created manually/intellectually? How were they subsequently processed?
  • If data from other sources were used: description of their origin, use, and modification
  • Software used for data generation, processing, or viewing: scripts, program code, DTDs, or XML schema files, including version information
  • Data models, formats, ontologies, and cataloguing and editorial guidelines used, in machine-readable form (e.g. as XML, DTD, or XML schema files), including version information
  • Creation and modification timestamps of the data and metadata
  • Source references for statements and information that were adopted
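
The aspects listed above can also be captured in a small machine-readable record that is published alongside the data. The following Python sketch is only an illustration: the class names, field names, and all example values are hypothetical and do not follow any particular metadata standard. In practice, use the schema that your repository or discipline prescribes.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Return the current time as an ISO 8601 timestamp in UTC."""
    return datetime.now(timezone.utc).isoformat()


@dataclass
class Contributor:
    name: str
    role: str  # e.g. "transcription", "data modelling"


@dataclass
class ProvenanceRecord:
    # All field names are illustrative assumptions, not a standard schema.
    motivation: str
    methods: str
    workflow: str
    contributors: list = field(default_factory=list)
    external_sources: list = field(default_factory=list)
    software: dict = field(default_factory=dict)  # tool name -> version
    schemas: dict = field(default_factory=dict)   # schema name -> version
    created: str = field(default_factory=utc_now_iso)
    modified: str = field(default_factory=utc_now_iso)

    def to_json(self) -> str:
        """Serialise the record, including nested contributors, as JSON."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values only; replace them with your project's real information.
record = ProvenanceRecord(
    motivation="Placeholder: project goals and priorities",
    methods="Placeholder: methods applied",
    workflow="Placeholder: manual transcription, then automated validation",
    contributors=[Contributor(name="Example Person", role="transcription")],
    external_sources=["Placeholder: source dataset, how it was used and modified"],
    software={"example-transcription-tool": "1.0.0"},
    schemas={"example-schema.xsd": "2.1"},
)

print(record.to_json())
```

Saving the printed JSON as a file next to the dataset keeps the provenance information together with the data it describes.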

For dynamic data that can be continuously updated, it is desirable to keep earlier versions available and addressable via a persistent identifier (PID) so that the dataset's content remains citable. Also, publish any software created in the project context under a licence that is as open as possible, for example via GitHub.

Further information on data documentation

Böker, E.: Warum dokumentieren?, in: Forschungsdaten.info, 2024/04/25

Edig, Xenia van / Dellmann, Sarah / Renziehausen, Anna-Katharina et al.: Data Papers – An Ode to Data, in: TIB Blog, 2022/02/15

Alkemade, Henk / Claeyssens, Steven / Colavizza, Giovanni et al.: Datasheets for Digital Cultural Heritage Datasets, in: Journal of Open Humanities Data, vol. 9, 2023

European Commission: Directorate-General for Research and Innovation / EOSC Executive Board / Corcho, Oscar / Eriksson, Magnus et al.: EOSC interoperability framework – Report from the EOSC Executive Board Working Groups FAIR and Architecture, Publications Office, 2021, p. 51–56

Middle, Sarah: A documentation checklist for (Linked) humanities data, in: International Journal of Digital Humanities 5, 2023, p. 353–371

Follow naming conventions

Adhering to precise and consistent naming conventions, that is, commonly recognised patterns for naming data, makes it much easier for future generations of researchers to find, access, and understand digital objects and datasets. The German Network of Educational Research Data provides guidance on naming and organising files.

The role of data producers

Consult the best-practice guidelines for your research discipline or field to find the most suitable naming convention. Apply it from the beginning of your project and follow it consistently throughout.
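
Consistency is easier to maintain if file names are checked automatically. The Python sketch below assumes a purely hypothetical convention of the form project_YYYY-MM-DD_description_vNN.ext; the pattern and the function names are illustrative assumptions and should be replaced by the convention recommended for your discipline.

```python
import re

# Hypothetical convention: project_YYYY-MM-DD_description_vNN.ext
# (lowercase letters, digits, and hyphens only; two-digit version number).
# Note: the date part is checked for its shape only, not for calendar validity.
FILENAME_PATTERN = re.compile(
    r"[a-z0-9]+_\d{4}-\d{2}-\d{2}_[a-z0-9-]+_v\d{2}\.[a-z0-9]+"
)


def follows_convention(filename: str) -> bool:
    """Return True if the whole file name matches the assumed pattern."""
    return FILENAME_PATTERN.fullmatch(filename) is not None


def nonconforming(filenames):
    """Return the file names that do not follow the assumed convention."""
    return [name for name in filenames if not follows_convention(name)]


# Placeholder file names for illustration.
print(nonconforming([
    "letters_2024-04-25_transcription_v01.xml",
    "Final Version (2).xml",
]))
```

Running such a check before each deposit or commit helps to catch deviations early, while they are still cheap to correct.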

The role of data platform operators

Provide best practices for creating and applying specific naming conventions.

Systematically document data

To clarify what can and cannot be expected in a dataset, the data should be systematically documented. This transparency promotes trust and therefore the reuse of the data.

The role of data producers

Incorporate detailed provenance information into your research data and publish it alongside the data.

If your repository asks you to provide a meaningful description of your dataset with metadata, carefully and thoroughly complete the provided form. Do this even for sections that are marked as non-mandatory or optional.

The role of data platform operators

Provide documentation templates that include descriptions of the layout, structure, and content of datasets, facilitating the creation of the accompanying documentation mentioned at the beginning of this section. Additionally, templates can be provided for methodology, a list of abbreviations, descriptions of data gaps, the database structure, etc. These help data contributors systematically document the data provenance.

Preserve data integrity

The collected research data should be identical to the research data made accessible later. To ensure data reliability, checks for data integrity should be carried out.

The role of data producers

Implement a version control method. Your software often has built-in features that you can use for this. Documenting every change in a revised version of a dataset is essential for establishing the authenticity of that dataset.

The role of data platform operators

To determine whether a file has been altered, it is crucial to track provenance within the scope of version control, that is, the origin of the data and all changes made over time, and to compare every copy with the original. Data integrity can be checked using a validation value, such as a checksum, or by comparing two files directly. A mechanism for handling different versions should also be in place, for example adding a version component to the identifier as a search parameter.
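
A checksum comparison of this kind can be sketched in a few lines of Python using the standard library module hashlib. The function names are illustrative assumptions, and SHA-256 is only one possible choice of algorithm; use whichever checksum your platform already records.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read until an empty bytes object signals the end of the file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unaltered(path, expected_checksum):
    """Return True if the file's checksum matches the recorded one."""
    return sha256_of(path) == expected_checksum.strip().lower()


def same_content(path_a, path_b):
    """Return True if two files have identical checksums."""
    return sha256_of(path_a) == sha256_of(path_b)
```

The checksum is recorded when the data are first deposited. Calling is_unaltered with that recorded value at a later point indicates whether the file still matches the original, and same_content compares a copy with the original directly.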

Example of version control

DOI versioning procedure at Zenodo