3.1 Both humans and machines are intended as ‘data analysers’

The number of publications in the digital space is steadily increasing, along with the range of digital tools and of processing and analysis methods that enable new ways of discovering and analysing relevant data sets. As a result, humans increasingly rely on computer support to find and select the data most relevant to them. The FAIR Principles therefore aim to ensure optimal reusability for humans and machines alike. ‘Machine-readability’ here refers to the ability of computer systems to find, access, integrate, and reuse data with little or no human intervention.

A digital object is ‘machine-actionable’ when it provides information that enables an autonomously operating algorithm to

  • identify the type of the digital object, both in terms of its structure and its intended purpose,
  • determine whether it is useful in the context of the current task, by querying metadata and/or data elements (bit sequences),
  • determine whether it may be used, in accordance with its licence or other access and usage restrictions,
  • reuse it in much the same way a human would for a comparable inquiry, and
  • document the provenance of the data carefully so that the collected data can be cited properly.
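
To make these criteria more concrete, the following minimal sketch in Python (using the rdflib library) shows how an autonomous agent might perform several of these checks against a digital object's metadata record. The object URI, the metadata record, and the list of accepted licences are purely illustrative assumptions, not part of any FAIR specification.

```python
# Minimal sketch: an agent inspecting a digital object's metadata
# before reuse. All URIs and the accepted-licence list are invented
# for illustration; requires the rdflib package.
from rdflib import Graph, URIRef
from rdflib.namespace import RDF, DCTERMS

# Hypothetical metadata record for one digital object (in Turtle).
METADATA = """
@prefix dcterms: <http://purl.org/dc/terms/> .
<https://example.org/object/42>
    a                  <https://example.org/types/TextEdition> ;
    dcterms:license    <https://creativecommons.org/licenses/by/4.0/> ;
    dcterms:provenance <https://example.org/object/42/history> .
"""

ACCEPTED_LICENCES = {URIRef("https://creativecommons.org/licenses/by/4.0/")}

g = Graph()
g.parse(data=METADATA, format="turtle")
obj = URIRef("https://example.org/object/42")

obj_type = g.value(obj, RDF.type)              # identify the object's type
licence = g.value(obj, DCTERMS.license)        # check usage restrictions
usable = licence in ACCEPTED_LICENCES
provenance = g.value(obj, DCTERMS.provenance)  # record provenance for citation

print(f"type: {obj_type}\nusable: {usable}\nprovenance: {provenance}")
```

A real agent would of course resolve persistent identifiers over the network rather than read a local record, but the decision logic remains the same.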

Good research data management in accordance with the FAIR Principles enables a network of data and services that can find each other, communicate, and remain available for reuse. This is fundamentally based on Linked Data technologies, which rely on the globally unique identification of digital objects and their associated resources through Uniform Resource Identifiers (URIs). Such identification is the prerequisite for classifying and categorising resources via links to knowledge organisation systems (e.g. ontologies), allowing them to be ‘understood’ by machines; relationships between resources are likewise expressed using URIs. The leading overarching standards for encoding semantics are RDF (Resource Description Framework), which provides the data model and syntax for data exchange; OWL (Web Ontology Language), a formal description language for creating, publishing, and exchanging ontologies; and SKOS (Simple Knowledge Organization System), a formal language for encoding knowledge organisation systems such as thesauri, classifications, and other controlled vocabularies. The semantics embedded in and linked with the data offer significant advantages for the qualified evaluation of data and for handling data sources with heterogeneous content.
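
To illustrate these building blocks, the following minimal sketch (again in Python with rdflib) identifies a resource by a URI, classifies it by linking it to a SKOS concept, and gives that concept a label and a broader term. Every URI here is invented for the example; in practice one would draw on published vocabularies and authority files.

```python
# Minimal Linked Data sketch: URIs identify resources, RDF expresses
# statements about them, and SKOS encodes a small controlled
# vocabulary. All URIs are invented for illustration.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS, DCTERMS

EX = Namespace("https://example.org/")

g = Graph()
g.bind("skos", SKOS)
g.bind("dcterms", DCTERMS)

painting = URIRef("https://example.org/object/7")
concept = EX["vocab/oil-painting"]

# Classify the resource by linking it to a vocabulary concept ...
g.add((painting, DCTERMS.subject, concept))

# ... and describe the concept itself in SKOS.
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("oil painting", lang="en")))
g.add((concept, SKOS.broader, EX["vocab/painting"]))

print(g.serialize(format="turtle"))
```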

The role of data standards

The primary function of data standards is to make information more analysable through consistent encoding and uniform description. Their effectiveness is rooted not in formal prescription but in shared use: it is the most widely adopted systems and conventions within a user community that determine which types of information are recorded for each information object in a data collection, and in what manner. Such standards are typically well documented and have active user communities that continuously develop their content and technical basis, as well as the software systems that adapt them to current and future challenges. An example of such an adaptation is the Text Encoding Initiative (TEI) document format, which, originating from its initial application in specialised research libraries, is now used internationally in a wide variety of text edition projects across diverse disciplines.

The consistent use of standards is particularly important when creating metadata, including documenting their sources and the conditions of their production. This is a prerequisite for making metadata analysable by machines. Standards ensure the medium- and long-term comprehensibility and reusability of data, enabling people to work with data they did not create themselves. If rarely used or poorly documented formats, schemas, or models are employed, or if the data is embedded in software that is proprietary or no longer accessible due to a lack of maintenance or documentation, the comprehensibility of the data is often no longer guaranteed, even for humans, after just a few years.

Which standards should be followed depends on the practices of the specific domain in handling the respective type of material, the documentation goals of the research project, and the type of data being generated. Collection cataloguing projects in cultural heritage institutions usually follow what is widely used in their respective sector — whether library, archive, museum, or monument preservation. For digitisation projects in institutions with collections, the DFG Practical Guidelines on Digitisation have, for many years, served as a widely recognised and proven good practice recommendation for quality assurance through the use of standards, extending far beyond the original context of DFG grant applications. The latest version of these guidelines was released in 2023.

With regard to the future reuse of research data, data producers should always consider standards that may not yet be widely used in a particular domain but seem appropriate for the research question and hold great potential for good interoperability and reusability of the data. This is especially true for vocabulary standards.

The data standards relevant to research data pertain to several areas of application. Recommendations for the fields represented by NFDI4Culture can be found on the linked pages within this guide.

  • Standard file formats go a long way towards ensuring that files can be used at a later time and by other parties. Here you may find recommendations for file formats.
  • Reference models (top-level ontologies) are complex sets of representation elements (classes, attributes, and relations) used to model a domain of knowledge or discourse. The definitions of these elements provide information about their meaning and about the restrictions that keep their application logically consistent. Reference models enable machines to ‘understand’ metadata by allowing new facts to be inferred from existing data on the basis of reasoning rules (see the sketch after this list). Here you may find recommendations for frameworks and reference models.
  • Metadata standards specify how documentation should be handled within specific professional communities, setting out the criteria, methods and processes to facilitate the shared use of data. Depending on the purpose for which they were developed, they are divided into three areas:
    • Data content standards or cataloguing rules specify how certain resources, objects, or facts should be described. They recommend how information should be structured and formatted, and which vocabulary should be used. Here you may find recommendations for metadata content.
    • Data structure standards refer to metadata element sets. They define the structure, content, semantics, and scope of metadata in the form of a category schema. When available as a formal metadata schema, they can also be processed by software. Here you may find recommendations for metadata structures.
    • Standards for data values refer to controlled vocabularies, whose terms are recommended as data values in the elements of a metadata schema. Controlled vocabularies ensure that concepts (‘units of thought’) are defined, given clear labels and unique identifiers, and can be related to each other. Their use is a prerequisite for complete and accurate search results and for the correct linkage of comparable data. Here you may find recommendations for vocabularies, authority files, and application ontologies.
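
As a closing illustration of how these types of standards interlock, the sketch below (Python with the rdflib and owlrl packages) builds a single metadata record whose structure follows Dublin Core terms and whose data values are controlled-vocabulary URIs rather than free-text strings; a tiny class hierarchy then stands in for a reference model, from which the reasoner infers a new fact. All URIs are invented for the example, and the class hierarchy is a deliberately simplified stand-in for a real top-level ontology.

```python
# Minimal sketch combining the standard types above. Requires the
# rdflib and owlrl packages; all URIs are invented for illustration.
import owlrl
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS, DCTERMS

EX = Namespace("https://example.org/")
g = Graph()

record = EX["object/7"]

# Data structure standard: Dublin Core terms define the elements.
g.add((record, DCTERMS.title, Literal("Still Life with Flowers")))

# Data value standard: a vocabulary URI instead of a free-text string.
g.add((record, RDF.type, EX["vocab/OilPainting"]))

# Reference model (simplified): the vocabulary declares OilPainting
# a subclass of Painting, so a reasoner can derive new facts.
g.add((EX["vocab/OilPainting"], RDFS.subClassOf, EX["vocab/Painting"]))

owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)
print((record, RDF.type, EX["vocab/Painting"]) in g)  # True: fact inferred
```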