I2. (Meta)data use vocabularies that follow FAIR principles

Use open, well-defined vocabularies

Concepts are fundamental to knowledge organisation systems. They are to be understood as "units of thought": ideas, meanings, or (categories of) objects and events. As such, concepts exist in consciousness as abstract entities, independent of the terms used to label them.

A controlled vocabulary is a knowledge organisation system that contains a structured set of concepts for classifying and organising data, so that the data can be found and retrieved later. These concepts relate to data descriptors that are connected by explicit relationships (hierarchical or associative). The descriptors are used to distinguish and define the characteristics of knowledge resources in a specific domain. They contain data values for general terms, individual names, and other values necessary for the structured description of data.

With the help of controlled vocabularies, resources can be queried, searched, analysed, and linked to other relevant information objects.

The most common types of controlled vocabularies are:

  • Thesaurus – an organised compilation of terms and their (predominantly natural language) designations, which serve to index, store, and retrieve information in a documentation field.
  • Classification – a system primarily based on classifying things or concepts into groups or classes, with a detailed explanation of these classification methods.
  • Application ontology – a formally organised, machine-interpretable representation of a set of terms and the relationships between them in a specific domain, used to categorise and represent data. It can be derived from a reference model.
  • Keyword list – a list of terms used to describe topics in an information system.
  • Taxonomy – a system that organises things and terms into groups based on their common characteristics and/or differences.
  • Word list – a list of designations used to describe terms in a specific field.
  • Glossary – an alphabetical list of terms with their definitions, used in a particular context.
  • Authority data – a dataset that describes individual entities (instances of specific classes), such as persons, organisations, geographical locations, and works, to the extent that they are uniquely identifiable and referable.

In practice, often no clear distinction is made between these types when vocabularies are named. They are therefore increasingly referred to as 'semantic artefacts'.

The selected vocabularies should have PIDs for their entities and be accessible, interoperable, and carefully documented, so that they are themselves FAIR. Use openly licensed, well-developed, and published vocabularies that are widely recognised in the professional community. Integrating or referencing such vocabularies makes the meanings of the concepts and properties represented in the data precise.

Terminology Services 4 NFDI (TS4NFDI), one of the NFDI's basic services currently under development, is a cross-domain service for providing, curating, developing, harmonising and mapping terminologies. Its goal is to support the convergence of individual solutions and integrate them into a standardised, interoperable and sustainable suite of services.

Recommendations for vocabularies, authority data, and application ontologies in the fields of cultural research and cultural heritage are also available.

The role of data producers

Use vocabularies relevant to your field and organise your research results accordingly from the beginning of your project. Always integrate the PIDs (as URIs) of the terms alongside their language labels into your data to ensure unambiguity, even when the data is analysed automatically, and to enable use in the context of Linked Data applications.
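As a minimal sketch of this recommendation, the following Python snippet stores a term as a URI together with its language-tagged labels instead of as a bare string. The field names (`uri`, `labels`), the helper `make_term`, and the record title are hypothetical and not prescribed by any standard; the GeoNames URI for Potsdam is the one cited in this guideline.

```python
import json


def make_term(uri: str, labels: dict[str, str]) -> dict:
    """Bundle a term's persistent URI with its language-tagged labels."""
    if not uri.startswith(("http://", "https://")):
        raise ValueError(f"not an HTTP(S) URI: {uri!r}")
    if not labels:
        raise ValueError("at least one language label is required")
    return {"uri": uri, "labels": dict(labels)}


# Hypothetical metadata record: the place is identified by its URI,
# the labels are kept only for human readers.
record = {
    "title": "View of the city",
    "place": make_term(
        "https://sws.geonames.org/2852458/",
        {"de": "Potsdam", "en": "Potsdam"},
    ),
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```

Keeping the URI and the labels side by side means that automated processing can rely on the identifier, while the labels remain available for display and search.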

Please note that the URI of a term often does not match the URL of the website. For example, the URI for the city of Potsdam in GeoNames is https://sws.geonames.org/2852458/, while the URL of the website is https://www.geonames.org/2852458/potsdam.html. References to the URIs can be found via menu items such as ‘semantic view’, ‘Concept ID’, and the RDF or JSON-LD formats available.
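In the GeoNames example above, both addresses share the numeric identifier, so the URI can be derived from the page URL. The sketch below assumes that this URL pattern also holds for other GeoNames entries; the function name is hypothetical. Other vocabularies use different patterns, so the vocabulary's own documentation remains the authoritative source.

```python
import re

# Assumed pattern of a GeoNames page URL: host, numeric ID, optional slug.
GEONAMES_PAGE = re.compile(r"^https?://(?:www\.)?geonames\.org/(\d+)(?:/.*)?$")


def geonames_uri_from_page_url(page_url: str) -> str:
    """Turn a GeoNames website URL into the corresponding term URI."""
    match = GEONAMES_PAGE.match(page_url)
    if match is None:
        raise ValueError(f"not a recognised GeoNames page URL: {page_url!r}")
    return f"https://sws.geonames.org/{match.group(1)}/"


print(geonames_uri_from_page_url("https://www.geonames.org/2852458/potsdam.html"))
```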

Often, vocabularies that are widely accepted and published within the community can only be used in part, because terms are needed that are not yet covered by any known published vocabulary or ontology. Examples include insufficient coverage of objects or visual content from non-European cultures, or the lack of specific technical terms, e.g. for historical glass processing techniques in the Art and Architecture Thesaurus. The GND and Getty vocabularies allow their user communities to supplement the vocabulary in line with their editorial rules.

You can also create a project-specific vocabulary and publish it under an open license, preferably as Linked Open Data in a machine-readable format. Use specialised tools for thesaurus creation and publication, such as VocBench (open source), ACDH Vocabs Editor (MIT License), xTree, or the vocabulary modules of collection management software, as well as the editor Protégé for ontology modelling. The vocabulary can be published using tools like Skosmos, an open-licensed, web-based SKOS browser. SkoHub serves the same purpose and offers additional vocabulary management functions.

The self-created vocabulary should, as much as possible, follow the structure of a published vocabulary and be designed as a local extension of it.

The vocabulary or ontology that applies to a specific data element should be clearly specified. Even for elements where this does not apply, the value type of the element should be clearly defined in the metadata of the digital object using a publicly available vocabulary or ontology.
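A simple way to make this specification checkable is to declare, per data element, the URI namespace of the vocabulary that applies and to test values against it. The sketch below is only illustrative: the element name `place`, the mapping `ELEMENT_VOCABULARIES`, and the function `find_violations` are hypothetical, and a prefix check is a deliberately crude stand-in for real validation against the vocabulary.

```python
# Hypothetical declaration: which vocabulary namespace applies to which element.
ELEMENT_VOCABULARIES = {
    "place": "https://sws.geonames.org/",
}


def find_violations(record: dict[str, list[str]]) -> list[str]:
    """List values that do not come from the vocabulary declared for their element."""
    problems = []
    for element, namespace in ELEMENT_VOCABULARIES.items():
        for value in record.get(element, []):
            if not value.startswith(namespace):
                problems.append(f"{element}: {value!r} is not in {namespace}")
    return problems


good = {"place": ["https://sws.geonames.org/2852458/"]}
# The website URL is not the term URI, so it is reported.
bad = {"place": ["https://www.geonames.org/2852458/potsdam.html"]}

print(find_violations(good))
print(find_violations(bad))
```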

The role of data platform operators

Provide examples of vocabularies that can be used by the expert communities you represent and that can be addressed via the platform's interfaces.

Wherever possible, enable the use of widely accepted authority data or identification systems, such as ORCID for persons, ROR for organisations, the Crossref Funder Registry for funding organisations, the DFG Classification of Scientific Disciplines, and the GND.

Ensure that the relevant attributes are stored in the metadata to guarantee unambiguity and machine readability.

Further information on FAIR vocabularies

Hugo, Wim / Le Franc, Yann / Coen, Gerard et al.: D2.5 FAIR Semantics Recommendations Second Iteration (1.0), 2020

Corcho, Oscar / Ekaputra, Fajar J. / Heibi, Ivan et al.: A maturity model for catalogues of semantic artefacts, Scientific Data 11, 479, 2024

Cox, Simon J. D. / Gonzalez-Beltran, Alejandra N. / Magagna, Barbara et al.: Ten simple rules for making a vocabulary FAIR, PLoS Computational Biology 17 (6): e1009041, 2021

Xu, Fuqi / Juty, Nick / Goble, Carole et al.: Features of a FAIR vocabulary, Journal of Biomedical Semantics, 14, 6, 2023

Zaytseva, Ksenia / Ďurčo, Matej: Controlled Vocabularies and SKOS. Version 1.1.0. Edited by Matej Ďurčo and Tanja Wissik. DARIAH-Campus, [Training module], 2020

Further information on contributing to reference vocabularies

Gemeinsame Normdatei (Integrated Authority File)

Getty vocabularies

Example: Creation of a project-specific vocabulary with LOD publication

Project "Digitization of Gandharan Artefacts (DiGA)": Elwert, Frederik / Pons, Jessie: Brücken bauen für Buddha - Das Projekt "Digitalisierung Gandharischer Artefakte" (DiGA) und die Pelagios Working Group "Linked Data Methodologies in Gandharan Buddhist Art and Texts", in: DHd 2022 Kulturen des digitalen Gedächtnisses. 8. Tagung des Verbands "Digital Humanities im deutschsprachigen Raum" (DHd 2022), Potsdam, 7. März 2022

Amato, Antonio / Elwert, Frederik / Pons, Jessie: Digitization of Gandharan Artefacts: A Project for the Preservation and the Study of the Buddhist Art of Pakistan. A Digitization Concept, 2022

DiGA Thesaurus on GitHub and Skosmos