3.2 The FAIR Principles apply to both data and metadata

The 15 guiding FAIR Principles refer to ‘(meta-)data’. What does that mean? On the one hand, they apply to the actual data elements (bit sequences). Examples include a text, a file, an image, the source code of software, a 3D model, a service, or a time series (e.g. an audio signal). They can also refer to an aggregate of many units that can itself be individually addressed, such as a database, a digitized book, the digitized materials of an estate, a recording system, software, or a research data publication with multiple components.

On the other hand, they apply to metadata. Metadata describe the properties of other data. They provide information about the content, properties, or structure of those data and indicate the contexts in which the data are found or how they can be used.

In the process of implementing the FAIR Principles, data and metadata take on different functions. The methods used to make data and the corresponding metadata FAIR can differ, and the degree of FAIRness can vary between the two components. Therefore, it is first necessary to identify the two components and structure them as digital objects (DO).

Digital objects as organisational units

A digital object is initially ‘an object composed of a sequence of bit sequences’. This means that any file can be considered a digital object. Some digital objects can be structured in a simple way, such as a text file. A video, which consists of multiple elements (video track, audio track, container file, and possibly others), can be considered a complex digital object.

For a digital object to be interpreted by machine agents, it must be addressable, structured, and typified. To achieve this, the bit sequence is assigned an identifier, preferably a globally unique and persistent identifier (PID), along with a description of its properties in the form of a metadata unit. The bit sequence, PID, and metadata are linked through distinct relations to form an extended digital object, which can be addressed and processed as a unit of knowledge.
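The relationship between bit sequence, PID, and metadata can be illustrated with a minimal Python sketch. It is only one possible way to model an extended digital object: the class name `DigitalObject`, its field names, the two required metadata keys, and the PID value are assumptions made for illustration and are not taken from any standard or from the cited literature.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class DigitalObject:
    """Hypothetical model of an extended digital object."""
    pid: str                  # identifier, ideally globally unique and persistent
    bit_sequence: bytes       # the data itself
    metadata: dict = field(default_factory=dict)  # description of its properties

    def checksum(self) -> str:
        # A fixity value that lets an agent verify the bit sequence is unchanged.
        return hashlib.sha256(self.bit_sequence).hexdigest()

    def is_machine_interpretable(self) -> bool:
        # Assumed rule of thumb: addressable (has a PID) and
        # typified/structured (metadata states a type and a format).
        return bool(self.pid) and "type" in self.metadata and "format" in self.metadata


# Example with placeholder values; the PID below is made up.
text_object = DigitalObject(
    pid="hdl:00.00000/example-0001",
    bit_sequence="Hello, FAIR world".encode("utf-8"),
    metadata={"type": "Text", "format": "text/plain", "title": "Greeting"},
)
```

In this sketch, a bare file without a PID or without type information would still be a digital object, but `is_machine_interpretable()` would return `False` for it.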

Fig. 1: Relationships between components of a digital object, its aggregator, and repository. De Smedt, K. / Koureas, D. / Wittenburg, P.: FAIR Digital Objects for Science: From Data Pieces to Actionable Knowledge Units, in: Publications, 2020, 8(2), 21, fig. 4. CC BY 4.0

In the field of memory institutions (libraries, archives, museums), the term ‘metadata’ is generally used for the supplementary information with which the mostly physical objects, collections, and resources kept there are organized, described, and managed. It is thus quite common for metadata to refer to non-digital entities, which may be material or immaterial, concrete or abstract. Instead of referring to a bit sequence, metadata can relate, for example, to a painting, a libretto, a source text, a person, a geographic location, an event, or a concept from the history of ideas. If authority data for these non-digital entities are available, they can be addressed via authority data PIDs. In this case, the digital object contains a bit sequence of a dataset in a structured format, which itself contains metadata about a non-digital entity.

To better understand the function of metadata, it is helpful to divide them into different categories. They all play a role in ensuring the FAIRness of the data.

  • Administrative metadata are data about a bit sequence or an object that are relevant for its management. This includes information about its type and association with a specific project context, such as project/resource owners, research leadership, project staff, funding, project duration, and the creation date of the bit sequence. This group of metadata also includes
    • technical metadata: Information required for the use and processing of files (file format, image resolution, compression rate, hardware and software requirements, authentication and security data such as passwords).
    • legal metadata: Information about intellectual property and usage rights.
    • provenance metadata: Information about the origin of data and about modifications made to the bit sequence or to the metadata set itself. This includes the actors involved, timestamps, the type and methods of processing, and the underlying information sources.
    • archival metadata: Information necessary for the long-term management and preservation of digital assets. They ensure the integrity of a digital object throughout its retention period. A common model for this is PREMIS (Preservation Metadata: Implementation Strategies).
  • Descriptive metadata are data about a bit sequence or an object that enable human and machine agents to find, identify, and cite it. Examples include authorship or creation contexts, titles, subject keywords, persistent identifiers, and related publications and objects.
  • Structural metadata describe the internal organisation of complex digital objects or items and establish relationships between their components or to other digital objects. For example, through METS or TEI annotations, they can provide a digital table of contents for a digitized early modern 120-page print, thereby enabling the meaningful organisation of its subunits for subsequent use.

The metadata relevant to a digital object can vary depending on the data format and the intended usage context. Metadata are generally found not only in documentation created specifically for this purpose (e.g. in a database or table) but also at the level of software and system configurations or process control (e.g. log files), from which they must be extracted.

Example of a born-digital image file (bit sequence) and its metadata

Administrative Metadata

  • technical: file format, size, resolution, colour depth, compression rate
  • legal: licensing, usage rights, rights holder
  • provenance: information on the origin of the original file (camera manufacturer and model, exposure time, geocoordinates of the location), modification history (conversion and image editing, specifying the software used, processes, formats, technical parameters, timestamps, and involved actors)

Descriptive Metadata

  • Photographer, time of capture, identification of the depicted entity (e.g. person, object, event, location), context, motivation for creating the image file, reference to other versions and usage contexts addressable via PID
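As a rough illustration, the categories listed above could be recorded in a nested structure like the following Python sketch. All keys and values are hypothetical placeholders chosen for this example, and the small helper `missing_groups` is an assumed convenience function, not part of any metadata standard.

```python
# Hypothetical metadata record for one born-digital image file.
# Every value is a placeholder.
image_metadata = {
    "administrative": {
        "technical": {
            "file_format": "image/tiff",
            "size_bytes": 25_165_824,
            "resolution_ppi": 300,
            "colour_depth_bits": 24,
            "compression": "none",
        },
        "legal": {
            "license": "CC BY 4.0",
            "rights_holder": "Example Archive",
        },
        "provenance": {
            "camera": "ExampleCam X1",
            "exposure_time_s": 0.008,
            "modifications": [
                {
                    "process": "format conversion",
                    "software": "ExampleConverter 1.0",
                    "timestamp": "2024-01-01T12:00:00Z",
                    "actor": "Example Archive staff",
                }
            ],
        },
    },
    "descriptive": {
        "photographer": "Jane Doe",
        "captured": "2024-01-01",
        "depicts": "Example building",
        "related_pids": ["https://example.org/pid/0002"],
    },
}

# Metadata groups this sketch expects to be filled in.
EXPECTED = {
    "administrative": ("technical", "legal", "provenance"),
    "descriptive": (),
}


def missing_groups(record: dict) -> list[str]:
    """Return the names of expected metadata groups that are absent or empty."""
    missing = []
    for category, subgroups in EXPECTED.items():
        if not record.get(category):
            missing.append(category)
            continue
        for subgroup in subgroups:
            if not record[category].get(subgroup):
                missing.append(f"{category}.{subgroup}")
    return missing
```

Calling `missing_groups(image_metadata)` on the complete record returns an empty list; a record lacking legal information would be reported as missing `administrative.legal`.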

The PID of the digital object typically leads to a landing page, an HTML page that displays metadata about the digital object. This page must provide enough information for both humans and machines to identify the digital object. Additionally, the landing page must offer access to the bit sequence itself (text, image, video), as well as to any other available metadata. Machine interpretability of the metadata can be ensured by embedding them in the header of the HTML page as structured data that uses the schema.org vocabulary.
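One common way to embed schema.org metadata in an HTML header is a JSON-LD script element. The following Python sketch shows how such an element might be generated for a landing page. The function name `jsonld_script_tag`, the URLs, and all metadata values are assumptions for illustration; only the schema.org type `ImageObject` and the property names used are taken from the schema.org vocabulary.

```python
import json


def jsonld_script_tag(metadata: dict) -> str:
    """Serialize schema.org metadata as a JSON-LD <script> element."""
    payload = json.dumps(metadata, indent=2, ensure_ascii=False)
    # Escape "</" so that a value cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder record; identifiers and names are made up.
record = {
    "@context": "https://schema.org",
    "@type": "ImageObject",
    "identifier": "https://example.org/pid/0001",
    "name": "Example photograph",
    "creator": {"@type": "Person", "name": "Jane Doe"},
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "encodingFormat": "image/tiff",
    "contentUrl": "https://example.org/files/0001.tif",
}

tag = jsonld_script_tag(record)
```

The resulting string would be placed inside the `<head>` of the landing page, where a harvester can read the PID (`identifier`) and the location of the bit sequence (`contentUrl`) without parsing the visible page content.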