The following chapter provides an overview of the data formats and provisioning methods that the ingest pipeline currently ‘understands’. Routines to create data feeds and integrate them into the Culture Knowledge Graph have been built by members of the NFDI4Culture community together with the Culture Knowledge Graph team. In addition, chapter 4 lists further formats and provisioning methods for which such routines are in development but not yet published.
Metadata on digital representations and related research data about material and immaterial cultural heritage may be provided directly as RDF that adheres to the NFDIcore ontology and, more specifically, its NFDI4Culture Ontology (CTO) module. This is the target format used within the Culture Knowledge Graph. RDF can be provided in various serialisation formats, e.g. N-Triples, Turtle, or RDF/XML.
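To illustrate what ‘various serialisation formats’ means in practice, the following minimal sketch states the same single statement in Turtle and in N-Triples; the example.org IRI is a placeholder, and real data should of course use the class and property IRIs defined by NFDIcore/CTO.

```turtle
# Turtle: prefixes keep the statement compact (placeholder IRI)
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<https://example.org/objects/123> rdfs:label "Example object"@en .
```

```ntriples
# N-Triples: the same statement as one fully written-out triple per line
<https://example.org/objects/123> <http://www.w3.org/2000/01/rdf-schema#label> "Example object"@en .
```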
The Culture Graph Interchange Format (CGIF) is a small subset of schema.org. It is not as feature-rich as the target ontology used in the Culture Knowledge Graph, but may be an easier target to provide or convert your data to. CGIF-compatible data can be embedded in regular websites, which makes periodic harvests simple and improves your website’s search engine optimization (SEO) at the same time, because schema.org markup is used by many major search engines to understand and index website content.
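As a sketch of such embedded markup, the following hypothetical JSON-LD block describes a single object on its detail page; the IRIs are placeholders, and the exact set of properties a CGIF record requires is defined in the CGIF specification.

```html
<!-- Embedded schema.org markup for one object (placeholder IRIs) -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "CreativeWork",
  "@id": "https://example.org/objects/123",
  "name": "Example object",
  "url": "https://example.org/objects/123"
}
</script>
```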
Lightweight Information Describing Objects (LIDO) is an XML standard that is well established in cultural-heritage software. LIDO files describe individual museum or collection objects or object groups. The main challenge of transforming this format for the Culture Knowledge Graph is the extraction of IRIs, since providing them is not a LIDO requirement; this applies especially to the recordInfoLink element, which identifies the object.
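For orientation, a minimal fragment of the administrative metadata of a LIDO record might look as follows; the IRI is a placeholder, and the surrounding record structure is abbreviated.

```xml
<!-- Abbreviated LIDO fragment: a stable IRI in recordInfoLink identifies the record -->
<lido:recordWrap xmlns:lido="http://www.lido-schema.org">
  <lido:recordInfoSet>
    <lido:recordInfoLink>https://example.org/objects/123</lido:recordInfoLink>
  </lido:recordInfoSet>
</lido:recordWrap>
```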
A simple way to provide your data is to submit it as a data dump in one of the supported data formats (see above). Instead of providing each file individually, such a data dump can be made available as a single file on the web, e.g. in a format such as JSON or as a ZIP archive. These dumps can be reloaded and their content reintegrated periodically, which allows you to rebuild the dump whenever your data changes (or when there is little load on your server). If you need a place to store your data dump, please get in contact with the NFDI4Culture Helpdesk.
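A minimal sketch of such a rebuild step, assuming your feed elements are individual Turtle files in a records/ folder and the dump is published as dump.zip (both names are hypothetical):

```python
import pathlib
import zipfile

# Collect the individual feed element files (hypothetical layout)
records = sorted(pathlib.Path("records").glob("*.ttl"))

# Rebuild the single data dump that is fetched periodically
with zipfile.ZipFile("dump.zip", "w", compression=zipfile.ZIP_DEFLATED) as dump:
    for record in records:
        dump.write(record, arcname=record.name)
```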
If your data is available via a SPARQL endpoint, you can implement so-called CONSTRUCT queries that map the entities and properties in your graph to the ones required by the NFDIcore/CTO ontology. Depending on the complexity of the query and the amount of data, however, the harvesting process may be simpler and less of a strain on your server if you run the query in a local environment and provide the Culture Knowledge Graph team with a link to the (periodically updated) data dump, as outlined above.
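The following sketch shows the general shape of such a mapping; all ex: terms are placeholders for your local vocabulary, and the constructed terms must be replaced with the class and property IRIs that NFDIcore/CTO actually requires.

```sparql
PREFIX ex:   <https://example.org/vocab/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
  ?object a ex:TargetClass ;      # placeholder for the required NFDIcore/CTO class
          rdfs:label ?title .
}
WHERE {
  ?object a ex:LocalRecord ;      # placeholder for your local class
          ex:title ?title .
}
```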
[Figure: CONSTRUCT query created by data provider]

As outlined above, the Culture Graph Interchange Format (CGIF) is a lightweight data exchange format based on schema.org. It includes an option to provide the data of an entire feed, including pagination. This allows for reliable and fast harvesting without straining your server: if your website has a list view of all feed elements, adding the required markup to this template may be a very simple way of adapting your research data for the Culture Knowledge Graph. Alternatively, CGIF/schema.org data may also be provided via dedicated APIs.
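At the feed level, such markup might look like the following hypothetical JSON-LD block on a list-view page; the IRIs are placeholders, and the exact pagination mechanism and required properties are defined in the CGIF specification.

```html
<!-- One page of a feed exposed as a schema.org DataFeed (placeholder IRIs) -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "DataFeed",
  "@id": "https://example.org/feed?page=1",
  "dataFeedElement": [
    {
      "@type": "DataFeedItem",
      "item": {
        "@type": "CreativeWork",
        "@id": "https://example.org/objects/123",
        "name": "Example object"
      }
    }
  ]
}
</script>
```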
Another starting point for harvesting your data can be a simple text file containing the URLs of the files to transform and ingest. This can be useful if, for example, you have individual feed element files but no API or ZIP archive to bind them together. The list of URLs may, in theory, resemble the Beacon format, but it should list all resources you want to be harvested rather than, for example, only links relating to a single authority file.
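Such a file could be as simple as the following sketch, with one harvestable resource per line (all addresses are placeholders):

```text
https://example.org/objects/123.xml
https://example.org/objects/124.xml
https://example.org/objects/125.xml
```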
If you have already passed your research data on to an aggregator, another harvesting option is to retrieve the aggregated data. An existing aggregator routine works with data stored in the Deutsche Digitale Bibliothek (DDB) and only requires knowledge of your provider identifier (ID) at the DDB. Harvesting these versions of your data, however, should only be your first choice if you are confident about their currency, completeness, and quality.
If your data is available via a custom REST API, you might find it worthwhile to look into (and adapt) one of the following routines that produce RDF data according to NFDIcore/CTO: