1. Introduction and workflow

Many contributors in the domain of NFDI4Culture provide websites or online databases, but dedicated REST APIs or SPARQL endpoints are still rare. For the federated acquisition of data for the Culture Knowledge Graph, we propose an easy-to-use, lightweight interchange format based on schema.org for harvesting resources from data collections with key attributes (IRIs, names, dates and terms from controlled vocabularies). The Culture Graph Interchange Format (CGIF) has the added benefit of automatically making the data eligible for Google Dataset Search and of significantly improving the findability of websites and datasets through search engine optimization.

Benefits

  • The CGIF is embeddable in web frontends of existing online databases (either as JSON-LD or RDFa Lite). Such frontends often contain all necessary data for harvesting, including permalinks/IRIs, dates, and values from authority files. Providers only need to adapt their HTML output; there is no need to implement dedicated APIs or REST endpoints.
  • CGIF resources can also be provided in (generated) text files in any RDF serialisation (JSON-LD, Turtle, N-Triples).
  • Contributors may provide full collections, partial collections (even search results, if this is the only form of list view an online database offers) or single resources.
  • Partners can use the existing schema.org validator to check their embedded CGIF before submission to NFDI4Culture.
  • There is no need for complex transformations or mappings of incoming data to the Culture Knowledge Graph if partners use Culture IRIs to indicate their contributions. CGIF is compatible with the data structure of the Culture Knowledge Graph, as described by the NFDI4Culture Ontology (CTO).
  • Implementing the CGIF as embedded metadata improves the search engine optimization (SEO) of online databases and thus their findability and interoperability. By following the CGIF specification, partners also have the option to include their databases in Google Dataset Search. Google provides detailed instructions on how to further improve the display of results.
  • CGIF also makes it possible to harvest resources from Wikibase instances (through an extension such as WikiSEO) and from any other web-based system that generates linked open data (LOD).
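
To illustrate the embedding described in the first bullet, here is a minimal sketch of how a provider might generate a schema:DataFeed as JSON-LD and wrap it in a script element for their HTML output. It uses only schema.org terms (DataFeed, DataFeedItem, dataFeedElement, item, dateModified, name). The helper names build_data_feed and to_script_tag, the example.org IRIs and the item type CreativeWork are assumptions made for illustration; the properties that CGIF actually requires are defined by the specification, not by this sketch.

```python
import json


def build_data_feed(name, date_modified, items):
    """Return a schema.org DataFeed as a dict ready for JSON-LD output."""
    return {
        "@context": "https://schema.org",
        "@type": "DataFeed",
        "name": name,
        "dateModified": date_modified,
        "dataFeedElement": [
            {
                "@type": "DataFeedItem",
                "item": {
                    "@id": item["iri"],      # permalink/IRI of the resource
                    "@type": item["type"],   # a schema.org type (assumed here)
                    "name": item["name"],
                },
            }
            for item in items
        ],
    }


def to_script_tag(feed):
    """Wrap the feed in the script element used for embedded JSON-LD."""
    payload = json.dumps(feed, indent=2, ensure_ascii=False)
    # Escape "</" so a value can never close the script element early;
    # "\/" is a valid JSON escape, so the data itself is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical example data; replace with values from your own database.
feed = build_data_feed(
    name="Example collection",
    date_modified="2024-01-15",
    items=[
        {
            "iri": "https://example.org/object/1",
            "type": "CreativeWork",
            "name": "Example object",
        }
    ],
)
html_snippet = to_script_tag(feed)
print(html_snippet)
```

The printed snippet can be placed in the head of a list or detail page. Because the feed is built from the same records the frontend already renders, no separate API is involved.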

Workflow

  1. You notify the helpdesk about your data feed by submitting a "General request" via the NFDI4Culture portal. The team will then let you know about any further identifiers you need.
  2. You implement a CGIF schema:DataFeed according to this specification, either in the frontend of a web database or as dedicated files.
  3. You validate the schema:DataFeed via the schema.org validator.
  4. You notify us about your entry-point URL or file dump.
  5. Our CGIF crawler validates your schema:DataFeed. On first submission, a valid feed is always harvested in full.
  6. The NFDI4Culture portal stores metadata on each harvesting cycle. It also provides a scheduler for harvesting embedded metadata and checks the schema:dateModified property to determine whether a schema:DataFeed needs to be ingested again.
  7. All successfully harvested items of a feed are fetched and ingested into the Culture Knowledge Graph after being expanded in accordance with the NFDI4Culture Ontology (CTO).
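
Steps 5 and 6 can be sketched roughly as follows. This is not the actual NFDI4Culture crawler, only an illustration of the idea: extract an embedded JSON-LD schema:DataFeed from a page and compare its schema:dateModified with the time of the last harvest. The names JsonLdExtractor, find_data_feed and needs_harvest are hypothetical, and the choice to harvest again when dateModified is missing is an assumption, not documented behaviour.

```python
import json
from datetime import datetime, timezone
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def find_data_feed(html):
    """Return the first embedded JSON-LD object typed as DataFeed, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for doc in parser.documents:
        if isinstance(doc, dict) and doc.get("@type") == "DataFeed":
            return doc
    return None


def parse_date(value):
    """Parse an ISO 8601 date or date-time; naive values are treated as UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def needs_harvest(feed, last_harvest):
    """Decide whether a feed should be ingested (again)."""
    if last_harvest is None:
        return True  # first submission: the feed is harvested in full
    modified = feed.get("dateModified")
    if modified is None:
        return True  # assumption: without a date, harvest to be safe
    return parse_date(modified) > last_harvest


# Hypothetical page containing an embedded feed.
page = """<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "DataFeed", "dateModified": "2024-03-01"}
</script></head><body></body></html>"""

feed = find_data_feed(page)
last_harvest = datetime(2024, 4, 1, tzinfo=timezone.utc)
print(needs_harvest(feed, None))          # first submission
print(needs_harvest(feed, last_harvest))  # feed unchanged since last harvest
```

For providers, the practical consequence is that schema:dateModified should be updated whenever the feed's content changes; otherwise a scheduler working along these lines would have no signal that a new ingest is needed.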