2.2. OpenRefine

Project proposal context

OpenRefine is a free data wrangling tool that can be used to clean tabular data, reconcile data entities (i.e. identify matching entities across data services) and connect these with external knowledge bases. It is a community-supported open source project (licensed under the BSD license). OpenRefine is used by diverse communities, including librarians, researchers, data scientists and the NFDI4Culture community. It is also used in Task Areas 1 and 5 as part of the data enrichment and semantic infrastructure services offered by NFDI4Culture. In 2022, a small Flex Funds Tools grant from NFDI4Culture supported extending the connectivity between OpenRefine and the linked open data tool suite Wikibase, in particular by developing a reconciliation service that can work with media files as well as text-based data.

While extending OpenRefine’s capabilities, the OpenRefine team carried out user testing sessions and identified a number of improvements to the reconciliation process that can significantly benefit the overall user experience. These concern: 1) how users interact with the reconciliation dialog window in OpenRefine; 2) how the interface displays reconciliation results from different services, including Wikidata and Wikibase, as well as other standard terminology services such as the GND, Getty Vocabularies, VIAF and more; and 3) how users perform data enrichment on their own data via externally linked services. Work towards these improvements was supported by a renewed Flex Funds Tools grant in 2023. A complete overview of related issues that were completed within the scope of the grant, or that remain under discussion and continuous development, can be reviewed in this GitHub Project.

Deliverables

1) Redesign and redevelopment of the reconciliation service dialog interface

This deliverable has been completed to the stage of mockup designs for all parts of the dialog interface, iteratively refined through community discussions. The following interface design improvements were completed and first released as part of OpenRefine v3.8-beta1 (they are now also available in the stable version 3.8.0):

  • Improvements to the interface for selecting a reconciliation service, i.e. step one in the reconciliation service dialog (GitHub ticket: #6118).
  • Documentation of the reconciliation service is displayed in the reconciliation dialog (if available) (#5784).
  • The waiting screen displayed while guessing reconciliation types is internal to the reconciliation dialog (#4877).
  • The default types supplied by the reconciliation service are always offered to users (#4224).
  • The reconciliation types are displayed with both name and id (#5907).
  • Property selection in the reconciliation dialog gives better feedback to the user about whether a column is successfully mapped to a property or not (#6060, by @elebitzero).
  • Type selection is similarly improved (#6131, by @elebitzero).
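
Several of the items above surface information that a reconciliation service advertises about itself. As a rough illustration, the sketch below assumes a service manifest shaped like the one described in the Reconciliation Service API specification, with a `defaultTypes` list of `id`/`name` pairs and an optional `documentation` URL. The sample manifest, the helper names and the "name (id)" label format are hypothetical and are not taken from OpenRefine's code.

```python
import json

# Hypothetical manifest for illustration only; a real service returns its own
# values. The field names are assumed to follow the reconciliation API spec.
SAMPLE_MANIFEST = json.loads("""
{
  "name": "Example reconciliation service",
  "identifierSpace": "http://example.org/entity/",
  "schemaSpace": "http://example.org/schema/",
  "defaultTypes": [
    {"id": "Q5", "name": "human"},
    {"id": "Q43229", "name": "organization"}
  ],
  "documentation": "http://example.org/docs"
}
""")


def type_labels(manifest):
    """Build one label per default type, showing both its name and its id."""
    return [f"{t['name']} ({t['id']})" for t in manifest.get("defaultTypes", [])]


def documentation_url(manifest):
    """Return the documentation URL if the service provides one, else None."""
    return manifest.get("documentation")


if __name__ == "__main__":
    for label in type_labels(SAMPLE_MANIFEST):
        print(label)
    print(documentation_url(SAMPLE_MANIFEST))
```

A service that omits `defaultTypes` or `documentation` simply yields an empty list or `None` here, which mirrors the "if available" wording above.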

2) Improving reconciliation and enrichment workflow visualization

After several design iterations, the following improvements have been implemented and released as part of OpenRefine v3.8-beta1:

  • It is now possible to discover the source of a column obtained by fetching (i.e. enriching) data from a specific reconciliation service, by hovering over the column header (#5130).
  • After reconciliation, at most three reconciliation candidates are displayed per cell by default, with the option to see more (#6154).
  • A new operation to extract URLs for reconciled cells is available (#5960).
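
The last two items can be pictured with a small sketch. It assumes that candidates arrive as an ordered list and that an entity URL is built from a URL template containing an `{{id}}` placeholder, as in the `view` entry of a reconciliation service manifest. Both function names are hypothetical, and whether OpenRefine's own operation builds URLs in exactly this way is an assumption.

```python
def visible_candidates(candidates, limit=3):
    """Split candidates into those shown by default and a count of hidden ones."""
    shown = candidates[:limit]
    hidden = len(candidates) - len(shown)
    return shown, hidden


def entity_url(view_template, entity_id):
    """Fill the {{id}} placeholder of a URL template with an entity identifier."""
    return view_template.replace("{{id}}", entity_id)


if __name__ == "__main__":
    shown, hidden = visible_candidates(["Q1", "Q2", "Q3", "Q4", "Q5"])
    print(shown, f"(+{hidden} more)")
    print(entity_url("https://example.org/entity/{{id}}", "Q42"))
```

The `limit` parameter defaults to three only to match the behaviour described above; it is not a documented OpenRefine setting.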

3) Improving error messaging

A large part of the development work also focused on providing more informative and actionable error messages during the various stages of the reconciliation and enrichment workflows. Previously, such errors remained invisible to users, which made reconciliation a much slower and less efficient process. The following improvements have been implemented and released as part of OpenRefine v3.8-beta1:

  • Errors encountered by the reconciliation operation are displayed in the corresponding item cells and are available via the cell.recon.error GREL expression (#3194).
  • Those errors can also be isolated via facets (#6232).
  • The "Search for match" option is present in cells with reconciliation errors so that they can be fixed manually (#6192).
  • The error messages generated during reconciliation are more helpful (#6111).

Future developments related to the ongoing improvements of the reconciliation dialog design (Deliverable 1) and the enrichment visualizations (Deliverable 2) can be tracked via the GitHub Project. In particular, the inclusion of service logos in column headers is almost fully implemented (#6156) and can be expected to be part of the next release (3.9).

Lozana Rossenova, Antonin Delpeuch