wiki reconciliation for OpenRefine

This web service can be used to align datasets to wiki items in OpenRefine.

Use the following URL in OpenRefine: https://wiki.kul.pl/reconcile/en/api.

Use the following URL to access the manifest.json: https://wiki.kul.pl/reconcile/static/manifest.json.

Replacing "en" with another language code displays items and properties in that language, when they are available.
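
As a minimal sketch of the substitution described above, the endpoint for a given language can be derived from the URL pattern. The helper name `endpoint_url` is hypothetical and not part of the service.

```python
BASE_URL = "https://wiki.kul.pl/reconcile"


def endpoint_url(lang="en"):
    """Return the reconciliation endpoint URL for a language code.

    Hypothetical helper: it only substitutes the language segment of the
    URL given above and does not check that the language is available.
    """
    if not lang or not lang.replace("-", "").isalnum():
        raise ValueError(f"not a plausible language code: {lang!r}")
    return f"{BASE_URL}/{lang}/api"


print(endpoint_url("fr"))
```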

This interface works with OpenRefine from 2.6 rc2 onwards. It is not compatible with Google Refine.

Documentation

Reconciling via unique identifiers

If a unique identifier is supplied in an additional property, then reconciliation candidates will first be searched for using the unique identifier values. If no item matches the unique identifier supplied, then the reconciliation service falls back on regular search.

If you only have unique identifiers and no names for the entities you want to reconcile, you can therefore supply a fake column of names (for instance, using a random value that yields no match when searching for it in Wikidata).
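
The following sketch shows what such a query might look like when built outside OpenRefine, assuming the service accepts the standard reconciliation API query format (a `queries` form field holding JSON, with `pid`/`v` pairs under `properties`). The property `P214` (VIAF ID on Wikidata) and the value `12345` are placeholders chosen for illustration, and `build_query` is a hypothetical helper.

```python
import json
import uuid


def build_query(identifier, id_property, name=None):
    """Build one reconciliation query that carries a unique identifier.

    When no name is known, a random placeholder is used so that the
    name search itself is unlikely to match anything.
    """
    if name is None:
        name = "no-match-" + uuid.uuid4().hex
    return {
        "query": name,
        "properties": [{"pid": id_property, "v": identifier}],
    }


# Placeholder identifier property and value (assumptions, see above).
queries = {"q0": build_query("12345", "P214")}

# Form payload in the shape the standard reconciliation API expects.
payload = {"queries": json.dumps(queries)}
print(payload["queries"])
```

Within OpenRefine itself, the same effect is obtained by adding the identifier column as an additional property in the reconciliation dialog.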

Property paths

Sometimes, the relation between the reconciled item and the disambiguating column is not direct: it is not represented as a property itself. Let us consider this dataset of cities:

dataset with two columns, one with city names and the other with country codes of the countries where the cities are located

To fetch the country code from an item representing a city, you need to follow two properties. First, follow country (P17) to get to the item for the country in which this city is located, then follow ISO 3166-1 alpha-2 code (P297) to get the two-letter code string.

graph with three nodes: first, the item Cambridge, linked to the middle node United Kingdom via P17, which is finally linked to the third node 'GB' via P297

This is supported by the reconciliation interface, with a syntax inspired by SPARQL property paths: just type the sequence of property identifiers separated by slashes, such as P17/P297:

screenshot of the reconciliation dialog with a property path

This additional information makes it possible to distinguish between namesakes, to some extent. As "Cambridge, US" is still ambiguous, there are multiple items with a perfect matching score, but "Oxford, GB" successfully disambiguates one particular city from its namesakes:

Screenshot of reconciled project state after use of SPARQL property paths
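
For readers scripting against the endpoint, the city example can be sketched as a batch of queries. This assumes that the path typed into the dialog is sent as the `pid` of an additional property in the standard reconciliation query format; `city_queries` is a hypothetical helper, and the two rows are the ones mentioned above.

```python
import json

# (city name, country code) rows taken from the example above.
rows = [("Cambridge", "US"), ("Oxford", "GB")]


def city_queries(rows, path="P17/P297"):
    """Build one query per row, using the property path as the pid."""
    return {
        f"q{i}": {
            "query": name,
            "properties": [{"pid": path, "v": code}],
        }
        for i, (name, code) in enumerate(rows)
    }


batch = city_queries(rows)
print(json.dumps(batch, indent=2))
```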

The endpoint currently supports two property combinators: /, to concatenate two paths as above, and |, to compute the union of the values yielded by two paths. Concatenation / has precedence over disjunction |. The dot character . can be used to denote the empty path. For instance, the following property paths are equivalent:

They fetch the country (P17) of an item or that of its parent organization (P749).
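
To make the semantics of the combinators concrete, here is a small evaluator over a toy graph. It is a sketch under stated assumptions, not the service's implementation: the graph data is invented, parentheses are assumed to group sub-paths, and the two paths compared at the end (`P17|P749/P17` and `(.|P749)/P17`) are one pair that matches the description above.

```python
import re

# Toy data (invented): (item, property) -> set of values.
GRAPH = {
    ("Cambridge", "P17"): {"United Kingdom"},
    ("United Kingdom", "P297"): {"GB"},
    ("Subsidiary", "P17"): {"Germany"},
    ("Subsidiary", "P749"): {"Parent"},
    ("Parent", "P17"): {"France"},
}

TOKEN = re.compile(r"\s*(P\d+|[/|().])")


def tokenize(path):
    path, pos, tokens = path.strip(), 0, []
    while pos < len(path):
        match = TOKEN.match(path, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(path):
    """Parse a path into a tree; '/' binds tighter than '|'."""
    tokens, pos = tokenize(path), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of path")
        pos += 1
        return tokens[pos - 1]

    def union():
        node = concat()
        while peek() == "|":
            take()
            node = ("union", node, concat())
        return node

    def concat():
        node = atom()
        while peek() == "/":
            take()
            node = ("concat", node, atom())
        return node

    def atom():
        tok = take()
        if tok == "(":
            node = union()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if tok == ".":
            return ("empty",)
        if tok.startswith("P"):
            return ("prop", tok)
        raise ValueError(f"unexpected token {tok!r}")

    tree = union()
    if pos != len(tokens):
        raise ValueError("trailing tokens in path")
    return tree


def evaluate(node, items, graph):
    """Return the set of values reached from `items` by following `node`."""
    kind = node[0]
    if kind == "empty":  # the empty path stays on the same items
        return set(items)
    if kind == "prop":
        found = set()
        for item in items:
            found |= graph.get((item, node[1]), set())
        return found
    if kind == "concat":  # follow the left path, then the right one
        return evaluate(node[2], evaluate(node[1], items, graph), graph)
    return evaluate(node[1], items, graph) | evaluate(node[2], items, graph)


def fetch(path, item, graph=GRAPH):
    return evaluate(parse(path), {item}, graph)


print(fetch("P17/P297", "Cambridge"))
print(fetch("P17|P749/P17", "Subsidiary") == fetch("(.|P749)/P17", "Subsidiary"))
```

Because concatenation binds tighter, `P17|P749/P17` is read as `P17|(P749/P17)`, which is why both paths yield the country of the item together with that of its parent organization.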

Special properties

Labels, descriptions, aliases and sitelinks can be accessed as follows (L for label, D for description, A for aliases, S for sitelink):

The lowercase letters are Wikimedia language codes that select the language in which the terms are fetched. No language fallback is performed when retrieving the values.
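
As an illustration of this naming scheme, the sketch below splits a special property into its kind and its lowercase suffix. It assumes the uppercase letter is immediately followed by the lowercase code (for example `Len` for the English label); the exact form accepted by the service, in particular for sitelinks, is not confirmed here, and `parse_special` is a hypothetical helper.

```python
import re

KINDS = {"L": "label", "D": "description", "A": "aliases", "S": "sitelink"}

# One uppercase kind letter followed by a lowercase code (assumed form).
SPECIAL = re.compile(r"^([LDAS])([a-z][a-z0-9_-]*)$")


def parse_special(prop):
    """Return (kind, code) for a special property, or None otherwise."""
    match = SPECIAL.match(prop)
    if match is None:
        return None
    return KINDS[match.group(1)], match.group(2)


print(parse_special("Len"))
print(parse_special("P17"))
```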

Subfields

By default, the values supplied in OpenRefine and the ones present in Wikidata are compared by string fuzzy-matching. There are some exceptions to this:

Sometimes, we need a more specific matching on sub-parts of these values. It is possible to select these parts for matching by appending a modifier at the end of the property path:

For instance, suppose you want to refine people by their birth dates, but you only have the month and day. First, split the birthday dates into two columns, one for the month and one for the day.
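
The splitting step can be sketched outside OpenRefine as follows. The input format `MM-DD` is an assumption made for the example, and `split_birthday` is a hypothetical helper; inside OpenRefine the same result would be obtained with a column-splitting operation.

```python
def split_birthday(value):
    """Split an 'MM-DD' string (assumed format) into (month, day) integers."""
    month_text, day_text = value.strip().split("-")
    month, day = int(month_text), int(day_text)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"not a valid month and day: {value!r}")
    return month, day


birthdays = ["03-14", "11-02"]
columns = [split_birthday(b) for b in birthdays]
months = [month for month, _ in columns]
days = [day for _, day in columns]
print(months, days)
```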