This web service can be used to align datasets to wiki items in OpenRefine.
Use the following URL in OpenRefine: https://wiki.kul.pl/reconcile/en/api.
Use the following URL to access the manifest.json: https://wiki.kul.pl/reconcile/static/manifest.json.
Replacing "en" with another language code displays items and properties in that language, when they are available.
This interface works with OpenRefine from 2.6 rc2 onwards. It is not compatible with Google Refine.
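As a small illustration of the language substitution described above, the endpoint URL can be built from a language code. This is only a sketch: the helper name endpoint_url is hypothetical, and the base URL is taken from the addresses given above.

```python
# Base address of the service, as given in the URLs above.
BASE_URL = "https://wiki.kul.pl/reconcile"


def endpoint_url(lang: str = "en") -> str:
    """Return the reconciliation endpoint URL for a language code.

    The helper name is illustrative; only the URL pattern comes from the page.
    """
    return f"{BASE_URL}/{lang}/api"


# URL to paste into OpenRefine for Polish labels, when they are available.
print(endpoint_url("pl"))
```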
If a unique identifier is supplied in an additional property, then reconciliation candidates will first be searched for using the unique identifier values. If no item matches the unique identifier supplied, then the reconciliation service falls back on regular search.
If you only have unique identifiers and no names for the entities you want to reconcile, you can therefore supply a fake column of names (for instance using a random value that yields no match when searching for it in Wikidata).
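OpenRefine builds the reconciliation queries for you, but it can help to see roughly what such a query looks like. The sketch below assumes the standard reconciliation API query shape (a "query" string plus a "properties" list of "pid"/"v" pairs). The helper name, the property P214 and the identifier value are placeholders, not values confirmed for this service.

```python
import json
import uuid
from typing import Optional


def build_identifier_query(identifier_pid: str, identifier: str,
                           name: Optional[str] = None) -> dict:
    """Build one reconciliation query carrying a unique identifier.

    If no name is known, a random string is used as the fake name, on the
    assumption that it matches nothing in a regular search.
    """
    query_text = name if name else uuid.uuid4().hex
    return {
        "query": query_text,
        "properties": [{"pid": identifier_pid, "v": identifier}],
    }


# "P214" and "12345" are placeholder values for illustration only.
queries = {"q0": build_identifier_query("P214", "12345")}
payload = {"queries": json.dumps(queries)}
print(payload["queries"])
```

The payload is only constructed here, not sent, so the sketch makes no claim about how the service responds.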
Sometimes, the relation between the reconciled item and the disambiguating column is not direct: it is not represented as a property itself. Let us consider this dataset of cities:
To fetch the country code from an item representing a city, you need to follow two properties. First, follow country (P17) to get to the item for the country in which this city is located, then follow ISO 3166-1 alpha-2 code (P297) to get the two-letter code string.
This is supported by the reconciliation interface, with a syntax inspired by SPARQL property paths: just type the sequence of property identifiers separated by slashes, such as P17/P297.
This additional information makes it possible to distinguish between namesakes, to some extent. As "Cambridge, US" is still ambiguous, there are multiple items with a perfect matching score, but "Oxford, GB" successfully disambiguates one particular city from its namesakes.
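For readers who want to see the property path in a query, here is a minimal sketch. It assumes the standard reconciliation API query shape and uses only the two rows mentioned above; the function name city_query is hypothetical.

```python
import json

# The two example rows mentioned in the text: city name and country code.
rows = [("Cambridge", "US"), ("Oxford", "GB")]


def city_query(name: str, country_code: str) -> dict:
    """Build a query matching the country code through the path P17/P297."""
    return {
        "query": name,
        "properties": [{"pid": "P17/P297", "v": country_code}],
    }


# One query per row, keyed q0, q1, ... as in a reconciliation batch.
batch = {f"q{i}": city_query(name, code) for i, (name, code) in enumerate(rows)}
print(json.dumps(batch, indent=2))
```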
The endpoint currently supports two property combinators: /, to concatenate two paths as above, and |, to compute the union of the values yielded by two paths. Concatenation / has precedence over disjunction |. The dot character . can be used to denote the empty path. For instance, the following property paths are equivalent:
P17|P749/P17
P17|(P749/P17)
(.|P749)/P17
They fetch the country (P17) of an item or that of its parent organization (P749).
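The precedence rule and the equivalence of the three paths can be checked with a toy evaluator. This is a sketch of the semantics as described above, not the service's own implementation; the graph and its item identifiers are invented for the example.

```python
import re

TOKEN = re.compile(r"P\d+|[./|()]")


def parse(path):
    """Parse a property path into a tree; '/' binds tighter than '|'."""
    tokens = TOKEN.findall(path)
    if "".join(tokens) != path:
        raise ValueError(f"unexpected characters in {path!r}")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def union():
        node = concat()
        while peek() == "|":
            take()
            node = ("union", node, concat())
        return node

    def concat():
        node = atom()
        while peek() == "/":
            take()
            node = ("concat", node, atom())
        return node

    def atom():
        tok = take() if peek() is not None else None
        if tok == "(":
            node = union()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        if tok == ".":
            return ("empty",)
        if tok is not None and tok.startswith("P"):
            return ("prop", tok)
        raise ValueError(f"unexpected token {tok!r}")

    tree = union()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree


def evaluate(node, item, graph):
    """Return the set of values reached from item by following the path."""
    kind = node[0]
    if kind == "empty":
        return {item}
    if kind == "prop":
        return set(graph.get(item, {}).get(node[1], []))
    if kind == "union":
        return evaluate(node[1], item, graph) | evaluate(node[2], item, graph)
    # Concatenation: follow the left path, then the right path from each result.
    result = set()
    for middle in evaluate(node[1], item, graph):
        result |= evaluate(node[2], middle, graph)
    return result


# Invented toy data: a subsidiary with a parent organization located in a country.
graph = {
    "subsidiary": {"P749": ["parent"]},
    "parent": {"P17": ["country"]},
}
paths = ["P17|P749/P17", "P17|(P749/P17)", "(.|P749)/P17"]
for p in paths:
    print(p, evaluate(parse(p), "subsidiary", graph))
```

All three paths reach the same country from the subsidiary, and the same country again from the parent itself.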
Labels, aliases, descriptions and sitelinks can be accessed as follows (L for label, D for description, A for alias, S for sitelink):
Len for Label in English
Dfi for Description in Finnish
Apt for Alias in Portuguese
Sdewiki for Sitelink in German Wikipedia page titles. For an overview of all sitelink ids of Wikidata, see: MediaWiki API result.
The lowercase letters are Wikimedia language codes, which select the language in which the terms will be fetched. No language fallback is performed when retrieving the values.
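The selector syntax above is an uppercase letter followed by a lowercase code. The following sketch splits a selector into its two parts; the function name and the validation rules are assumptions made for illustration, and the service may accept or reject inputs differently.

```python
# Mapping from the leading uppercase letter to the kind of term.
TERM_KINDS = {"L": "label", "D": "description", "A": "alias", "S": "sitelink"}


def parse_term_selector(selector: str):
    """Split a selector such as 'Len' into (kind, code)."""
    kind = TERM_KINDS.get(selector[:1])
    code = selector[1:]
    # The code part must be non-empty and lowercase, e.g. 'en' or 'dewiki'.
    if kind is None or not code or code != code.lower():
        raise ValueError(f"not a term selector: {selector!r}")
    return kind, code


for example in ["Len", "Dfi", "Apt", "Sdewiki"]:
    print(example, parse_term_selector(example))
```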
By default, the values supplied in OpenRefine and the ones present in Wikidata are compared by string fuzzy-matching. There are some exceptions to this:
Sometimes, we need a more specific matching on sub-parts of these values. It is possible to select these parts for matching by appending a modifier at the end of the property path:
@lat and @lng: latitude and longitude of geographical coordinates (float)
@year, @month, @day: parts of a time value (int). They are returned only if the precision of the Wikidata value is good enough to define them.
@isodate returns a date in the ISO format 1987-08-23 (string). A value is always returned.
@iso returns the date and time in the ISO format 1996-03-17T04:15:00+00:00. A value is always returned. For times and dates, all values are returned in the UTC time zone.
@urlscheme ("https"), @netloc ("www.wikidata.org") and @urlpath ("/wiki/Q42") can be used to perform exact matching on parts of URLs.
For instance, suppose you want to refine people by their birth dates, but you only have the month and day. First, split the birthday dates into two columns, for month and day.
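The split into month and day can be sketched as follows. This assumes that birthdays are written as "MM-DD" and that the date of birth property is P569, as in Wikidata; neither is stated above, so adjust both to your data and to the property identifiers of the wiki you reconcile against. The function name is hypothetical.

```python
def birthday_properties(birthday: str, pid: str = "P569") -> list:
    """Turn an 'MM-DD' birthday into two property constraints.

    The pid default (P569, date of birth in Wikidata) is an assumption.
    """
    month, day = (int(part) for part in birthday.split("-"))
    return [
        {"pid": f"{pid}@month", "v": month},
        {"pid": f"{pid}@day", "v": day},
    ]


# One constraint per column: the month and the day of the birth date.
print(birthday_properties("08-23"))
```

Each constraint corresponds to one of the two columns, matched against the @month and @day parts of the date value.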