Multi-lingual use of Wikidata

For several weeks I have been experimenting with a Gaidhlig version of the Scotland's Lost Places bot, hoping to exploit the multi-lingual labels on Wikidata.

The results of the dummy run haven't been inspiring: too often the bot falls back to English words strung together in approximate Gaidhlig syntax.

A major problem is the sparse availability of Gaidhlig labels, both for the placenames and for the "terms of art" (taigh-tasgaigh / museum, etc.).
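To make the fallback problem concrete, here is a minimal sketch of how a mixed-language post can arise. It assumes each item's labels arrive as a mapping from language code to text; the function name `pick_label` and the sample data are hypothetical illustrations, not the bot's actual code.

```python
def pick_label(labels, preferred="gd", fallback="en"):
    """Return (text, language), using the preferred language when it exists."""
    if preferred in labels:
        return labels[preferred], preferred
    if fallback in labels:
        return labels[fallback], fallback
    return None, None


# Illustrative data only: the item type has a gd label, the place does not.
type_labels = {"en": "museum", "gd": "taigh-tasgaigh"}
place_labels = {"en": "Example Place"}

type_text, type_lang = pick_label(type_labels)
place_text, place_lang = pick_label(place_labels)

# True whenever any part of the post had to fall back to English.
mixed = {type_lang, place_lang} != {"gd"}
```

With sparse gd coverage, `mixed` is true for most items, which is the pattern the dummy run showed.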

Regarding the placenames, what is presented as the English label is often actually a Gaidhlig name; batches of these for HES sites were identified visually and then ported over as gd labels. Proper names (stations, football teams) are more problematic: there is usually no evidence for a Gaidhlig name, and I want to avoid creating a synthetic one, so these are probably best excluded.

An alternative approach is to be stricter within the Wikidata queries themselves, taking only items which have a Gaidhlig label and whose item type also has a Gaidhlig label. This substantially reduces the potential population for the bot's random selection but may be tidier. The size of the population will then depend on how many more gd labels are added to Wikidata. However, one of the positive features of the Scotland's Lost Places bot is that it makes evident what is lacking: publishing an item often triggers data improvements. Taking the stricter approach for a Gaidhlig version would effectively screen out the underdeveloped items, and with them the prompt to improve their data.
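The stricter query could be sketched roughly as below. This is an assumption about its shape, not the bot's real query: the helper `build_strict_query` is hypothetical, and the default `scope` pattern (items located, via a chain of P131, in Scotland, Q22) is only a stand-in for whatever selection the bot actually uses. The prefixes `wd:`, `wdt:` and `rdfs:` are the ones predefined on the Wikidata Query Service.

```python
def build_strict_query(lang="gd", scope="?item wdt:P131+ wd:Q22 .", limit=50):
    """Build a SPARQL query that keeps only items whose own label and
    whose type (P31) label both exist in the requested language."""
    return f"""
SELECT ?item ?itemLabel ?type ?typeLabel WHERE {{
  {scope}
  ?item wdt:P31 ?type .
  ?item rdfs:label ?itemLabel .
  FILTER(LANG(?itemLabel) = "{lang}")
  ?type rdfs:label ?typeLabel .
  FILTER(LANG(?typeLabel) = "{lang}")
}}
LIMIT {limit}
""".strip()


# Print the query text so it can be pasted into the query service.
query = build_strict_query()
print(query)
```

Matching `rdfs:label` directly, rather than using the label service, is deliberate here: the label service falls back to other languages or to the bare item ID when a label is missing, which is exactly the behaviour the stricter approach is trying to avoid.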


Author: admin

Mastodon account where these were first posted: link