Thursday, May 26, 2016

Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library

Given that Wikipedia, Wikidata, and the Biodiversity Heritage Library (BHL) all share the goal of making information free, open, and accessible, there seems to be a lot of potential for useful collaboration. Below I sketch out some ideas.

BHL as a source of references for Wikipedia

Wikipedia likes to have sources cited to support claims in its articles. BHL has a lot of articles that could be cited by Wikipedia articles. By adding these links, Wikipedia users get access to further details on the topic of interest. BHL also benefits from greater visibility resulting from visits from Wikipedia readers.

In the short term, BHL could search Wikipedia for articles that could benefit from links to BHL (see below). In the long term, as more and more BHL articles get DOIs, this will become redundant, because Wikipedia authors will discover articles via CrossRef.

There are various ways to search Wikipedia to get a sense of what links could be added. For example, you can search the Wikipedia API for pages that link to a particular web domain (see https://www.mediawiki.org/wiki/API:Lists/All#Exturlusage). Here's a search for articles linking to biostor.org: https://en.wikipedia.org/w/api.php?action=query&list=exturlusage&euquery=biostor.org&eulimit=20

A quick inspection suggests that many of these links could be improved (for example, some point to outdated PDF URLs rather than to the article itself), so this kind of search is a way to locate Wikipedia articles that could be edited. It is also likely that Wikipedia articles with one link to BHL or BioStor have other citations that could be linked.
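As a rough sketch of how that search could be scripted, the Python below builds the same exturlusage query, groups the returned links by Wikipedia page, and flags links that look like direct PDF links. The function names, the ".pdf" heuristic, and the User-Agent string are assumptions made for illustration; only the API parameters come from the URL above.

```python
import json
from urllib.parse import urlencode, urlparse
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_exturlusage_url(domain, limit=20):
    """Build the exturlusage query URL for pages linking to `domain`."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": domain,
        "eulimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_json(url):
    """Fetch and decode a JSON response (needs network access)."""
    # The User-Agent value is a placeholder, not a registered tool name.
    request = Request(url, headers={"User-Agent": "bhl-link-check-sketch/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def links_by_page(response):
    """Group external URLs by the title of the page that contains them."""
    pages = {}
    for hit in response.get("query", {}).get("exturlusage", []):
        pages.setdefault(hit["title"], []).append(hit["url"])
    return pages


def looks_like_pdf(url):
    """Heuristic (an assumption): treat URLs whose path ends in .pdf as PDF links."""
    return urlparse(url).path.lower().endswith(".pdf")


if __name__ == "__main__":
    data = fetch_json(build_exturlusage_url("biostor.org"))
    for title, urls in links_by_page(data).items():
        flagged = [u for u in urls if looks_like_pdf(u)]
        print(title, len(urls), "links,", len(flagged), "look like PDFs")
```

This only reads the first batch of results; a fuller version would follow the continuation values that the API returns.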

Wikipedia as a source of content

One of the big challenges facing BHL is extracting articles from its content. My own BioStor is one approach to tackling this problem. BioStor takes citation details for articles and attempts to locate them in BHL - the limiting factor is access to good-quality citation data. Wikipedia is potentially an untapped source of citation data. Each page that uses the "Cite" template could be mined for citations, which in turn could be used to locate articles. Wikipedia pages using the Cite template can be found via the API, e.g. https://en.wikipedia.org/w/api.php?action=query&list=embeddedin&eititle=Template:Cite&eilimit=20&format=json. Alternatively, we could mine particular types of pages (e.g., those on taxa or taxonomists), or mine Wikispecies (which doesn't use the same citation formatting as Wikipedia).
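To make the mining idea a little more concrete, here is a minimal sketch in Python. It builds the embeddedin query shown above and pulls field/value pairs out of citation templates in a page's wikitext. It assumes citations are written in the "{{cite journal | ... }}" form and uses a deliberately naive regular expression that ignores nested templates and wikilinks containing "|"; the function names and the sample citation are invented for illustration.

```python
import re
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

# Naive pattern: matches {{cite journal |...}} with no nested braces inside.
CITE_JOURNAL = re.compile(r"\{\{\s*cite\s+journal\s*\|([^{}]*)\}\}", re.IGNORECASE)


def build_embeddedin_url(template="Template:Cite", limit=20):
    """Build the query URL listing pages that embed the given template."""
    params = {
        "action": "query",
        "list": "embeddedin",
        "eititle": template,
        "eilimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_journal_citations(wikitext):
    """Return one dict of field -> value per cite journal template found."""
    citations = []
    for match in CITE_JOURNAL.finditer(wikitext):
        fields = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip().lower()] = value.strip()
        citations.append(fields)
    return citations


if __name__ == "__main__":
    # Made-up wikitext, not taken from a real Wikipedia page.
    sample = (
        "A species described in 1901.<ref>{{cite journal |last=Smith |first=J. "
        "|year=1901 |title=A new frog |journal=Example Journal "
        "|volume=7 |pages=1-5}}</ref>"
    )
    for citation in extract_journal_citations(sample):
        print(citation)
```

The extracted journal, volume, and page values are the sort of citation details that BioStor would then need in order to try to locate the article in BHL.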

Wikidata as a data store

If Wikidata aims to be a repository of all structured data relevant to Wikipedia, then this includes bibliographic citations (see WikiCite 2016), hence many articles in BHL will end up in Wikidata. This has some interesting implications, because Wikidata can model data with more fidelity than many other sources of bibliographic information. For example, it supports multiple languages as well as multiple representations of the same language - the journal Acta Herpetologica Sinica https://www.wikidata.org/wiki/Q24159308 in Wikidata has not only the Chinese title (兩棲爬行動物學報) but the pinyin transliteration "Liangqi baxing dongwu yanjiu". Rather than BHL attempting to replicate a community-editable database, Wikidata could be the place to manage article and journal-related metadata.
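To illustrate the multilingual modelling, the sketch below collects every label and alias from a Wikidata entity record, keyed by language code. It relies on the "labels" and "aliases" layout of Wikidata's entity JSON; the sample record is hand-written for illustration, and the language codes it uses are an assumption rather than a copy of the live item.

```python
def names_by_language(entity):
    """Collect label and alias strings from an entity record, per language code."""
    names = {}
    # Each language has at most one label: {"language": ..., "value": ...}
    for lang, label in entity.get("labels", {}).items():
        names.setdefault(lang, []).append(label["value"])
    # Each language may have a list of aliases with the same shape.
    for lang, aliases in entity.get("aliases", {}).items():
        for alias in aliases:
            names.setdefault(lang, []).append(alias["value"])
    return names


if __name__ == "__main__":
    # Hand-written sample; language codes here are illustrative assumptions.
    sample_entity = {
        "labels": {
            "en": {"language": "en", "value": "Acta Herpetologica Sinica"},
            "zh": {"language": "zh", "value": "兩棲爬行動物學報"},
        },
        "aliases": {
            "zh": [{"language": "zh", "value": "Liangqi baxing dongwu yanjiu"}],
        },
    }
    for lang, values in names_by_language(sample_entity).items():
        print(lang, values)
```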

Disambiguating people

As we move from "strings to things" we need to associate names for things with identifiers for those things. I've touched on this already in "Possible project: mapping authors to Wikipedia entries using lists of published works". Ideally each author in BHL would be associated with a globally unique identifier, such as ORCID or ISNI. Contributors to Wikipedia and Wikidata have been collecting these for individuals with Wikipedia articles. If those Wikipedia pages have links to BHL content then we can semi-automate the process of linking people to identifiers.
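One piece of that semi-automation could look like the following sketch, which reads ORCID and ISNI values out of a Wikidata entity record. In Wikidata these identifiers are stored as statements on properties P496 (ORCID iD) and P213 (ISNI). The function name and the sample record are hypothetical, and the ORCID shown is the well-known example identifier from ORCID's own documentation, not a real BHL author.

```python
# Wikidata properties for the two identifiers mentioned in the text.
IDENTIFIER_PROPERTIES = {"P496": "ORCID", "P213": "ISNI"}


def person_identifiers(entity):
    """Return {"ORCID": [...], "ISNI": [...]} for identifiers present on the entity."""
    found = {}
    claims = entity.get("claims", {})
    for prop, name in IDENTIFIER_PROPERTIES.items():
        values = []
        for statement in claims.get(prop, []):
            # Statements with "no value" or "unknown value" have no datavalue.
            datavalue = statement.get("mainsnak", {}).get("datavalue")
            if datavalue is not None:
                values.append(datavalue["value"])
        if values:
            found[name] = values
    return found


if __name__ == "__main__":
    # Hypothetical record in the shape of Wikidata's entity JSON.
    sample_person = {
        "claims": {
            "P496": [
                {"mainsnak": {"datavalue": {"value": "0000-0002-1825-0097", "type": "string"}}}
            ]
        }
    }
    print(person_identifiers(sample_person))
```

Matching the person behind a Wikipedia page to a BHL author name would still need a human check, which is why this is only semi-automated.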

Caveats

There are a couple of potential "gotchas" concerning Wikipedia and BHL. The licenses used for content are different: BHL is typically CC-BY-NC, whereas Wikipedia is CC-BY-SA. The "non commercial" restriction used by BHL is a deal-breaker for sharing content such as page images with Wikimedia Commons.

Wikipedia and Wikidata are communities, and I've often found this makes it challenging to find out how to get things done. Who do you contact to make a decision about some new feature you'd like to add? It's not at all obvious (unless you're a part of that community). Existing communities with accepted practices can be resistant to change, or may not be convinced that what you'd like to do is a benefit. For example, I think it would be great to have a Wikipedia page for each journal. Not everyone agrees with this, and one can expend a lot of energy debating the pros and cons. The last time I got seriously engaged with Wikipedia I ended up getting so frustrated I went off in a huff and built my own wiki. This is where a "Wikipedian in residence" might be helpful.