Monday, June 02, 2014

BioNames one year on

It is almost a year to the day that I released BioNames, a database of "taxa, texts, and trees". This project was my entry in EOL's Computable Data Challenge. Since it went live (after much late night programming by myself and Ryan Schenk) I've been tweaking the interface, cleaning (so much cleaning), and adding data (mostly DOIs, links to BioStor, and PDFs). I also wrote a paper describing the project, published in PeerJ (http://dx.doi.org/10.7717/peerj.190).

Why BioNames?


I'm building BioNames to scratch a very specific itch. To me it is a source of enormous frustration that one of the most basic questions we can ask about a name (where was it first published?) is difficult to answer using current taxonomic databases. And if there is an answer, it is usually given as a text string describing the publication (i.e., a literature citation) rather than an identifier such as a DOI that enables me to (a) go to the publication, (b) refer to the publication in a database in an unambiguous way, and (c) discover further information about that publication by querying services that recognise that identifier.

There are enormous digitisation efforts underway by commercial publishers, digital archives, and libraries, and all of this is putting more and more literature online. This is the primary evidence base for taxonomy: it is where new names are published, taxa are described, and hypotheses of synonymy and relationship are proposed, and we should be actively linking to it. Of course, there are some projects that do this, but these are typically restricted in taxonomic or geographic scope. I want all this information together in one place. Hence, BioNames.

Of course, I could wait until projects like ZooBank have all the animal names, but as I pointed out in Why the ICZN is in trouble, the ICZN and ZooBank have only a tiny fraction of the published names:
[Figure: ICZN]
This renders ZooBank barely usable for my purposes. There are millions of animal names in circulation, and our inability to discover much about them leads to all sorts of headaches, such as the errors in GBIF that I've mentioned earlier on this blog. I want a tool that can help me interpret those errors, and I want it now, hence BioNames.

What is in BioNames?


The original data comes from the LSID metadata served by ION. At the moment BioNames has 4,880,925 names, 1,549,152 of which are linked to a bibliographic citation. The bulk of the time I spend on BioNames consists of cleaning and clustering these citations, and linking them to digital identifiers.

To get some insight into what is left to be done I created a CSV dump of the publication data underlying BioNames, and loaded it into Google's Cloud Storage (http://storage.googleapis.com/ion-names/names3.csv). I then used Google's BigQuery to write some simple SQL queries. You can find more details here: https://github.com/rdmpage/bionames-bigquery.

Here is a summary table of the number of names that are published in an article with one of the identifiers that I track. These include DOIs, PMIDs, as well as whether the article is in BioStor, has a URL (typically to a publisher's web site), or a PDF.
Identifier    Number of names
DOI           196,915
BioStor       130,792
JSTOR          23,483
CiNii          11,296
PMID            8,886
URL            72,754
PDF           161,474
(any)         489,029


The final row is the number of names published in an article that has at least one identifier (some articles have multiple identifiers, such as a DOI and a link to BioStor). Given that there are approximately 1.5 million names with bibliographic citations, and around 490,000 of them have an identifier, the user has about a 30% chance of finding the original description for an animal name picked at random. Obviously, BioNames has gaps (ION has missed a number of names and/or publications), the taxonomic coverage of bibliographic identifiers is uneven (depending on the publications taxonomists choose to publish in, and the level of digitisation of those publications), and there is still a lot of data cleaning to do. But an almost 1 in 3 chance of finding something useful for a name seems a reasonable level of progress.
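The 30% figure can be checked from the counts quoted in this post. The snippet below is just that arithmetic written out in Python; the variable names are made up for the illustration.

```python
# Counts as quoted in this post.
total_names = 4_880_925            # names in BioNames
names_with_citation = 1_549_152    # names linked to a bibliographic citation
names_with_identifier = 489_029    # "(any)" row of the identifier table

# Share of all names that have a citation at all.
share_cited = names_with_citation / total_names

# Share of cited names whose publication has at least one identifier.
share_findable = names_with_identifier / names_with_citation

print(f"Names with a citation: {share_cited:.1%}")
print(f"Cited names with an identifier: {share_findable:.1%}")
```

Both shares come out a little under one third, which matches the "almost 1 in 3" description.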

Out of interest I created some quick and dirty charts in Excel for different categories of identifier. Here, for example, is the percentage of names published each year that are linked to a publication with a DOI:
[Chart: names linked to a DOI, by year of publication]
Over 80% of names published in 2013 were in an article with a DOI, so we are fast heading to a situation where modern zoological taxonomy is fully part of the citation graph of science. Much of this spike in 2013 is due to the adoption of DOIs by Zootaxa, which is far and away the dominant journal in animal taxonomy.
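For readers who want to reproduce this kind of per-year breakdown without BigQuery or Excel, here is a minimal Python sketch. It assumes each record is a dictionary with `year` and `doi` keys; those key names and the toy rows are invented for the example and are not the actual columns of the CSV dump.

```python
from collections import defaultdict

def doi_percentage_by_year(records):
    """Return {year: percentage of names whose publication has a DOI}."""
    totals = defaultdict(int)
    with_doi = defaultdict(int)
    for rec in records:
        # Names with no year of publication go into their own bucket.
        year = rec.get("year") or "unknown"
        totals[year] += 1
        if rec.get("doi"):
            with_doi[year] += 1
    return {year: 100.0 * with_doi[year] / totals[year] for year in totals}

# Toy rows, invented for illustration; not taken from BioNames.
sample = [
    {"name": "Aus bus", "year": "2013", "doi": "10.1234/example.1"},
    {"name": "Aus cus", "year": "2013", "doi": ""},
    {"name": "Dus eus", "year": "1950", "doi": ""},
    {"name": "Fus gus", "year": "", "doi": ""},
]

result = doi_percentage_by_year(sample)
for year, pct in sorted(result.items()):
    print(year, f"{pct:.0f}%")
```

Rows read with `csv.DictReader` could be passed in directly once the key names are changed to match the real column headers. The same function works for any other identifier by swapping the `doi` key.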

Here is the same chart for publications in BioStor.
[Chart: names linked to a publication in BioStor, by year of publication]
The big spike at the start is for names where the year of publication is missing. Leaving that aside, we can see the impact of the 1923 copyright cut-off in the US, which puts a big dent in the Biodiversity Heritage Library's digitisation efforts. Note, however, that BHL has a lot of post-1923 content.


Does anyone use BioNames?



I use BioNames almost every day, and have devoted way more time than is healthy to populating it. As I explore issues like the quality of the taxonomy in GBIF, I find it useful to see the original description of a taxon, and its fate in subsequent revisions. In the early days I'd spend more time adding missing papers to help answer a question, but increasingly I'm finding that the content is already there. So, I find it useful, but what (gulp) if I'm the only one?

Below is the number of "sessions" per day since BioNames was launched (data from Google Analytics for May 1st, 2013 to May 31st, 2014). After an initial flurry of interest, web traffic died off pretty quickly. Since then it has been slowly gaining visitors, and then (for reasons which escape me) it started getting a lot more traffic from April onwards:
[Chart: BioNames sessions per day]
To give these numbers some context, for the same period BioStor (my archive of articles from BHL) had the following traffic:
[Chart: BioStor sessions per day]
Note the different scales: BioStor gets around 500 sessions a day on weekdays, while BioNames gets around 200. By way of comparison, GBIF gets up to 4000 sessions a day, and this blog typically has 50-100 sessions per day.

Where next?


There are a couple of directions for the future. There is still a lot of data cleaning and linking to do. Last year I did a quick analysis of which taxonomic journals should be digitised next. I've updated this by creating a spreadsheet that ranks the journals in BioNames by the number of names each has published, with each journal coloured by the fraction of those names for which I've found a digital identifier for the paper in which they were published. This table is incomplete, and reflects not only the extent of digitisation, but also the extent to which I've managed to locate the journals online. But it is a starting point for thinking about which journals to prioritise for digitisation or, if they are already digitised, which journals I need to target for addition to BioNames. The spreadsheet is available as a Google sheet.

Another direction is data mining. In addition to the obvious task, namely locating and indexing taxonomic names, there are other things to be done. In BioStor I extract geographic point localities and specimen codes from the OCR text. These could be indexed to enable geographic or specimen-based searching. The same approach could be generalised to the literature in BioNames, so that we could track the mentions of a particular specimen, or retrieve lists of publications about a specific locality (e.g., all taxonomic papers that refer to a particular mountain range, deep sea vent, or island).
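To give a flavour of what extracting point localities from OCR text can involve, here is a minimal Python sketch. The regular expression, the function name `extract_points`, and the sample sentence are all assumptions made for this illustration; this is not the code or the pattern that BioStor uses, and real OCR text would need to cope with seconds, decimal minutes, and misread symbols.

```python
import re

# Matches coordinates written like 12°30'S, 45°15'E (degrees and minutes only).
# Illustrative pattern only; not the one used by BioStor.
COORD = re.compile(
    r"(\d{1,2})°\s*(\d{1,2})'\s*([NS])[,;\s]+(\d{1,3})°\s*(\d{1,2})'\s*([EW])"
)

def extract_points(text):
    """Return a list of (latitude, longitude) pairs in decimal degrees."""
    points = []
    for lat_d, lat_m, ns, lon_d, lon_m, ew in COORD.findall(text):
        lat = int(lat_d) + int(lat_m) / 60
        lon = int(lon_d) + int(lon_m) / 60
        # Southern latitudes and western longitudes are negative.
        if ns == "S":
            lat = -lat
        if ew == "W":
            lon = -lon
        points.append((lat, lon))
    return points

# Invented sentence standing in for a fragment of OCR text.
ocr_text = "Holotype collected at 12°30'S, 45°15'E, under bark."
points = extract_points(ocr_text)
print(points)
```

Storing the resulting decimal coordinates alongside the page they came from is what would make a geographic search over the literature possible.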

BioNames also does some limited analysis of taxonomic name co-occurrence, for example suggesting that species names with the same specific epithet but different generic names are possible synonyms if they occur on the same page. There is a lot of scope for expanding this. I'm also keen to explore citation indexing, that is, extracting lists of literature cited from articles in BioNames, and linking those to the corresponding record in BioNames. Ultimately I want to be able to navigate through the taxonomic literature along these citation links, so that we can trace the fate of names through time.
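The co-occurrence heuristic described above (same specific epithet, different genus, same page) can be sketched in a few lines of Python. This is an illustration of the idea under simplifying assumptions, not the BioNames implementation: the function name, the input structure, and the placeholder names are all hypothetical, and only plain two-word binomials are considered.

```python
from collections import defaultdict
from itertools import combinations

def possible_synonyms(names_by_page):
    """Return pairs of binomials that share a specific epithet on the same page."""
    suggestions = set()
    for names in names_by_page.values():
        by_epithet = defaultdict(set)
        for name in names:
            parts = name.split()
            if len(parts) == 2:          # only plain binomials in this sketch
                by_epithet[parts[1]].add(name)
        for group in by_epithet.values():
            # Distinct binomials with the same epithet must differ in genus.
            for a, b in combinations(sorted(group), 2):
                suggestions.add((a, b))
    return suggestions

# Placeholder names, invented for illustration.
pages = {
    "page-1": ["Aus bus", "Xus bus", "Aus cus"],
    "page-2": ["Aus bus"],
}
suggestions = possible_synonyms(pages)
print(suggestions)
```

Exact string matching on the epithet is a deliberate simplification: epithets can change their ending to agree with the gender of the genus, so a fuller version would need to compare stems rather than whole words.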

But this is still only a start. Papers such as Seltmann et al. illustrate other things that are possible once we have a large corpus of taxonomic literature available:

Seltmann, K. C., Pénzes, Z., Yoder, M. J., Bertone, M. A., & Deans, A. R. (2013, February 18). Utilizing Descriptive Statements from the Biodiversity Heritage Library to Expand the Hymenoptera Anatomy Ontology. (C. S. Moreau, Ed.) PLoS ONE. Public Library of Science (PLoS). doi:10.1371/journal.pone.0055674


So, a lot still to be done. I hope to have achieved some of this if and when I write a follow-up post on the status of BioNames in a year's time.