Monday, August 26, 2013

Bat names, ligatures, and fonts

Notes on fonts while trying to decipher some bat names (and sort out some synonyms). It can be difficult to tell "œ" (o + e) from "æ" (a + e) in some texts. For example, BHL page 35549499 has "Chœrephon":

Chœrephon


Here is the name on BHL 37695497 (Catalogue of the Chiroptera in the collection of the British Museum):

Chærephon


This genus of bats is named for "Χαιρεφῶν, Aristophanes, Aves, 1296, 1564." according to Dobson 1874 (On the Asiatic species of Molossi), but is regarded as having been amended by Dobson 1878 to Chærephon. It took some time squinting at the BHL scan to see that Dobson 1878 and Dobson 1874 did actually write the name differently. Why does this matter? Well, the GBIF classification has both "Choerephon" and "Chaerephon" in its list of bat genera in the family Molossidae, and it can have one or other, but not both.
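A minimal sketch of how spellings that differ only in "œ" versus "æ" might be flagged automatically. The helper names are hypothetical, the ligature table covers only the two ligatures discussed here, and the matching rule (treat "ae" and "oe" as interchangeable) is an assumption that will also throw up false positives for names that legitimately differ.

```python
# Hypothetical helpers; only the ligatures discussed in this post are handled.
LIGATURES = {"œ": "oe", "Œ": "Oe", "æ": "ae", "Æ": "Ae"}


def expand_ligatures(name):
    """Replace each ligature with its two-letter spelling."""
    return "".join(LIGATURES.get(ch, ch) for ch in name)


def ligature_key(name):
    """Collapse 'ae' and 'oe' to the same placeholder so that names
    differing only in that digraph end up with the same key."""
    expanded = expand_ligatures(name).lower()
    return expanded.replace("ae", "*e").replace("oe", "*e")


def possible_ligature_variants(names):
    """Return groups of distinct spellings that share a ligature key."""
    groups = {}
    for name in names:
        groups.setdefault(ligature_key(name), set()).add(expand_ligatures(name))
    return [sorted(group) for group in groups.values() if len(group) > 1]


genera = ["Chœrephon", "Chærephon", "Molossus", "Tadarida"]
print(possible_ligature_variants(genera))
# [['Chaerephon', 'Choerephon']]
```

Anything this flags still needs a look at the original scan; the code only tells you where to squint.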

Thursday, August 15, 2013

BioNames update - taxonomic name timelines

One feature I've always wanted to have in BioNames is a timeline of taxonomic names. ION has one (see here), but I wanted a way to go from the timeline to the actual publications. In other words, if, say, there were approximately 99 bird names published in 2012, I want to see the papers that published those names.

As an example, you can go to http://bionames.org/timeline/Animalia/Chordata/Vertebrata/Aves and get a timeline of bird names:
Birds
The data is incomplete (I'm still processing and indexing the data) but you get a sense that the number of bird names being coined each year is fairly small. Actually, I was surprised it was as high as it is, but remember these are not the number of new species described each year. It does include new species (many of them are fossils in this case), but also higher taxa and nomenclatural changes (e.g., replacement names for homonyms, etc.). The timeline also only shows names that are "new" (i.e., not the new combinations that result when a species gets moved to a new genus), and only those names linked to a publication.

The timeline graphs are clickable, so you can click on a year and get a list of publications for that taxon for that year (sometimes this can take a while). You can click on the publications for more details, and sometimes you can also view the full text.

The timeline page also shows a treemap of the taxonomic groups recognised by the ION database (the example below is for birds):

Treemap

Browsing different taxa shows some interesting patterns. For example, here are snakes:

Snakes
That huge spike on the far right? That's due to hundreds of names published by "Snake Man" Raymond Hoser (his activities have been the subject of an impassioned debate on TAXACOM).

The timeline for insects shows a major dip in new names that corresponds to the Second World War, followed by a big jump in the late sixties.

Insects

Smaller taxa, such as Teuthida, show a more episodic pattern where a single monograph can result in a prominent spike in the numbers in any one year (again, you can click on the spikes to see the actual publications):

Squid

There is still a daunting amount of cleaning and linking to do, but it's one more way to explore the efforts of generations of taxonomists to discover and make sense of the diversity of animal life on the planet.




Wednesday, August 14, 2013

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).

Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera. As discussed in the gibbon example, GBIF merges several competing classifications for mammals, and these often don't agree on the "accepted name" for a species. In the absence of a decent database of taxonomic synonyms, GBIF ends up duplicating species, and each duplicate is often associated with different occurrence data. If you are trying to get the distribution for a species this can be a disaster.

To get a sense of the scale of the problem I put together a simple tool to create cluster maps. The code is on GitHub, and there is a live service at http://iphylo.org/~rpage/cluster-map/. The service takes a simple tab-delimited file that lists sets and their members, computes the overlap between the sets, calls Graphviz to lay out a graph in SVG, then draws in the members of each cluster (phew).

The input file looks something like this:

Molossops aequatorianus
Chaerephon aloysiisabaudiae
Tadarida aloysiisabaudiae
Chaerephon ansorgei
Tadarida ansorgei
Molossus ater
Mormopterus petrophilus
Sauromys petrophilus
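As a rough illustration of the "computes the overlap between the sets" step, here is a minimal Python sketch. It is not the code behind the service: the function names are hypothetical, it assumes one set name and one member per line, and it leaves out the Graphviz layout and SVG drawing entirely.

```python
from collections import defaultdict
from itertools import combinations


def read_sets(lines):
    """Parse 'set<TAB>member' lines into {set_name: {members}}."""
    sets = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: set name first, member second, separated by whitespace.
        set_name, member = line.split(None, 1)
        sets[set_name].add(member)
    return dict(sets)


def overlaps(sets):
    """Return {(set_a, set_b): shared members} for each overlapping pair."""
    shared = {}
    for a, b in combinations(sorted(sets), 2):
        common = sets[a] & sets[b]
        if common:
            shared[(a, b)] = common
    return shared


sample = [
    "Molossops\taequatorianus",
    "Chaerephon\taloysiisabaudiae",
    "Tadarida\taloysiisabaudiae",
    "Chaerephon\tansorgei",
    "Tadarida\tansorgei",
    "Molossus\tater",
    "Mormopterus\tpetrophilus",
    "Sauromys\tpetrophilus",
]

for (a, b), members in overlaps(read_sets(sample)).items():
    print(a, b, sorted(members))
# Chaerephon Tadarida ['aloysiisabaudiae', 'ansorgei']
# Mormopterus Sauromys ['petrophilus']
```

Each overlapping pair would become a shared region between two clusters in the final diagram.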


What can we do with this tool? Well, I created a quick list of all the species of bat in the family Molossidae according to GBIF. The sets are the bat genera, the members are the species (you can see the file here). I then ran this through the cluster map, and got something like this (this is only part of the cluster map):

Bats

(now can you see why I call these "papaya plots"?). Note that there are species names (i.e., specific epithets) in common to more than one genus. Some of these may be perfectly OK (it's not unusual for the same epithet to be used in different species, e.g. "major", etc.). But in many cases these bat species turn out to be the same species, just in different genera in different classifications. For example, GBIF has both Cynomops greenhalli and Molossops greenhalli. These are the same thing. Species in the genus Mormopterus may also occur in other genera. In some cases the issue is competing classifications, sometimes it is conflict over whether a species is a species or merely a subspecies, and some generic conflicts are because some genera are relegated to subgeneric status in some classifications. In short, it's an unholy mess.

Does this matter? Well, consider Mormopterus petrophilus and Sauromys petrophilus, both of which GBIF regards as valid species (they're the same thing). Here are the distributions for the two different names in GBIF:

Mormopterus    Sauromys


Depending on which name you use you'll get a very different picture of the distribution of this bat.

The next step is to figure out how to fix this. Is there a way we can automate fixing the GBIF classification so that it is not riddled with spurious duplicates like these?

Monday, August 05, 2013

GBIF and open biodiversity data: what license should GBIF use?

GBIF is asking for views on how it should license data in the GBIF network. The full consultation document is available from Google Drive and DropBox. GBIF is:

...seeking input from all GBIF Participants and stakeholders on the following questions:
  1. Do you have any comments on the plan to associate all GBIF-mediated data with a machine readable licence?
  2. Do you have an opinion on the relative merits of Creative Commons, Open Data Commons or other licence types in the context of the GBIF network?
  3. Which of the two options described in section 8 of this document should GBIF pursue? If you support “Option 2”, would your position be modified if it resulted in a significant decrease in data published to the GBIF network?


The two options referred to above are:

Option 1 – Support restrictions on commercial use

Option 2 – Only support fully free-and-open data

If you have opinions on licensing biodiversity data, please read the consultation document and send your thoughts to licensing@gbif.org by 5 September 2013.

Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in databases (in this case, UniProt) can be flagged by constructing queries in SPARQL that return results if there is a problem (for example if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. If we know a name is a homonym of an older name, but that name exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appeared in different genera within the same family.

The advantage of using RDF and SPARQL in this context is that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies) then queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.
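To make the idea of "taxonomy tests" a little more concrete, here is a sketch of two such checks in plain Python rather than SPARQL. It does not use a triple store or any particular vocabulary. The record layout and function names are hypothetical, and the synonym pair uses made-up placeholder names ("Aus bus", "Xus bus"). Each test returns something only if there is a problem, which is the same pattern the queries would follow.

```python
from collections import defaultdict


def epithet_in_multiple_genera(nodes):
    """Flag epithets that occur in more than one genus of the same family.

    nodes: iterable of (family, genus, epithet) tuples (assumed layout).
    """
    genera = defaultdict(set)
    for family, genus, epithet in nodes:
        genera[(family, epithet)].add(genus)
    return {key: sorted(g) for key, g in genera.items() if len(g) > 1}


def synonyms_both_present(nodes, synonym_pairs):
    """Flag known synonym pairs where both names are nodes."""
    names = {f"{genus} {epithet}" for _, genus, epithet in nodes}
    return [pair for pair in synonym_pairs if pair[0] in names and pair[1] in names]


# Names taken from the GBIF bat examples in the earlier post.
bats = [
    ("Molossidae", "Cynomops", "greenhalli"),
    ("Molossidae", "Molossops", "greenhalli"),
    ("Molossidae", "Mormopterus", "petrophilus"),
    ("Molossidae", "Sauromys", "petrophilus"),
    ("Molossidae", "Molossus", "ater"),
]
print(epithet_in_multiple_genera(bats))

# Placeholder names, purely illustrative.
placeholder_nodes = [("Idae", "Aus", "bus"), ("Idae", "Xus", "bus")]
print(synonyms_both_present(placeholder_nodes, [("Aus bus", "Xus bus")]))
```

The first test will also flag harmless cases (epithets like "major" that are legitimately reused), so its results are candidates for checking, not confirmed errors.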

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.