Thursday, June 14, 2012

Taxonomy and the nine billion names of God

In Arthur C. Clarke's short story The Nine Billion Names of God, Tibetan monks hire two programmers to help them generate all the possible names of God. The monks believe that the purpose of the Universe is to generate those names, and that once that goal is achieved the Universe will end. As the understandably skeptical programmers leave, having completed their task, they look up into the sky and notice that "overhead, without any fuss, the stars were going out."

Leaving aside the delicious irony that arises if we recast this story with the monks replaced by taxonomists, much of our work with taxonomic names seems to be enumerating endless permutations of the same names. Part of the problem is the way some databases store and provide access to names.


The simplest way to represent a taxonomic name is to just have the name (the "canonical name"), without additional bits such as the taxonomic authority. In my view, any taxonomic database that serves names should provide the canonical name. I'm not arguing that they shouldn't provide taxonomic authority information (ideally separately, but could also be as part of a canonical name + authority string), I just want them to also provide just the canonical name. For some reason this seems to upset people (e.g., this thread on the TDWG mailing lists), so let me explain why I think this matters.

Most people use taxonomic names without the authority (just Google a taxonomic name with and without its authority and compare the number of hits). So, if your goal is to be of service to your users, make sure you provide the canonical name.

Then there is the issue of integrating data from different sources. The more parts there are to a name, the more scope there is for ambiguity. For example, my first ever publication was a description of a new species of pea crab, Pinnotheres atrinicola, published in:

Page, R. D. M. (1983). Description of a new species of Pinnotheres , and redescription of P. novaezelandiae (Brachyura: Pinnotheridae) . New Zealand Journal of Zoology, 10(2), 151–162. doi:10.1080/03014223.1983.10423904

If we look for this name in ION we discover three records:

Pinnotheres atrinacola              urn:lsid:organismnames.com:name:1192320
Pinnotheres atrinicola              urn:lsid:organismnames.com:name:371872
Pinnotheres atrinicola Page 1983    urn:lsid:organismnames.com:name:371873


Two are duplicates of "Pinnotheres atrinicola", with and without the authority, one is a misspelling ("Pinnotheres atrinacola"). Given just the name we already see that it's easy for people to get the spelling wrong and generate lexical variants.

If we now add the authority we get more potential for variation. ION write the authority as "Page 1983" (no comma), but other databases such as WoRMS write it as "Page, 1983" (with comma). So we now have two variations of the name and two of the authority, giving four possible strings if we include both name and authority. This combinatorial explosion means that we can rapidly generate lots of strings that all refer to fundamentally the same thing.
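To make the arithmetic concrete, here is a minimal Python sketch that enumerates the strings. The spellings are the ones quoted above; the function name `name_strings` is my own invention for illustration, not part of any database's API.

```python
from itertools import product

# Spellings of the name seen in ION (one correct, one misspelt)
names = ["Pinnotheres atrinicola", "Pinnotheres atrinacola"]

# The same authority written in the ION style and the WoRMS style
authorities = ["Page 1983", "Page, 1983"]


def name_strings(names, authorities):
    """Return every name + authority string that could turn up in the wild."""
    return [f"{name} {authority}" for name, authority in product(names, authorities)]


combined = name_strings(names, authorities)
for s in combined:
    print(s)
print(len(combined), "strings for one species")
```

Each extra source of variation multiplies the total: a third spelling of the name, or a third way of writing the authority, would take us from four strings to six.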

I'm not arguing that taxonomic authorities aren't useful, and I want them wherever they are known, but insisting that databases serve name + authority to the exclusion of just the canonical name is a recipe for disaster. One could argue that users can parse the string into name and authority components, but that's a headache (just take a look at taxon-name-processing for details). Why make users jump through hoops to get basic information?

Another reason I'm wary of taxonomic authority strings is that people don't always understand the conventions. For example, in my previous post I used the following example for names that differed in authority string:

  • Demansia torquata Günther 1862
  • Demansia torquata (Günther, 1862)

The use of parentheses seems a small difference, but (a) it means the strings are different, and (b) the presence or absence of parentheses changes the meaning of the authority. In this example, Demansia torquata Günther 1862 means that Günther is the original author of the name Demansia torquata, and so if I search Günther's publications from 1862 for "Demansia torquata" I will find that name. Demansia torquata (Günther, 1862), on the other hand, means that Günther originally described this species in 1862, but he placed it in a different genus, so my search for "Demansia torquata" in 1862 is likely to be fruitless. So, if the authority is actually (Günther, 1862) but a database tells me it's Günther, 1862 I'd be wasting my time looking for the name in 1862.
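As a rough illustration of how much meaning hangs on those brackets, here is a small Python sketch that reads a simple "Author year" authority string and reports whether it claims the original combination. The regular expression and the function name `parse_authority` are assumptions of mine, and they only cope with the single-author, four-digit-year case shown above; real authority strings are far messier.

```python
import re

# Optional "(", an author, an optional comma, a four-digit year, optional ")"
AUTHORITY = re.compile(
    r"^(?P<open>\()?\s*(?P<author>.+?)\s*,?\s*(?P<year>\d{4})\s*(?P<close>\))?$"
)


def parse_authority(text):
    """Split a simple authority string; return None if it doesn't fit the pattern."""
    match = AUTHORITY.match(text.strip())
    if match is None:
        return None
    parenthesised = bool(match.group("open")) and bool(match.group("close"))
    return {
        "author": match.group("author"),
        "year": int(match.group("year")),
        # No parentheses: the author published the name in this genus
        "original_combination": not parenthesised,
    }


print(parse_authority("Günther 1862"))
print(parse_authority("(Günther, 1862)"))
```

Both strings yield the same author and year, so stripping the punctuation makes them look identical, yet the `original_combination` flag differs, and that flag is exactly what decides whether a search of the 1862 literature for "Demansia torquata" is worth attempting.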

As it turns out, this snake was originally described as Diemansia torquata (see "On new species of snakes in the collection of the British Museum" http://biostor.org/reference/50221). The genus name Diemansia differs from Demansia, hence (Günther, 1862) should be correct, but it looks like Diemansia and Demansia are just some of the variations of the same snake genus (see for example http://biodiversitylibrary.org/page/22393791). *Sigh*

Variation in taxonomic authority extends beyond parentheses. In a post on clustering strings I used examples of taxonomic authorities for the genus Helicella:

Ferrusac 1821
Bonavita 1965
Ferussa 1821
Fer.
Lamarck 1812
Ferussac 1821

There are six different strings here which correspond to three different authorities. In this example the name Helicella is a homonym (same name used for different taxa) so having the taxonomic authority can help decide which name is actually meant, but people can't seem to agree on how to spell the authority names, and in other cases they might not agree on dates of publication, hence we get variations such as those above. Even when authorities are useful, they come at a cost. And that's not even considering chresonyms where the authority isn't the original author, but instead is a form of citation of the use of a name.
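To give a feel for what reconciling these strings involves, here is a minimal sketch of greedy clustering using the similarity ratio from Python's standard `difflib` module. This is not the method from my clustering post; the function name `cluster_strings` and the 0.8 threshold are arbitrary choices made for illustration.

```python
from difflib import SequenceMatcher

authorities = [
    "Ferrusac 1821",
    "Bonavita 1965",
    "Ferussa 1821",
    "Fer.",
    "Lamarck 1812",
    "Ferussac 1821",
]


def cluster_strings(strings, threshold=0.8):
    """Put each string in the first cluster whose first member it resembles."""
    clusters = []
    for s in strings:
        for cluster in clusters:
            # Compare against the cluster's first member only (a crude shortcut)
            if SequenceMatcher(None, s, cluster[0]).ratio() >= threshold:
                cluster.append(s)
                break
        else:
            # Nothing similar enough, so start a new cluster
            clusters.append([s])
    return clusters


clusters = cluster_strings(authorities)
for cluster in clusters:
    print(cluster)
```

With this threshold the three spelled-out variants of Ferussac end up together, but the abbreviation "Fer." is too short to resemble any of them and is left on its own. So even this tidy example needs extra knowledge about abbreviations before the six strings collapse to three authorities.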

All of this variation is a cause of ambiguity, and when we combine permutations of taxonomic names and taxonomic authorities, things start to get messy. Indeed, I'd argue that projects such as the Global Names Index (GNI) are essentially doing what Arthur C. Clarke's monks were doing: trying to capture near endless permutations of the same names. Given this, it seems crazy not to try and keep things as simple as possible. In the vast majority of cases I want the name; I don't want the rest of the cruft attached to it. Taxonomic authorities are really just proxies for citation, so let's focus on getting that information linked to names, and stop making life difficult for users.