Tuesday, January 18, 2011

Quantum treemaps meet BHL and the Australian Faunal Directory

One of the things I'm enjoying about the Australian Faunal Directory on CouchDB is the chance to play with some ideas without worrying about breaking lots of code or, indeed, upsetting any users ('cos, let's face it, there aren't any). As a result, I can start to play with ideas that may one day find their way into other projects.

One of these ideas is to use quantum treemaps to display an author's publications. For example, below is a treemap showing publications by G A Boulenger in my Australian Faunal Directory on CouchDB project. The publications are clustered by journal. If a publication has been found in BioStor the treemap displays a thumbnail of that publication, otherwise it shows a white rectangle. At a glance we can see where the gaps are. You can view a publication's details simply by clicking on it.

boulenger.png

The entomologist W L Distant has a more impressive treemap, and clearly I need to find quite a few of his publications.

distant.png

I quite like the look of these, so may think about adding this display to BioStor. I may also think about using treemaps in my ongoing iPad projects. If you want to see where I'm going with this then take a look at Good et al. A fluid treemap interface for personal digital libraries.

Notes
The quantum treemap is computed using some rather ugly PHP I wrote, based on this Java code. I've not implemented all the refinements of the original Java code, so the quantum treemaps I create are sometimes suboptimal. To avoid too much visual clutter I haven't drawn a border around each cell; instead I use CSS gradients to indicate the area of the cell (if you're using Internet Explorer the gradient will be vertical rather than going from top left to bottom right). The journal name is overlain on the cell contents, but if you are using a decent browser (i.e., not Internet Explorer) you can still click through this text to the underlying thumbnail because the text uses the CSS property:
.overlay { pointer-events: none; }
I learnt this trick from the Stack Overflow question Click through div with an alpha channel.
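
For anyone curious about what makes a treemap "quantum", here is a minimal Python sketch of the core idea: each group of fixed-size thumbnails gets a whole number of columns and rows rather than an arbitrary rectangle. This is not my PHP, nor the original Java code, and it leaves out the recursive splitting and all the refinements; the function names `quantize` and `cell_positions` are made up for illustration.

```python
import math


def quantize(n_items, width, height):
    """Return (cols, rows) for a grid of equal-sized cells that holds
    n_items and roughly follows the aspect ratio width:height.

    Illustrative only: a real quantum treemap also splits the available
    space recursively between groups before quantizing each one.
    """
    if n_items <= 0:
        return (0, 0)
    aspect = width / height
    # With square cells, cols / rows ~ aspect and cols * rows ~ n_items,
    # so cols ~ sqrt(n_items * aspect).
    cols = max(1, round(math.sqrt(n_items * aspect)))
    rows = math.ceil(n_items / cols)
    # Trim any columns that the chosen number of rows leaves unused.
    cols = math.ceil(n_items / rows)
    return (cols, rows)


def cell_positions(n_items, cols, cell_w, cell_h):
    """Top-left (x, y) of each item, filling the grid row by row."""
    return [((i % cols) * cell_w, (i // cols) * cell_h)
            for i in range(n_items)]


if __name__ == "__main__":
    cols, rows = quantize(10, 2, 1)   # ten thumbnails, landscape box
    print(cols, rows)
    print(cell_positions(3, 2, 10, 20))
```

Because every cell has the same size, the thumbnails line up on a grid, at the cost of some empty cells whenever the number of publications doesn't fill the last row.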

Friday, January 14, 2011

The demise of phthiraptera.org and the perils of using Internet domain names as identifiers

When otherwise sensible technorati refer to "owning" a domain name, it makes me want to stick forks in my eyeballs. We do not "own" domain names. At best, we only lease them and there are manifold ways in which we could lose control of a domain name - through litigation, through forgetfulness, through poverty, through voluntary transfer, etc. Once you don't control a domain name anymore, then you can't control your domain-name-based persistent identifiers either. - Geoffrey Bilder interviewed by Martin Fenner
Geoffrey Bilder's comments about the unsuitability of URLs as long term identifiers (as opposed, say, to DOIs) came to mind when I discovered that the domain phthiraptera.org is up for sale:

Snapshot 2011-01-14 07-47-39.png

This domain used to be home to a wealth of resources on lice (order Phthiraptera). I discovered that ownership of the domain had expired when a bunch of links to PDFs returned by an iSpecies search for Collodennyus all bounced to the holding page above. Phthiraptera.org was owned by the late Bob Dalgleish. After his death, ownership of the domain lapsed, and it's now up for sale. Although much of the content of Phthiraptera.org has been moved to phthiraptera.info, URLs containing phthiraptera.org still turn up in search results, especially ones that have been cached (for example, in iSpecies). Given that much of the content is still available the loss isn't total, but anyone relying on links containing phthiraptera.org to point to content (such as a PDF), or to identify that content (such as a publication) will find themselves in trouble. Although ideally Cool URIs don't change, in practice they do, and with alarming frequency. Furthermore, in this case, because ownership of phthiraptera.org has lapsed, there's no opportunity to create redirects from URLs with phthiraptera.org to the equivalent content in phthiraptera.info (leaving aside the issue that phthiraptera.info is not a mirror of phthiraptera.org, so exactly what the redirects would point to is unclear).

Identifiers based on domain names, such as URLs and LSIDs, are attractive because the DNS helps ensure global uniqueness, and HTTP provides a way to resolve the identifier, but all this is contingent on the domain itself persisting. For more on this topic I recommend reading Martin Fenner's interview of CrossRef's Geoffrey Bilder, from which I took the opening quote.

Tuesday, January 11, 2011

Why won't The Plant List let me do this?

In my last post I discussed why I thought the decision of The Plant List to use a restrictive license (CC-BY-NC-ND) was such a poor choice. CC-BY-NC-ND states that
You may not alter, transform, or build upon this work.
To make this point more concrete, I've created this site:

Experiments with The Plant List

to show the kinds of things that The Plant List's choice of license prevents the taxonomic community from doing. As a first step I'm exploring linking the names in the list to the primary scientific literature, as this video demonstrates:

The Plant List from Roderic Page on Vimeo.


For example, we can take a name like Begonia zhengyiana Y.M.Shui, parse the bibliographic citation provided by The Plant List (via IPNI), and locate the actual paper online. In this case it's freely available as a PDF:



Now we can see a drawing of the plant, and instead of simply trusting that the compilers of The Plant List have correctly interpreted this paper, we can see for ourselves. Down the track, we could imagine mining this paper for details about the plant, such as its morphology and geographic distribution. This requires the link to the original literature, which The Plant List lacks.

A good chunk of the recent plant taxonomic literature has DOIs, for example journals such as the Kew Bulletin and Novon. Playing with some scripts I've managed to associate nearly 9000 accepted names with a DOI, and that's by looking at only a few journals. There are lots more DOIs to be found, but because of the way botanical nomenclators record references (see my post Nomenclators + digitised literature = fail) it can be something of a challenge to find them. This task isn't helped by the fairly lax way some publishers enter data in CrossRef (Cambridge University Press, I'm looking at you). The other obvious source of digitised literature is, of course, BHL, and that's next on the list of resources to play with.
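
To give a flavour of the DOI-hunting step, here is a rough Python sketch of building a lookup against CrossRef for a free-text citation. It is not one of the scripts mentioned above. It assumes CrossRef's public REST API (the `works` endpoint with its `query.bibliographic` parameter), and the citation string is a made-up placeholder. The sketch only builds the URL; fetching it and checking that the top hit really is the right paper is left out.

```python
from urllib.parse import urlencode

# Assumption: CrossRef's public REST API endpoint for works.
CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(citation, rows=1):
    """Build a CrossRef search URL for a free-text bibliographic citation.

    `citation` is whatever string the nomenclator supplies; `rows` limits
    how many candidate matches CrossRef is asked to return.
    """
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder citation, not a real reference.
    print(crossref_query_url("Journal name 12: 34"))
```

The terse citations that nomenclators record (abbreviated journal, volume, page) are exactly what makes matching hard, so any candidate DOI returned this way would still need checking against the journal, volume and page before being attached to a name.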

Experiments with The Plant List is very crude, and I've barely scratched the surface of linking names to primary literature. That said, given that there are exactly zero links between names and digital literature in The Plant List, I'd argue that my site adds value to the data in The Plant List. And that's my point: by making data available for others to play with, you enable others to add value to that data. By choosing a CC-BY-NC-ND license, The Plant List has killed that possibility.

So, my question for The Plant List is "why did you do that?"