Friday, October 19, 2012

The failure of phylogeny databases


It is well known that phylogeny databases such as TreeBASE capture a small fraction of the published phylogenies. This raises the question of how to increase the number of trees that get archived. One approach is compulsion:


In other words:
  1. Databasing trees is the Right Thing™ to do
  2. Few people are doing the Right Thing™
  3. This is because those people are bad/misguided and must be made to see the light

I want to suggest an alternative explanation:
  1. It is not at all obvious that databasing trees is useful
  2. The databases we have suck
  3. There's no obvious incentive for the people producing trees to database them
Why do we need a database of trees?

That we don't have a decent, widely used database of trees suggests that the argument still has to be made. Way back in the mid 1990's when TreeBASE was first starting I was at Oxford University and Paul Harvey (coauthor of The Comparative Method in Evolutionary Biology) was sceptical of its merits. Given that the comparative method depends on phylogenies, and people like Andy Purvis were in the Harvey lab building supertrees (http://dx.doi.org/10.1098/rstb.1995.0078) this may seem odd (it certainly did to me) but Paul shared the view of many systematists. Phylogenies are labile, they change with increased data and taxon sampling, hence individual trees have a short life span.

Data, in contrast, is long-lived. You'd happily reuse GenBank sequences published a decade ago, you probably wouldn't use a decade-old phylogeny. I made this point in an earlier post about the data archive Dryad (Data matters but do data sets?). A problem facing packages of data (such as papers, data sets, and phylogenies) is that the package itself may be of limited interest, beyond reproducing earlier results and benchmarking. In the case of phylogenies, if someone has a tree ((a,b),c) and someone else has a tree ((d,e),f), it's not obvious that we can combine these. But if we have sequences for the same gene from the same six taxa we can build a larger tree, say (((a,d),(b,e)),(c,f)).

I think this is part of the reason why GenBank works. Yes, there is compulsion (it's very hard to publish on sequences if you haven't deposited the data in GenBank), but there are clear benefits of depositing data. As the database grows we can do bigger analyses. If you are trying to identify a species based on its DNA, the chances are that the nearest sequence will have been deposited by somebody else. By depositing data your work it also lasts longer than if people just had the paper (your tree is likely to be outdated, that sequence from a rare, hard to obtain species might be used for decades to come).

Note that I'm not saying a database of trees isn't a good idea, but there seems to be an assumption that it is so obvious that it doesn't need justification. Demonstrably this isn't the case. Maybe we should figure out what we'd want to do with such a database, then tackle how we'd make that possible. For example, I'd want to query a phylogeny database geographically (show me trees from this part of the globe), by ecological association (find the trees for any parasites on this clade), by temporal period (what clades originated in the Miocene?), by data (what trees used this sequence which we now know is chimeric?), by topology (have we settled on the sister group to snakes yet?), and so on. I would also argue that much of this is doable, but might not actually require archiving published phylogenies. Personally I think anybody tackling these questions would do well to use PhyLoTA as their starting point.

TreeBASE sucks

Yes, I'm as sick of saying this as you are of reading it. But it doesn't change the fact that just about everything about TreeBASE from the complexity of the underlying data model, the choice of programming language, the use of a Java applet to display trees, the Byzantine search interface, and the voluminous XML output make TreeBASE a bag of hurt. None of this would matter much if it was an indispensable part of people's research toolkit, but this isn't the case. If you are trying to convince people of the benefits of sharing trees you really want a tool that makes a it seem a no brainer. We aren't there yet.

The "fuck this" point

In a great post on the piracy threshold, Matt Gemmell argues that piracy is largely the fault of content providers because they make being honest too difficult. How many times have you wanted to buy something such as a book or a movie only to discover that the content provider doesn't sell it in your part of the world (e.g., in the iBooks store in the US but not the UK) or doesn't provide it in the media you want (e.g., DVD but not online)? To top it off every time you go to the movies you are subjected to emotional blackmail or threats of unlimited fines if you were to copy the movie you already paid to watch?

6892585935 32d4e21e77 o

I think databases have the same "fuck this" threshold. If you are asking people to submit data you want to make it as easy as possible. And you want at least some of the benefits to be immediate and obvious. Otherwise you are left with coercing people, and that's being, at best, lazy.

If you want an example of how to do it right, look at Mendeley's model. They want to build a public cloud of academic papers, a laudable goal, the Right Thing™ to do. But they sell the idea not as a public good, not as the Right Thing™, nor by trying to compel people (they can't, they're a private company). Instead they address a major point of pain - where the hell did I put that PDF? - and make it trivial to organise your collection of articles. Then they make it possible to back them up to the cloud, to view them on multiple devices, to share them, and viola, we get a huge database of publications. The sociology works. So, my question is, what would the equivalent be for phylogenetics?