Wednesday, January 31, 2007

When phylogenetic names would be useful

To avoid being charged with being consistent (unlikely, I know), despite being underwhelmed by phylogenetic names in the context of TreeBASE (see conversation with David Marjanović in the previous post), I think they could be very useful in annotating phylogenetic trees. One of the things that impresses me most about Google Earth is the community that has developed around this tool (take a look at the latest issue of Sightseer). Lots of people are contributing content to Google Earth, and I think this is driven in part by the ease of creating the KML files that Google Earth uses, and the ease of referring to a geographic location (i.e., latitude and longitude).

Imagine having a similar tool for phylogenetics, where people could add annotations (such as "wings developed here", "oldest fossil found at this site", "pollination mode changed here", "change in base composition took place here", etc.). A bit like MacClade on steroids, or a community driven version of the TaxonTree tool developed by the University of Maryland's Human-Computer Interaction Lab (see their page on Biodiversity Informatics Visualization).

Leaving aside the issue of file format, we'd need a way of unambiguously locating a point on a tree. An obvious solution is to use least common ancestors (same concept as most recent common ancestors), as used in phylogenetic nomenclature (although not in apomorphy-based definitions, which strike me as a disaster waiting to happen).

It's in this sort of context that I think phylogenetic names show great promise. For some nice examples of using LCAs to identify nodes, take a look at the mor server (seemingly down at present, but described in a paper by David Hibbett and colleagues, doi:10.1080/10635150590947104), and Michael Sanderson's program r8s. The mor package uses phylogenetic definitions of names to locate the corresponding groups in a large phylogenetic tree, and r8s has the MRCA command to attach age constraints to nodes in a tree.
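As a rough illustration of addressing a point on a tree by its tips rather than by an internal label, here is a minimal sketch in Python. The toy tree, the node names `n1`, `n2` and `root`, and the annotation records are all made up for the example; this is not how mor or r8s are implemented.

```python
# Hypothetical toy tree stored as child -> parent.
# Internal nodes carry arbitrary labels (n1, n2, root) that a user never needs to know.
PARENT = {
    "Passer": "n1",
    "Gallus": "n1",
    "n1": "n2",
    "Archaeopteryx": "n2",
    "n2": "root",
    "Crocodylus": "root",
    "root": None,
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def mrca(tips):
    """Most recent common ancestor of one or more tips."""
    tips = list(tips)
    common = ancestors(tips[0])
    for tip in tips[1:]:
        seen = set(ancestors(tip))
        # Keep only ancestors shared with this tip, preserving tip-to-root order.
        common = [n for n in common if n in seen]
    return common[0]


# Annotations are addressed by specifier tips; the notes are placeholders.
annotations = [
    {"specifiers": ("Archaeopteryx", "Passer"), "note": "annotation 1"},
    {"specifiers": ("Passer", "Gallus"), "note": "annotation 2"},
]


def locate(annotation):
    """Resolve an annotation to the node it is attached to on this tree."""
    return mrca(annotation["specifiers"])
```

The same annotation can be resolved on any other tree that contains its specifiers, which is what makes the pair of tips a reasonable "coordinate".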

Tuesday, January 30, 2007

Quixotic(?) tree of Life visualisation

The Ant Room has a nice post on Visualizing the tree of life, with some cool links. And just to balance that, Donat Agosti drew my attention to Ford Doolittle and Eric Bapteste's PNAS article "Pattern pluralism and the Tree of Life hypothesis" doi:10.1073/pnas.0610699104. The abstract:
Darwin claimed that a unique inclusively hierarchical pattern of relationships between all organisms based on their similarities and differences [the Tree of Life (TOL)] was a fact of nature, for which evolution, and in particular a branching process of descent with modification, was the explanation. However, there is no independent evidence that the natural order is an inclusive hierarchy, and incorporation of prokaryotes into the TOL is especially problematic. The only data sets from which we might construct a universal hierarchy including prokaryotes, the sequences of genes, often disagree and can seldom be proven to agree. Hierarchical structure can always be imposed on or extracted from such data sets by algorithms designed to do so, but at its base the universal TOL rests on an unproven assumption about pattern that, given what we know about process, is unlikely to be broadly true. This is not to say that similarities and differences between organisms are not to be accounted for by evolutionary mechanisms, but descent with modification is only one of these mechanisms, and a single tree-like pattern is not the necessary (or expected) result of their collective operation. Pattern pluralism (the recognition that different evolutionary models and representations of relationships will be appropriate, and true, for different taxa or at different scales or for different purposes) is an attractive alternative to the quixotic pursuit of a single true TOL.

Encyclopedia of Life

Imagine an electronic page for each species of organism on Earth, available everywhere by single access on command. The page contains the scientific name of the species, a pictorial or genomic presentation of the primary type specimen on which its name is based, and a summary of its diagnostic traits. The page opens out directly or by linkage with other databases such as ARKive, Ecoport, and GenBank. It comprises a summary of everything known about the species' genome, proteome, geographic distribution, phylogenetic position, habitat, ecological relationships, and, not least, its perceived practical importance for humanity. - E. O. Wilson, 2003

E. O. Wilson's much used quote features prominently on the EoL Informatics web site. This project involves the Smithsonian Institution, Field Museum, Harvard University, Biodiversity Heritage Library, and the MBL. I will be at the Informatics Workshop next month.

For my own toy efforts in this direction, see iSpecies.


Comments by David Marjanović elsewhere on this blog (here and here) about TreeBASE, classification and Phylocode have prompted me to write a little bit about why I'm underwhelmed by the Phylocode. Suppose I have the question:
"find me all studies in TreeBASE that contain birds"?

How do I answer this? Well, my approach is to do the following. Firstly, I attempt to map every name in TreeBASE onto a name in an external database, such as NCBI Taxonomy, uBio, etc. Once I've done this, I download the NCBI taxonomic classification, and use it to query the mapped names.

Querying a classification
The basic idea for querying trees is nicely explained by Aaron Mackey in his article Relational Modeling of Biological Data: Trees and Graphs, and I've used this approach elsewhere in the Glasgow Taxonomic Name Server. You take the tree, compute left and right visitation numbers, and then use those numbers to query the database. For more details take a look at my notes here. For example, given the tree below (taken from Aaron's article) let's imagine that node 4 (shown in blue in the diagram) is "Aves".

Then, if I search for nodes with a left_id > 3, and a right_id < 10, I get nodes 10, 11, and 12. These are the birds. So, I find each name in TreeBASE that maps onto these nodes, and those are the birds in TreeBASE.
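To make the left/right numbering concrete, here is a minimal sketch using Python's built-in sqlite3 module. The toy classification, the table name `node` and the column names are assumptions for illustration; the tree is not the one from Aaron's article, so the numbers differ from those quoted above.

```python
import sqlite3

# Hypothetical toy classification: parent -> children.
CHILDREN = {
    "Amniota": ["Mammalia", "Aves"],
    "Mammalia": ["Homo"],
    "Aves": ["Passer", "Gallus", "Struthio"],
}


def number_nodes(root):
    """Assign left/right visitation numbers by depth-first traversal."""
    rows = []
    counter = 0

    def visit(name):
        nonlocal counter
        counter += 1
        left = counter            # number assigned on the way down
        for child in CHILDREN.get(name, []):
            visit(child)
        counter += 1              # number assigned on the way back up
        rows.append((name, left, counter))

    visit(root)
    return rows


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE node (name TEXT PRIMARY KEY, left_id INTEGER, right_id INTEGER)"
)
db.executemany("INSERT INTO node VALUES (?, ?, ?)", number_nodes("Amniota"))


def descendants(name):
    """All nodes whose interval lies strictly inside the interval of `name`."""
    left, right = db.execute(
        "SELECT left_id, right_id FROM node WHERE name = ?", (name,)
    ).fetchone()
    rows = db.execute(
        "SELECT name FROM node WHERE left_id > ? AND right_id < ? ORDER BY left_id",
        (left, right),
    )
    return [row[0] for row in rows]
```

Here "Aves" gets the interval (6, 13), so `descendants("Aves")` is a single range query with no recursion, which is the attraction of the scheme.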

Phylogenetic nomenclature
How would I do this using phylogenetic nomenclature? Well, I can't just use the specifiers of a name, because there is no guarantee that those specifiers will be in the tree. For example, Paul Sereno's Taxon Search site lists the definition of "Aves" as:
The least inclusive clade containing Archaeopteryx lithographica Meyer 1861 and Passer domesticus (Linnaeus 1758).

Neither taxon occurs in TreeBASE to date! So, how would I search for birds (and be confident that I can retrieve all the studies on birds)?
Well, if I had a larger tree (such as a supertree or, say, the NCBI classification) that had the specifiers (i.e., in this case it had Archaeopteryx and the sparrow), I could find the least common ancestor (LCA) of those two taxa on the tree. That node would be "Aves", and then I could use the technique described above to find all studies on birds.
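A minimal sketch of that two-step lookup, assuming a reference tree that already carries left/right numbers. The tree, the other species in it, the node labels and the study identifiers (`study_a`, `study_b`) are all hypothetical; only the two specifiers come from the definition quoted above.

```python
# (name, left_id, right_id) for a hypothetical reference tree.
NODES = [
    ("root", 1, 14),
    ("n1", 2, 11),                          # unlabelled internal node
    ("Archaeopteryx lithographica", 3, 4),
    ("n2", 5, 10),                          # unlabelled internal node
    ("Passer domesticus", 6, 7),
    ("Gallus gallus", 8, 9),
    ("Crocodylus niloticus", 12, 13),
]
INTERVAL = {name: (left, right) for name, left, right in NODES}

# Hypothetical studies and the taxa they contain; neither specifier appears.
STUDIES = {
    "study_a": {"Gallus gallus", "Crocodylus niloticus"},
    "study_b": {"Crocodylus niloticus"},
}


def lca(specifiers):
    """Deepest node whose interval encloses the intervals of all specifiers."""
    lo = min(INTERVAL[s][0] for s in specifiers)
    hi = max(INTERVAL[s][1] for s in specifiers)
    enclosing = [n for n, left, right in NODES if left <= lo and right >= hi]
    # Among enclosing nodes, the one with the largest left_id is the deepest.
    return max(enclosing, key=lambda n: INTERVAL[n][0])


def members(node):
    """All names at or below `node` in the reference tree."""
    lo, hi = INTERVAL[node]
    return {n for n, left, right in NODES if left >= lo and right <= hi}


def studies_for(specifiers):
    """Studies containing at least one taxon in the clade defined by the specifiers."""
    clade = members(lca(specifiers))
    return sorted(study for study, taxa in STUDIES.items() if taxa & clade)
```

Even though no study contains Archaeopteryx or the sparrow, `study_a` is found because it contains another member of the clade they define.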

So, I'm underwhelmed. In practice the approach is the same: use a large tree, locate the node from which all your taxa descend, and find studies with those taxa. Using a classification such as NCBI, I have a large tree complete with internal nodes labelled. Hence, to find all studies containing birds I find the node labelled "Aves" and do the query. To use phylogenetic names, I also need a large tree, and I need to look up the LCA of the specifiers, then do the query in the same way. So, in practice the difference is minor, although for phylogenetic names there is the issue of what tree to use. One could argue that the classification approach is ready to go - just grab the NCBI tree.

Now, there are problems of course. For palaeontologists, the nearly complete lack of extinct taxa in the NCBI tree is a problem, because unless a TreeBASE study has at least one extant taxon that has also been sequenced, the approach I've outlined above won't work. But, bottom line, I don't see how in practice we get away from needing a large tree in order to sensibly query TreeBASE. In which case, the Phylocode makes little substantive difference, contra some of David's comments.

Thursday, January 18, 2007

The joys of mapping names in TreeBASE

Here's a fun example of how databases get out of sync, making them harder to link up. TreeBASE taxon T4628 is labelled Bolitoglossa sombra, which doesn't exist in NCBI's taxonomy database, which is odd as the study by Mueller et al. (S1139) is a molecular phylogeny (doi:10.1073/pnas.0405785101), and the taxon concerned has had its whole mitochondrial genome sequenced. In the paper this taxon is listed as "Bolitoglossa sp. nov." and in NCBI's database it is Bolitoglossa n. sp. RLM-2004 (taxid 291262).

So why Bolitoglossa sombra in TreeBASE?
Well, Googling finds Darrel Frost's Amphibian Species of the World page on this species, which lists the name "Bolitoglossa sombra Hanken, Wake, and Savage, 2005, Copeia, 2005: 234.". Googling again finds a PDF of this paper linked to from David Wake's web site, and by Googling on "Copeia" and "BioOne" I get a DOI to the paper (doi:10.1643/CH-04-083R1).

Reading the paper doesn't make me any the wiser [1], until I get the supplementary information for Mueller et al. and discover that Bolitoglossa sp. nov. is specimen MVZ 225875. Searching the PDF of Hanken et al., I find (p. 236):
Three juveniles (MVZ 225875–76, 225878) were generally black but had some obscure whitish patches, which were most evident near the tail base.

So, MVZ 225875 is Bolitoglossa sombra. I confirm this by doing a DiGIR lookup on MVZ 225875 using a script I wrote (doing this is an absolute pain because of the way DiGIR is constructed, and because it doesn't provide resolvable identifiers for specimens). You can view the specimen record directly at MVZ.

[1] Doh! If I'd read the paper properly, I'd have seen that MVZ 225875 is listed as one of the paratypes of Bolitoglossa sombra.

What's your point?
My point is this is a lot of tedium to go through to link up the following items:
  1. TreeBASE taxon
  2. NCBI taxonomic record
  3. NCBI genomic record
  4. Publication of scientific name
  5. Specimen sequenced

Each of these records exists and has an identifier (of varying utility), and in the case of all but the TreeBASE record, there are ways to retrieve metadata about the record in XML format. Yet, these records exist in isolation and haven't been linked, which means I cannot easily connect them just by looking at each database. For example, looking at NCBI's record for Bolitoglossa n. sp. RLM-2004, I have no idea that this amphibian has a phylogeny in TreeBASE, or has been described in the scientific literature and is now called Bolitoglossa sombra.
This is just crazy, and is not that hard to fix if we have globally unique identifiers for digital records that are resolvable, and ways to harvest metadata about those records. For more rants and examples on this theme see SemAnt.
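To show how little is needed once the links exist, here is a minimal sketch that records the connections worked out by hand above as subject/predicate/object triples and then follows them. The identifier prefixes (`treebase:`, `ncbi-taxid:` and so on) and the predicate names are invented for the example and do not correspond to any real vocabulary; the genomic record is left out because its accession isn't given above.

```python
from collections import deque

# Links established by hand in this post; prefixes and predicates are hypothetical.
TRIPLES = [
    ("treebase:T4628", "label", "Bolitoglossa sombra"),
    ("treebase:T4628", "inStudy", "treebase:S1139"),
    ("treebase:S1139", "publishedAs", "doi:10.1073/pnas.0405785101"),
    ("treebase:T4628", "sameTaxonAs", "ncbi-taxid:291262"),
    ("ncbi-taxid:291262", "label", "Bolitoglossa n. sp. RLM-2004"),
    ("ncbi-taxid:291262", "voucher", "specimen:MVZ 225875"),
    ("specimen:MVZ 225875", "paratypeOf", "name:Bolitoglossa sombra"),
    ("name:Bolitoglossa sombra", "publishedIn", "doi:10.1643/CH-04-083R1"),
]


def reachable(start):
    """Everything connected to `start`, following links in either direction."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for subject, _, obj in TRIPLES:
            for here, there in ((subject, obj), (obj, subject)):
                if here == current and there not in seen:
                    seen.add(there)
                    queue.append(there)
    return seen
```

Starting from the NCBI taxon, `reachable("ncbi-taxid:291262")` now includes both the TreeBASE taxon and the DOI of the paper that names the species, which is exactly the connection that took an afternoon of Googling.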

Tuesday, January 16, 2007

A manifesto

The funding of pPOD mentioned earlier today motivates me to write some notes on what I think "core database technologies for enabling the integration of AToL data" could, or indeed, should be about. Much of what follows I've mentioned elsewhere on the iPhylo blog (for example here and related blogs SemAnt and iSpecies) but it seems useful to bring this together here.

It's not about algorithms
CIPRES seems to me to be basically about putting "go faster" stripes on phylogenetic algorithms. I don't mean this to be as dismissive as it sounds; this is intellectually challenging stuff. It's just that projects such as TreeBASE seem adrift in that environment. See my CIPRES talk for more.

It's about integration
Integrating biodiversity information is a hot topic, and I think this is where most of the easy stuff is. Easy in the sense that much of the intellectual work has been done, most of it elsewhere. We have GUIDs (e.g., DOIs and LSIDs), RDF, triple stores, and query languages (e.g., SPARQL). Providing we avoid getting embroiled and/or bogged down in some of the details, I think this stuff is technically pretty straightforward. I've been experimenting on some of these ideas in the context of a project integrating information on ants, see my SemAnt blog for a record of this work. There are also related posts on iSpecies.

It's about new queries
My own view is that much of the work on tree searching in TreeBASE, for example by Jason Wang and collaborators — while interesting — is misplaced. I don't get the sense that biologists are really interested in asking the question "find me trees like this". Rather, I think biologists are really interested in questions such as "find me trees that have x more closely related to y than to z", or "find me trees in which group x is/is not monophyletic". I think these are pattern matching queries, or more fundamentally, I think they are all in essence least common ancestor (LCA) queries. Indeed, once stripped of all the rhetoric about bringing classification into the 21st century (and the nonsense about renaming species), the phylocode boils down to named LCA queries.
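Both example questions can be phrased in terms of LCAs, as in this minimal sketch. The two toy trees and their labels are hypothetical, and each tree is stored as a child -> parent map purely for brevity; a real database would more likely use something like the left/right numbering described elsewhere on this blog.

```python
# Two hypothetical rooted trees: tree1 = (((a,b),c),d), tree2 = (((a,c),b),d).
TREES = {
    "tree1": {"a": "n1", "b": "n1", "n1": "n2", "c": "n2", "n2": "r", "d": "r", "r": None},
    "tree2": {"a": "n1", "c": "n1", "n1": "n2", "b": "n2", "n2": "r", "d": "r", "r": None},
}


def path_to_root(parents, node):
    """Nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parents[node]
    return path


def lca(parents, tips):
    """Least common ancestor of the given tips."""
    tips = list(tips)
    common = path_to_root(parents, tips[0])
    for tip in tips[1:]:
        seen = set(path_to_root(parents, tip))
        common = [n for n in common if n in seen]
    return common[0]


def depth(parents, node):
    """Number of edges between `node` and the root."""
    return len(path_to_root(parents, node)) - 1


def closer(parents, x, y, z):
    """True if x is more closely related to y than to z.

    Both LCAs lie on the path from x to the root, so the deeper one is more recent.
    """
    return depth(parents, lca(parents, [x, y])) > depth(parents, lca(parents, [x, z]))


def is_monophyletic(parents, group):
    """True if the tips below the LCA of `group` are exactly `group`."""
    node = lca(parents, group)
    internal = {p for p in parents.values() if p is not None}
    tips_below = {
        n for n in parents
        if n not in internal and node in path_to_root(parents, n)
    }
    return tips_below == set(group)


def trees_where(predicate):
    """Names of trees for which `predicate(parents)` holds."""
    return [name for name, parents in TREES.items() if predicate(parents)]
```

For instance, `trees_where(lambda t: closer(t, "a", "b", "c"))` picks out the trees in which a and b are sister to the exclusion of c, without any notion of overall tree similarity.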

What we also need are queries that deal with geography and time. Work on interval queries seems relevant here. Ideally we'd move beyond GIS queries to pattern matching geographically-labelled trees (finally providing tools for cladistic biogeography).
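For the time dimension, the simplest form of interval query is an overlap test, sketched below. It assumes each branch of a dated tree is stored with the age of its older and younger end in millions of years before present; the branch names and ages are hypothetical.

```python
# (branch, older_age, younger_age) in millions of years before present; values are made up.
BRANCHES = [
    ("root->n1", 150.0, 90.0),
    ("n1->a", 90.0, 0.0),
    ("n1->b", 90.0, 60.0),
    ("root->c", 150.0, 120.0),
]


def overlapping(older, younger):
    """Branches whose time span overlaps the window from `older` to `younger`.

    Two intervals overlap when each one starts before the other ends;
    intervals that merely touch at an endpoint are counted as overlapping.
    """
    return [
        name
        for name, start, end in BRANCHES
        if start >= younger and end <= older
    ]
```

Asking for lineages present between 100 and 70 million years ago is then `overlapping(100.0, 70.0)`; an interval index would be needed to make this fast on large trees, which this sketch does not attempt.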

It's about branch lengths
Brian O'Meara's comments on this blog reminded me that I'd forgotten about edge (=branch) lengths. Although these are implicit in my discussion of chronograms (see below), as Brian notes:
While systematists may be most interested in relatedness, many biologists will use trees for investigating trait evolution or as ways to control for phylogenetic relatedness (contrasts), and for this they need branch lengths.

Brian also mentions the potential storage issue for Bayesian trees, i.e., storing the results of MCMC runs. If we want to store only topologies, it also seemed to me that there might be some clever ways to store only the difference between successive trees, given that each tree is a perturbation of the previous one (e.g., an NNI). Storing edge lengths complicates this, although they too are related to those in the previous tree. Is there a smart way to store these things, or do we just gzip the tree file and stick it on a server?
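One possible way to store only the differences, sketched here under stated assumptions: each rooted topology is reduced to its set of clades, and each successive sample is stored as the clades removed and added relative to the previous one. The nested-tuple tree format and the function names are my own choices for the example, and edge lengths are ignored entirely.

```python
def clades(tree):
    """Set of clades (frozensets of tip names) with more than one tip.

    `tree` is a nested tuple of tip names, e.g. ((("a", "b"), "c"), "d").
    """
    found = set()

    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(tips(child) for child in node))
        found.add(below)
        return below

    tips(tree)
    return found


def encode(samples):
    """Store the first topology in full, then (removed, added) clades per sample."""
    first = clades(samples[0])
    deltas = []
    previous = first
    for tree in samples[1:]:
        current = clades(tree)
        deltas.append((previous - current, current - previous))
        previous = current
    return first, deltas


def decode(first, deltas):
    """Rebuild the clade set of every sample from the first one plus the deltas."""
    out = [first]
    current = first
    for removed, added in deltas:
        current = (current - removed) | added
        out.append(current)
    return out


# Three hypothetical samples: the second is one NNI from the first,
# and the third repeats the second.
samples = [
    ((("a", "b"), "c"), "d"),
    ((("a", "c"), "b"), "d"),
    ((("a", "c"), "b"), "d"),
]
```

In this example the NNI costs one removed clade and one added clade, and the repeated sample costs two empty sets, so a chain that moves slowly through tree space should compress well. Whether that beats simply gzipping the tree file is an open question that this sketch doesn't answer.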

It's about new visualisations
Bill Piel's work on putting phylogenies in Google Earth, and related work by Daniel Janies et al. (coming out soon in Systematic Biology) show the potential of geographic visualisation.

Earlier on this blog, I noted the parallel between genome browsers and chronograms.

Continuing the theme of visualising phylogenies, one thing which strikes me is the parallel between genome browsers that display annotation "tracks" (such as the UCSC Genome Browser) and illustrations of "chronograms" with geological periods and accompanying data, such as sea levels, isotope levels, etc. In my haste I couldn't find an example with a sea-level track, but I know they exist … In both cases there is a natural co-ordinate system (genome location and time, respectively) going from left to right, and annotations that can be added using the same frame of reference.

Dating phylogenetic trees is currently "hot", but phylogeny databases don't support dated trees.

It's about collaboration
I think there are lots of tools being developed elsewhere, such as Connotea, Flickr, and EditGrid that can be utilised (or used as sources of inspiration). These provide tools for managing bibliographic data, images, and spreadsheets. Let's not reinvent this stuff. For example, Connotea can be integrated with TreeBASE, Flickr can be used to store images with metadata, and EditGrid can be used to create collaborative data matrices, as well as simple annotations. And, speaking of annotations, blogs seem to provide ideal tools for this.
My point here is developing domain-specific tools for this stuff seems to me to be a huge mistake.

In summary, my own perspective is that one way to tackle this problem is to take advantage of the swarm of community-driven, open API, folksonomy-based tools that are flooding the web.

processing PhylOData (pPOD)

Some good news! pPOD, an NSF project on integrating data from AToL (Assembling the Tree of Life) projects, has been funded. Val Tannen is the co-ordinating PI. I'm a consultant, which means more opportunities to mouth off about phylogenetic data and databases (for earlier examples see TreeBASE rocks, TreeBASE talk at CIPRES, and Towards the ToL database - some visions).
The project is called pPOD, and has a wiki. The goals are:

  1. Develop an extensible core data model for phylogenetic data.
    The model will include a query language as well as extensible data structures and will benefit from research on efficiently querying phylogenetic data.

  2. Develop schema mappings for peer-to-peer data integration and exchange, where a project can join existing integration groups by providing mappings between the schema of their data and the core data model or one of its extensions.

  3. Develop a scientific workflow system (lab notebook) that will allow research groups to put together the data integration components with the local database access components and with the analysis tools.

The project is collecting suggestions, experience and, eventually, use cases from the community, and is inviting contributions to its wiki.

Monday, January 08, 2007

chem-bla-ics: Including SMILES, CML and InChI in blogs

Browsing eventually led me to Egon Willighagen's post about Including SMILES, CML and InChI in blogs, which talks about the sort of things I'd like to do in biodiversity informatics. I'm particularly keen on using blogs as annotation tools. One more for the reading list... (see also RDFa).