Wednesday, November 28, 2007

Transitive reduction

Quick note to self, having stumbled on the Wikipedia page on transitive reduction. Given a graph like this:

the transitive reduction is:

Note that the original graph has an edge a -> d, but this is absent after the reduction because we can get from a to d via b (or c).


What's the point? Well, it occurs to me that a quick way of harvesting information about existing taxonomies (e.g., if we want to assemble an all-embracing classification of life to help navigate a database) is to make use of the titles of taxonomic papers. For example, the title Platyprotus phyllosoma, gen. nov., sp. nov., from Enderby Land, Antarctica, an unusual munnopsidid without natatory pereopods (Crustacea: Isopoda: Asellota) gives us:

Crustacea -> Isopoda -> Asellota -> Platyprotus -> phyllosoma

From the paper and/or other sources we can get paths such as Asellota -> Munnopsididae -> Platyprotus and Isopoda -> Munnopsididae. Imagine that we have a set of these paths and want to assemble a classification (for example, we want to grow the Species 2000 classification, which lacks this isopod). Here's the graph:


This clearly gives us information on the classification of the isopod, but it's not a hierarchy. The transitive reduction, however, is:



It would be fun to explore using this technique to mine taxonomic papers and automate the extraction of classifications, as well as names.
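As a minimal sketch of how such a reduction might be computed, assuming the networkx library (the paths below are the fragments harvested above):

import networkx as nx

# classification paths harvested from the title and other sources
paths = [
    ["Crustacea", "Isopoda", "Asellota", "Platyprotus"],
    ["Asellota", "Munnopsididae", "Platyprotus"],
    ["Isopoda", "Munnopsididae"],
]

G = nx.DiGraph()
for path in paths:
    nx.add_path(G, path)  # each consecutive pair is a "contains" edge

# keep only the edges needed to preserve reachability (requires a DAG);
# here Asellota -> Platyprotus and Isopoda -> Munnopsididae drop out
reduction = nx.transitive_reduction(G)
print(sorted(reduction.edges()))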

Tuesday, November 20, 2007

Interview

Paulo Nuin recently interviewed me for his Blind.Scientist blog. The interview is part of his SciView series.

Friday, November 16, 2007

Thesis online


One side effect of the trend towards digitising everything is that stuff one forgot about (or, perhaps, would like to forget about) comes back to haunt you. My alma mater, the University of Auckland, is digitising theses, and my PhD thesis "Panbiogeography: a cladistic approach" is now online (hdl:2292/1999). Here's the abstract:

This thesis develops a quantitative cladistic approach to panbiogeography. Algorithms for constructing and comparing area cladograms are developed and implemented in a computer program. Examples of the use of this software are described. The principle results of this thesis are: (1) The description of algorithms for implementing Nelson and Platnick's (1981) methods for constructing area cladograms. These algorithms have been incorporated into a computer program. (2) Zandee and Roos' (1987) methods based on "component-compatibility" are shown to be flawed. (3) Recent criticisms of Nelson and Platnick's methods by E. O. Wiley are rebutted. (4) A quantitative reanalysis of Hafner and Nadler's (1988) allozyme data for gophers and their parasitic lice illustrates the utility of information on timing of speciation events in interpreting apparent incongruence between host and parasite cladograms. In addition the thesis contains a survey of some current themes in biogeography, a reply to criticisms of my earlier work on track analysis, and an application of bootstrap and consensus methods to place confidence limits on estimates of cladograms.

1990. Ah, happy days...

Thursday, November 15, 2007

Phyloinformatics workshop online


Slides from the recent Phyloinformatics workshop in Edinburgh are now online at the e-Science Institute. In case the e-Science Institute site disappears I've posted the slides on slideshare.


Heiko Schmidt has also posted some photos of the proceedings, demonstrating how distraught the participants were that I couldn't make it.

Thursday, November 08, 2007

GBIF data evaluation


Interesting paper in PLoS ONE (doi:10.1371/journal.pone.0001124) on the quality of data housed in GBIF. The study looked at 630,871 georeferenced legume records in GBIF, and concluded that 84% of these records are valid. As an example of records that aren't, below is a map of legumes placed in the sea (there are no marine legumes).

Although the abstract warns of the dire consequences of data deficiencies, the conclusions make for interesting reading:

The GBIF point data are largely correct: 84% passed our conservative criteria. A serious problem is the uneven coverage of both species and areas in these data. It is possible to retrieve large numbers of accurate data points, but without appropriate adjustment these will give a misleading view of biodiversity patterns. Coverage associates negatively with species richness. There is a need to focus on databasing mega-diverse countries and biodiversity hotspots if we are to gain a balanced picture of global biodiversity. A major challenge for GBIF in the immediate future is a political one: to negotiate access to the several substantial biodiversity databases that are not yet publicly and freely available to the global science community. GBIF has taken substantial steps to achieve its goals for primary data provision, but support is needed to encourage more data providers to digitise and supply their records.

Wednesday, October 31, 2007

Amber spider

Really just a shameless attempt to get one over on David Shorthouse, but there has been some buzz about Very High Resolution X-Ray Computed Tomography (VHR-CT) of a fossil of Cenotextricella simon.


The paper describing the work is in Zootaxa (link here). Zootaxa is doing great things for taxonomic publishing, but they really need to set up some sort of stable identifier. Linking to Zootaxa articles is not straightforward. If they had DOIs (or even OpenURL access) it would be much easier for people to convert lists of papers that include Zootaxa publications into lists of resolvable links.

Sunday, October 28, 2007

Universal Serial Item Names

Following on from the discussion of BHL and DOIs, I stumbled across some remarkable work by Robert Cameron at SFU. Cameron has developed Universal Serial Item Names (USIN). The approach is spelled out in detail in Towards Universal Serial Item Names (also on Scribd). This lengthy document deals with how to develop user-friendly identifiers for journal articles, books, and other documents. The solution looks less baroque than SICIs, which I've discussed earlier.

There is also a web site (USIN.org), complete with examples and source code. Identifiers for books are straightforward, for instance bibp:ISBN/0-86542-889-1 identifies a certain book:

For journals things are slightly more complicated. However, Cameron simplified things a little in his subsequent paper Scholar-Friendly DOI Suffixes with JACC: Journal Article Citation Convention (also on Scribd).
JACC (Journal Article Citation Convention) is proposed as an alternative to SICI (Serial Item and Contribution Identifier) as a convention for specifying journal articles in DOI (Digital Object Identifier) suffixes. JACC is intended to provide a very simple tool for scholars to easily create Web links to DOIs and to also support interoperability between legacy article citation systems and DOI-based services. The simplicity of JACC in comparison to SICI should be a boon both to the scholar and to the implementor of DOI responders.

USIN and JACC use the minimal number of elements needed to identify an article, such as journal code (e.g., ISSN or an accepted acronym), volume number, and starting page. Using ISSNs ensures globally unique identifiers for journals, but the scheme can also use acronyms, hence those journals that lack ISSNs could be catered for. The scheme is simple, and in many cases will provide the bare minimum of information necessary to locate an item via an OpenURL resolver. Indeed, one simple way to implement USIN identifiers would be to have a service that takes URIs of the form <journal-code>:<volume>@<page> and resolves them behind the scenes using OpenURL. Hence we get simple identifiers that are resolvable, without the baroque approach of SICIs.
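Here's a minimal sketch of the core of such a service, assuming the <journal-code>:<volume>@<page> syntax described above; the resolver base URL is a placeholder and the ISSN in the example is made up:

import re
from urllib.parse import urlencode

RESOLVER = "http://example.org/openurl"  # placeholder OpenURL resolver

def usin_to_openurl(identifier):
    # e.g. "1234-5678:418@81" (ISSN form) or "hydrobiologia:418@81" (acronym form)
    match = re.match(r"^(?P<journal>[^:]+):(?P<volume>[^@]+)@(?P<page>.+)$", identifier)
    if not match:
        raise ValueError("not a <journal-code>:<volume>@<page> identifier")
    journal, volume, page = match.group("journal", "volume", "page")
    params = {"genre": "article", "volume": volume, "spage": page}
    # ISSNs look like 1234-5678; anything else is treated as a journal acronym
    if re.match(r"^\d{4}-\d{3}[\dXx]$", journal):
        params["issn"] = journal
    else:
        params["title"] = journal
    return RESOLVER + "?" + urlencode(params)

print(usin_to_openurl("1234-5678:418@81"))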

When I get the chance I may add support for something like this to bioGUID.

Saturday, October 27, 2007

Taxonomy is dead, long live taxonomy


No, not taxonomy the discipline (although I've given a talk asking this question), but taxonomy.zoology.gla.ac.uk, my long-running web server hosting such venerable software projects as TreeView, NDE, and GeneTree, along with my home page.

A series of power cuts in my building while I was away finally did for my ancient Sun Sparcstation5, running the CERN web server (yes, it's that old). I can remember the thrill (mixed with mild terror) of taking delivery of the Sparcstation and having to manually assemble it (the CD ROM and floppy drives came separately), and the painful introduction to the Unix command line. The joy of getting a web server to run (way back in late 1995), followed by Samba, AppleTalk, and CVS.

For the time being a backup copy of the documents and software hosted on the Sparcstation is being served from a Mac. The only tricky thing was setting up the CVS server that I use for version control for my projects. Yes, I know CVS is also ancient, and that Linus Torvalds will think me a moron, but for now it's what I use. CVS comes with Apple's developer tools, but I wanted to set up remote access. I found Daniel Côté's article Setting up a CVS server on Mac OS X and the Mac OS X Hints article Enable CVS pserver on 10.2 helpful. Basically I initialised a new CVS repository, then copied across the backed-up repository from a DVD. I then replaced some files in CVSROOT that listed things like the modules in the repository and notifications sent when code is committed. Getting the pserver up and running required some work. I created a file called cvspserver inside /etc/xinetd.d/, with the following contents.

service cvspserver
{
disable = no
socket_type = stream
wait = no
user = root
server = /usr/bin/cvs
server_args = -f --allow-root=/usr/local/CVS pserver
groups = yes
flags = REUSE
}

Then I started the service:
sudo /sbin/service cvspserver start

So far, so good, but I couldn't log in to CVS. Discovered that this is because Mac OS X uses ShadowHash authentication_authority. Hence, on a Mac CVS won't use the system user names and passwords (probably a good thing). Therefore, we uncomment the line
# Set this to "no" if pserver shouldn't check system users/passwords
SystemAuth=no

in the file CVSROOT/config, then create a file CVSROOT/passwd. This file contains the username, hashed password, and the actual Mac OS X username (nicely explained in Daniel Côté's article). To generate a hashed password, do this:

darwin: openssl passwd
Password: 123
Verifying - Password: 123
yrp85EUNQl01E
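The corresponding entry in CVSROOT/passwd then has the form cvs-username:hashed-password:osx-account, so using the hash generated above (the user names here are made up) it would look something like:

rod:yrp85EUNQl01E:rpage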

At last it all seems to work, and I can get back to coding. This is about as geeky as this blog gets, but if you want a real geek overload, spend some time listening to this talk by Linus Torvalds.

Thursday, October 25, 2007

BHL and DOIs

In a series of emails Chris Freeland, David Shorthouse, and I have been discussing DOIs in the context of the Biodiversity Heritage Library (BHL). I thought it worthwhile to capture some thoughts here.
In an email Chris wrote:
Sure, DOIs have been around for a while, but how many nomenclators or species databases record them? Few, from what I've seen - instead they record citations in traditional text form. I'm trying to find the middle ground between guys like the two of you, who want machine-readable lit (RDF), and most everyone else I talk with, including regular users of Botanicus & BHL, who want human-readable lit (PDF). I'm not overstating - it really does break down into these 2 camps (for now), with much more weight over on the PDF side (again, for now).

I think the perception that there are two "camps" is unfortunate. I guess for a working taxonomist, it would be great if for a given taxonomic name there was a way to see the original publication of the name, even if it is simply a bitmap image (such as a JPEG). Hence, a database that links names to images of text would be a useful resource. If this is what BHL is aiming for, then I agree, DOIs may seem to be of little use, apart from being one way to address the issue of persistent identifiers.

But it seems to me that there are lots of tasks for which DOIs (or more precisely, the infrastructure underlying them) can help. For example, given a bibliographic citation such as

Fiers, F. and T. M. Iliffe (2000) Nitocrellopsis texana n. sp. from central TX (U.S.A.) and N. ahaggarensis n. sp. from the central Algerian Sahara (Copepoda, Harpacticoida). Hydrobiologia, 418:81-97.

how do I find a digital version of this article? Given this citation

Fiers, F. & T. M. Iliffe (2000). Hydrobiologia, 418:81.

how do I decide that this is the same article? If I want to see whether somebody has cited this paper (and perhaps changed the name of the copepod), how do I do that? If I want to follow up the references in this paper, how do I do that?

These are the kinds of thing that DOIs address. This article has the DOI doi:10.1023/A:1003892200897. This gives me a globally unique identifier for the article. The DOI foundation provides a resolver whereby I can go to a site that will provide me with access (albeit possibly for a fee) to the article. CrossRef provides an OpenURL service whereby I can
  • Retrieve metadata about an article given its DOI

  • Search for a DOI given metadata
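As a rough illustration of the first of these, here is a minimal sketch of fetching metadata for a DOI via CrossRef's OpenURL interface. The endpoint and parameter names are from memory, and CrossRef expects a registered e-mail address, so treat the details as assumptions rather than a recipe:

from urllib.parse import urlencode
from urllib.request import urlopen

def crossref_metadata(doi, email):
    params = {
        "pid": email,            # registered CrossRef account (assumption)
        "id": "doi:" + doi,
        "noredirect": "true",    # return XML metadata rather than redirecting
    }
    url = "http://www.crossref.org/openurl/?" + urlencode(params)
    with urlopen(url) as response:
        return response.read()   # XML describing the article

# e.g. crossref_metadata("10.1023/A:1003892200897", "you@example.org")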

To an end user much of this is irrelevant, but to people building the links between taxonomic names and taxonomic literature, these are pressing issues. Previously I've given some examples where taxonomic databases such as the Catalogue of Life and ITIS store only text citations, not identifiers (such as DOIs or Handles). As a result, the user has to search for each paper "by hand". Surely in an ideal world there would be a link to the publication? If so, how do we get there? How do IPNI, Index Fungorum, ITIS, Sp2000, ZooBank, and so on link their names and references to digitised content? This is where a CrossRef-style infrastructure comes in.

Publishers "get this". Given the nature of the web where users expect to be able follow links, CrossRef deals with the issue of converting the literature cited section of a paper into a set of clickable links. Don't we want the same thing for our databases of taxonomic names? And, don't we want this for our taxonomic literature?

It is worth noting that the perception that DOIs only cover modern literature is erroneous. For example, here's the description of Megalania prisca Owen (doi:10.1098/rstl.1859.0002), which was published in 1859. The Royal Society of London has DOIs for articles published in the 18th century.

If the Royal Society can do this, why can't BHL?

Monday, October 22, 2007

Phyloinformatics workshop - primal scream

Argh!!! The phyloinformatics workshop at Edinburgh's eScience Centre is underway (program of talks available here as an iCalendar file), and I'm stranded in Germany for personal reasons I won't bore readers with. The best and brightest gather less than an hour from my home town to talk about one of my favourite subjects, and I can't be there. Talk about frustration!

How can they possibly proceed without yours truly to interject "it sucks" at regular intervals? What, things are going just fine? Next, you'll be suggesting that Systematic Biology can function without me as editor … wait, what's that you say? Jack's running the show without a hitch … gack, I'm redundant.

Monday, October 15, 2007

Getting into Nature ... sort of


The kind people at Nature have taken pity on my rapidly fading research career, and have highlighted my note "Towards a Taxonomically Intelligent Phylogenetic Database" in Nature Precedings (doi:10.1038/npre.2007.1028.1) on the Nature web site. Frankly this is probably the only way I'll be getting into Nature...

Sunday, October 14, 2007

Pygmybrowse GBIF classification

Here is a live demo of Pygmybrowse using the Catalogue of Life classification of animals provided by GBIF. It's embedded in this post in an <iframe> tag, so you can play with it. Just click on a node.

Taxa in bold have ten or more children; the number of children is displayed in parentheses "()". Each subtree is fetched on the fly from GBIF.

Friday, October 12, 2007

Pygmybrowse revisited


As yet another example of avoiding what I should really be doing, a quick note about a reworked version of PygmyBrowse (see earlier posts here and here). Last September I put together a working demo written in PHP. I've now rewritten it entirely in Javascript, apart from a PHP script that returns information about a node in a classification. For example, this link returns details about the Animalia in ITIS.

You can view the new version live. Choosing "view source" in your browser will show you the code. It's mostly a matter of Javascript and CSS, with some AJAX thrown in (based on the article Dynamic HTML and XML: the XMLHttpRequest object on Apple's ADC web site).

One advantage of making it entirely in Javascript is that it can be easily integrated into web sites that don't use PHP. As an example, David Shorthouse has inserted a version into the species pages in The Nearctic Spider Database (for example, visit the page for Agelena labyrinthica and click on the little "Browse tree" link).

Thursday, October 11, 2007

Frog deja vu

While using iSpecies to explore some names (e.g., Sooglossus sechellensis in the Frost et al. amphibian tree mentioned in the last post), I stumbled across two papers that both described a new genus of frog for the same two taxa in the Seychelles. The papers (doi:10.1111/j.1095-8312.2007.00800.x and www.connotea.org/uri/6567cfd7531a77588ee62d78e7b4359b) were published within a couple of months of each other.

Was about to blog this, when I discovered that Christopher Taylor had beaten me to it with his post Sooglossidae: Deja vu all over again. Amongst the commentary on this post is a note by Darren Naish (now here) pointing to an interesting article by Jerald D. Harris entitled ‘Published Works’ in the electronic age: recommended amendments to Articles 8 and 9 of the Code, in which he states:
I propose to the Commission that, under Article 78.3 (‘Amendments to the Code’), Articles 8 and 9 of the current Code require both pro- and retroactive (to the effective date of the Fourth Edition, 1 January 2000) modification to accommodate the following issue: documents published electronically with DOI numbers and that are followed by hard-copy printing and distribution be exempt from Article 9.8 and be recognized as valid, citable sources of zoological taxonomic information and that their electronic publication dates be considered definitive.

It's an interesting read.

Visualising very big trees Part VI

I've tidied up the big phylogeny viewer mentioned earlier, and added a simple web form for anybody interested to upload a NEXUS or Newick tree and have a play.

Examples:


To create your own tree viewer, simply go to http://linnaeus.zoology.gla.ac.uk/~rpage/bigtrees/tv2/ and upload a tree. After some debugging output and images scroll past, a link to the widget appears at the bottom of the page. I'll tidy this all up when I get the chance, but for now it's good enough to play with.

Thursday, October 04, 2007

processing PhylOData (pPOD)


The first pPOD workshop happened last month at NESCent, and some of the presentations are online on the pPOD Wiki. Although I'm a "consultant" I couldn't be there, which is a pity because it looks to have been an interesting meeting. When pPOD was first announced I blogged some of my own thoughts on phylogenetics databases. The first meeting had lots of interesting stuff on workflows and data integration, as well as outlining the problems faced by large-scale systematics. Some relevant links (partly here as a reminder to myself to explore these further):

Thursday, September 27, 2007

Mesquite does Google Earth files


The latest version of David and Wayne Maddison's Cartographer module for their program Mesquite can export KML files for Google Earth. They graciously acknowledge my crude efforts in this direction, and Bill Piel's work -- he really started this whole thing rolling.

So, those of you inspired to try your hand at Google Earth trees, and who were frustrated by the lack of tools should grab a copy of Mesquite and take it for a spin.

Wednesday, September 19, 2007

Parallels


Quick note to say how much fun it is to use Parallels Desktop. It's a great advantage to have Windows XP and Fedora Core 7 running on my Mac. As much as I dislike Internet Explorer, it caught some bugs in my code. It's always useful to try different environments when debugging code, whether standalone or for the Web.

Tuesday, September 18, 2007

Nature Precedings


Nature Precedings is a pre-publication server launched by Nature a few months ago. To quote from the website:
Nature Precedings is a place for researchers to share pre-publication research, unpublished manuscripts, presentations, posters, white papers, technical papers, supplementary findings, and other scientific documents. Submissions are screened by our professional curation team for relevance and quality, but are not subjected to peer review. We welcome high-quality contributions from biology, medicine (except clinical trials), chemistry and the earth sciences.

Unable to resist, I've uploaded three manuscripts previously languishing as "Technical Reports" on my old server. The three I uploaded now have bright shiny DOIs, which may take a little while to register with CrossRef. The manuscripts are:

Treemap Versus BPA (Again): A Response to Dowling doi:10.1038/npre.2007.1030.1 (a response to a critique of my ancient TreeMap program).

On The Dangers Of Aligning RNA Sequences Using "Conserved" Motifs doi:10.1038/npre.2007.1029.1 (a short note on Hickson et al.'s (2000) use of conserved motifs to evaluate RNA alignment).

Towards a Taxonomically Intelligent Phylogenetic Database doi:10.1038/npre.2007.1028.1 (a paper written for DBiBD 2005, basically a rewrite of a grant proposal).

All three are under the evolution and ecology subject heading. Visitors to Nature Precedings can comment on papers, and vote for which ones they like. The fact that I've uploaded some manuscripts probably says nothing good about me, but I'll be checking every so often to see if anybody has anything to say...

Tuesday, September 11, 2007

Matching names in phylogeny data files

In an earlier post I described the TBMap database (doi:10.1186/1471-2105-8-158), which contains a mapping of TreeBASE taxon names onto names in other databases. While this is one step towards making it easier to query TreeBASE, what I'd really like is to link the data in TreeBASE to sources such as GenBank and specimen databases. Part of the challenge of doing this (and also doing it more generally, such as taking a NEXUS file from the web and using that) is that the names people use in a NEXUS data file are often not the actual taxonomic names. So, if I take a table from a paper that lists GenBank accession numbers, voucher specimens, etc., I'm left with the problem of matching two sets of names -- those in the data file to those in the table.

For example, take the TreeBASE taxon Apomys_sp._B_145699. Using a script I grabbed the sequences in this study and constructed a table listing the name and specimen voucher for each sequence. The name corresponding to the TreeBASE taxon is Apomys "sibuyan b" FMNH 145699. Clearly a similar string, but not the same.

The approach I've taken is to compare strings using the longest common subsequence algorithm. Below are the two strings, with their longest common subsequence highlighted in green.


Apomys_sp._B_145699

Apomys "sibuyan b" FMNH 145699

The length of this subsequence is used to compute a measure of distance between the two strings. If len1 is the length of the first string, len2 is the length of the second string, and lcs is the length of the longest common subsequence, then

d = (len1 + len2) - 2 × lcs

We can normalise this by dividing by len1 + len2, so that d ranges from 0 (identical) to 1.0 (no similarity).
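Here's a minimal sketch of this distance (standard dynamic-programming LCS; nothing here is specific to the actual implementation used):

def lcs_length(a, b):
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def name_distance(a, b):
    lcs = lcs_length(a, b)
    d = (len(a) + len(b)) - 2 * lcs
    return d / (len(a) + len(b))  # 0 = identical, 1.0 = no characters in common

print(name_distance("Apomys_sp._B_145699", 'Apomys "sibuyan b" FMNH 145699'))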

So, now that we have a measure of how similar the strings are, I still need to find the set of matchings between file and table names. This can be modelled as a maximum weight bipartite matching problem. Given a bipartite graph, we want to find the matching with the highest weight. A "matching" is where each node is connected to at most one other node. For example, given this graph:


a maximum weight matching is:



In this example, four pairs of nodes are matched and one node is "orphaned". Applying this to the data matching problem, I take a list of names from the NEXUS file, and a list of names from the paper (or supplementary data file, etc.), and compute a maximum weight matching. Because I'm looking for the maximum weighted matching I actually want similarity of names, hence I subtract the normalised d from 1.0.

So, the algorithm consists of taking the two lists of names (taxa from the dataset and names from a table), computing the distance between all pairs of names, then obtaining the maximum weight bipartite matching. Because for large sets of names the n × m bipartite graph becomes huge, and because in practice most matches are poor, for each node in the graph I only add the edges corresponding to the five most similar pairs of strings.
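As a sketch of the matching step, assuming the networkx library (the five-best-edges pruning is as described above, and name_distance is the LCS-based distance sketched earlier):

import networkx as nx

def match_names(taxa, table_names, similarity):
    G = nx.Graph()
    for taxon in taxa:
        # keep only the five most similar table names for this taxon
        best = sorted(table_names, key=lambda n: similarity(taxon, n), reverse=True)[:5]
        for name in best:
            G.add_edge(("taxon", taxon), ("table", name),
                       weight=similarity(taxon, name))
    # maximum weight matching: each node ends up paired with at most one other node
    matching = nx.max_weight_matching(G)
    return [(u[1], v[1]) if u[0] == "taxon" else (v[1], u[1]) for u, v in matching]

# similarity could be, for example: lambda a, b: 1.0 - name_distance(a, b)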

Thursday, August 30, 2007

RAxML BlackBox

Alexis Stamatakis and Jacques Rougemont have released RAxML BlackBox, a prototype Web-Server for RAxML which is attached to a 200 CPU-cluster located at the Vital-IT unit of the Swiss Institute of Bioinformatics. You can upload your data and the cluster will mull it over for up to 24 hours, so typically you can analyse alignments of up to 1,000 to 1,500 sequences at present. The software itself can handle bigger data sets (see the RAxML page at CIPRES).

Alexis has also done some work on phylogeny visualisation, including using treemaps. See doi:10.1007/11573067_29 (PDF here).

Monday, August 20, 2007

Internet Explorer -- argh!!!!!


I would prefer to avoid Microsoft-bashing, but today I've spent time trying to get my tree viewer to work under Internet Explorer 6 and 7, and it's hell. Here are the problems I've had to deal with:

Empty DIV bug
On IE 6 the top of the scrollbar overlapped the transparent area when the page first loads. Eventually discovered that this is a bug in IE. It gives empty DIVs a height approximately equal to the font-height for the DIV, even if the DIV has height:0px; (see here for a discussion). I set the CSS for this DIV to overflow:hidden;, and the DIV now behaves.

Opacity
The viewer makes use of opacity, that is, having DIVs that are coloured, but which you can see through. This enables me to add layers over the top of an image. IE doesn't support the standard way of doing this, so styles such as opacity:0.5; must also be written as filter:alpha(opacity=50); (thanks to David Shorthouse for pointing this out to me).

Background transparency
The DIV overlaying the big tree has background-color:transparent;, which means it refuses to accept any mouse clicks on the big tree. Changing the color to anything else meant the DIV received the clicks, so I ended up using a fairly ugly hack to include Internet Explorer-specific CSS for this DIV (idea borrowed from How to Use Different CSS Style Sheets For Different Browsers (and How to Hide CSS Code from Older Browsers)).

z-index bug
The final show stopper was the auto-complete drop down list of taxon names. On IE it disappeared behind the big tree. This is the infamous z-index bug. The drop down menu is a DIV created on the fly, and although its z-index value (99) means it should be placed on top of the tree (so the user can see the list of taxa), it isn't. After some Googling I settled on the hack of setting the z-index for the DIV containing the big tree to -1 in IE only, and this seems to work.

IE is sometimes good
It has to be said that sometimes IE has its good points. The tree viewer failed in part because I'd failed to define a Javascript variable. Somehow FireFox and Safari were OK with this, but IE 6 broke. I defined the variable and it worked. I've also learnt to avoid some variable names, such as scroll. I find FireFox to be a better browser for developing stuff, especially if the Firebug extension is installed. However, the Internet Explorer Developer Toolbar is useful if you need to figure out what IE is doing.

Conclusion
It's staggering how much time one can waste trying to cater to the weird and wonderful ways of Internet Explorer. However, the tree viewer should now work for those of you running Internet Explorer.

Saturday, August 18, 2007

Visualising very big trees, Part V

Inspired partly by the image viewers mentioned earlier, and tools like Google Finance's plot of stock prices, I've built yet another demo of one way to view large trees.


You can view the demo here. On the left is a thumbnail of the tree, on the right is the tree displayed "full scale", that is, you can read the labels of every leaf. In the middle appears a subset of any internal node labels. Top right is a text box in which you can search for a taxon in the tree.

You can navigate by dragging the scroll bar on the left, dragging the big tree, or using the mouse wheel (and you can jump to a taxon by name). It has been "tested" in Safari and Firefox on a Mac, I doubt it works on Internet Explorer. Getting that to happen is a whole other project.

The viewer is written entirely in HTML and Javascript; the underlying tree images (and some of the HTML and Javascript) are generated using a C++ program that reads and draws trees, and I use ImageMagick to generate the actual images.

Friday, August 17, 2007

Bird supertree project - "Open Source" phylogenetics


Black Browed Albatross
Originally uploaded by QuestingBeast
Today is the day Katie Davis and I are launching the Bird Supertree Project. Partly an effort to distribute the task of building the tree, partly an experiment in "open source phylogenetics", we're curious (if not anxious) to see how this works out. We encourage anybody who is interested in constructing big trees to visit the site, grab the data and have a play. You can upload your results (and see who is the best tree builder), and view the trees using one of the methods for viewing big trees that I mentioned earlier. I'm frantically trying to improve this viewer, so the more trees we get the greater the incentive for me to improve it.

Wednesday, August 15, 2007

Expand-Ahead: A Space-Filling Strategy for Browsing Trees

For the "to do" list, expand-ahead browsing looks like a useful approach to build upon PygmyBrowse (see my live demo). The approach is described in "Expand-Ahead: A Space-Filling Strategy for Browsing Trees" by McGuffin et al. (doi:10.1109/INFOVIS.2004.21, PDF also here).

There is a video on Ravin Balakrishnan's site, which is an AVI file that I haven't been able to coerce my Mac into playing, hence I've posted it on YouTube.

Saturday, August 11, 2007

Visualising very big trees, Part IV


Continuing the theme of viewing big trees, another approach to viewing large objects is tiling, which most people will have encountered if they've used Google Maps. The idea is to slice a large image into many smaller pieces ("tiles") at different resolutions, and display only those tiles needed to show the view the user is interested in. I'd thought about doing this for trees but abandoned it. However, I think it is worth revisiting, based on discussion on the Nature Network Bioinformatics Forum, and looking at the Giant-Ass Image Viewer (version 2 is here), and Marc Paschal's blog.

As an example of what could be done, below is a phylogeny from Frost et al.'s "The amphibian tree of life" hdl:2246/5781, rendered using Zoomify Express. I just took a GIF I'd made of the entire tree, dropped it on the Zoomify Express icon, hacked some HTML, and got this:





Now, I don't think Zoomify itself is the answer, because what I'd like is to constrain the navigation to be in one dimension, to have a clearer sense of where I am in the tree, and to have a search function to locate nodes of interest. However, this approach seems worth having a look at. Looks like I'll need to learn a lot more Javascript...

Saturday, August 04, 2007

Visualising very big trees, Part III

I've refined my first efforts to now highlight where you are in the tree. The trees on display here now show the new look.

Basically I've abandoned image maps as they don't allow me to highlight the part of the tree being selected. After some fussing I switched to using HTML DIVs, which sit on top of the image. This took a little while to get working; CSS and DIV placement drive me nuts. The trick is to give each DIV the style position:absolute;, and (this is important) make sure that the DIV is written as <div ...></div>, not <div .../>.

The trees now show a pale blue highlight when you mouse over an area you can click on, and if the corresponding subtree has an internal node label, that label is also highlighted. In the same way, if you mouse over the region on the right that corresponds to a labelled internal node, both the label and the subtree are highlighted. I think this helps make it clear what parts you are selecting, and gives you the option of selecting using a name, rather than clicking on part of the tree.

Friday, August 03, 2007

Visualising very big trees, Part II

OK, time to put my money where my mouth is. Here's a first stab at displaying big trees in a browser. Not terribly sophisticated, but reasonably fast. Take a look at Big Trees.

Approach
Given a tree I simply draw it in a predetermined area (in these examples 400 x 600 pixels). If there are more leaves than can be drawn without overlapping I simply cull the leaf labels. If there are internal node labels I draw vertical lines corresponding to the span of the corresponding subtree, which is simply the range between the left-most and right-most descendants of that node. If internal node labels are nested (e.g., "Mammalia" and "Primates") I draw the most recent internal node label, the rationale being that I want only a single set of vertical bars. This gives the visual effect of partitioning up the leaves into non-overlapping sets. This gives us a diagram like this:


OK, but what about all the nodes we can't see? What I do here is make the tree "clickable" in the following way. If there are internal node labels I make the corresponding tree clickable. I also traverse the tree looking for well-defined clusters -- basically subtrees that are isolated by a long branch from their nearest neighbours -- and make these clickable. This approach is partly a hangover from earlier experiments on automatically folding a tree (partly inspired by doi:10.1111/1467-8659.00235). The key point is I'm trying to avoid testing for mouse clicks on nodes and edges, as many of these will be occluded by other nodes and edges, and it will also be expensive to do hit testing on nodes and edges in a big tree.

If you click on one, the script extracts the subtree and reloads the display showing just that part of the tree, using exactly the same approach as above. Behind the scenes the code is doing a least common ancestor (LCA) query, hence it defines subtrees rather like the PhyloCode does (oh the irony).
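A minimal sketch of that LCA query, assuming each node object simply records its parent (None for the root); this is generic tree code, not the actual script behind the viewer:

def lca(nodes):
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = node.parent
        return path

    # ancestors shared by every node in the set
    common = set(ancestors(nodes[0]))
    for node in nodes[1:]:
        common &= set(ancestors(node))
    # walking up from any one node, the first shared ancestor met is the LCA,
    # and it roots the subtree that gets displayed
    for node in ancestors(nodes[0]):
        if node in common:
            return node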

Pros
  • Reasonably fast (everything you see is done live "on the fly").
  • Works in any modern browser, no dependence on plugins or technology that has limited support.
  • Image is clear, text is small but legible.
  • Entirely automated layout


Cons
  • Reloading a new page is costly in terms of time, and potentially disorienting (you lose your sense of the larger tree).
  • It is not obvious where to click on the tree (needs to be highlighted).
  • Text is not clickable. This would be really useful for internal node labels.

Thursday, August 02, 2007

Viewing very large trees


One of the striking pictures in Tamara Munzner et al.'s paper "TreeJuxtaposer: Scalable Tree Comparison using Focus+Context with Guaranteed Visibility" (doi:10.1145/882262.882291, also available here) is that of a biologist struggling to visualise a large phylogeny. The figure caption states that:
Biologists faced with inadequate tools for comparing large trees have fallen back on paper, tape, and highlighter pens.

I've been struggling with this problem in the context of displaying trees on a web page (see an earlier post). Viewing large trees has received a lot of attention, and there are some fun tools such as Tamara Munzner's TreeJuxtaposer and Mike Sanderson's Paloverde (doi:10.1093/bioinformatics/btl044), which was used to create the cover for the October 2006 issue of Systematic Biology. And let's not forget Google Earth.



The problem with standalone tools like these is that they are just that - standalone. They are meant to support interactive visualisation in an application, not viewing a tree on a web page. This is a particular problem facing TreeBASE. A user wanting to view, say, the marsupial supertree published by Cardillo et al. (doi:10.1017/S0952836904005539, TreeBASE study S1035) is greeted by the message:
This tree is too large to be seen using the usual GUI. We recommend that you view the tree using the java applet ATV or the program TreeView (see below). Alternatively, you can download the data matrix and view the tree(s) in MacClade, PAUP, or any other nexus-compatible software.
and the tree is displayed as a Newick text string:
(((((((((((Abacetus, ((((Agonum, Glyptolenus), Europhilus, Tanystoma, Platynus), ((Morion, Moriosomus), Stenocrepis)), (((Licinus, Zargus, Badister), (Panagaeus, Tefflus)), Melanchiton)), ((Amara, Zabrus), (Harpalus, Dicheirotrichus, Parophonus, Trichocellus, Ophonus, Trichotichnus, Diachromus, Pseudoophonus, Stenolophus, Notobia, Bradycellus, Nesacinopus, Anisodactylus, Acupalpus, Acinopus, Xestonotus)), ((Anthia, Thermophilum), ((Corsyra, Discoptera), Graphipterus)), (((Apenes, (Chlaenius, Callistus)), Oodes), ((Calophaena, ((Ctenodactyla, Leptotrachelus), Galerita)), (Pseudaptinus, Zuphius))), (((Calleida, Hyboptera), Lebia), Cymindis, Demetrias, Dromius, Lionychus, Microlestes, Syntomus), ((Calybe, Lachnophorus), Odacantha), (Catapiesis, (Desera, Drypta)), Cnemalobus, (Coelostomus, (Eripus, Pelecium)), ...


Not the most compelling visualisation. What I hope to do in this and following posts is describe my own efforts to come to grips with this problem.

Requirements
To put the problem into perspective, what I'm looking for is a simple way to draw large trees for display in a web browser. This places severe limits on the kind of interactivity that is possible (unless we go down the route of Java applets, which I will avoid like the plague). This rules out, for example, trying to emulate TreeJuxtaposer's functionality. Initially I started looking at SVG, which renders graphics nicely, supports interaction, and, being essentially an XML file, is easy to manipulate (for an example see my earlier post on SVG maps). However, SVG is not well supported in all browsers (FireFox does pretty well, most other browsers are variable). All browsers, however, support bitmap graphics (GIF, PNG, JPEG, etc.). When drawing complex things like trees bitmaps have some advantages, especially with regards to labelling. Small bitmap fonts tend to be more legible than anti-aliased fonts at the same size (see the article at MiniFonts for background).

Animation
Comments so far on this post have focussed on animation (e.g., using Flash). Here is a video of TreeJuxtaposer taken from Tamara Munzner's web site.

For me the most interesting features of TreeJuxtaposer are that the entire tree is always visible (thus retaining context, unlike pan and zoom), and the user can select bits to view by drawing a rectangle on the screen. The processing to compute the transformations needed for large trees is fairly heavy duty, although newer algorithms have reduced this somewhat (see here).

Tuesday, July 24, 2007

I hate computers


The PC hosting linnaeus.zoology.gla.ac.uk and darwin.zoology.gla.ac.uk has died, and this spells the end of my interest in (a) using generic PC hardware and (b) running Linux. The former keeps breaking down, the latter is just harder than it needs to be (much as I like the idea). From now on, it's Macs only. No more geeky knapsacks for me.

Because of this crash a lot of my experimental web sites are offline. I'm slowly putting some back up, driven mainly by what my current interests are, what people have asked for, or things linked to by my various blogs. So far, these pages are back online:

A lot of the other stuff will have to wait.

I hate computers...

Tuesday, July 17, 2007

LSID wars

Well, the LSID discussion has just exploded in the last few weeks. I touched on this in my earlier post Rethinking LSIDs versus HTTP URI, but since then the TDWG discussion has become more vigorous (albeit mainly focussed on technical issues, although I suspect that these are symptoms of a larger problem), while the public-semweb-lifesci@w3.org list for July has mushroomed into a near slugfest of discussion about URIs, LSIDs, OWL, the goals of the Semantic Web, etc. There are also blog posts, such as Benjamin Good's The main problem with LSIDs, Mark Wilkinson's numerous posts on his blog, and Pierre Lindenbaum's commentary.

I have no comment to make on this, I'm merely bookmarking them for when I find the time to wade through all this...

Friday, July 13, 2007

Phyloinformatics Workshop in Edinburgh October 22-24 2007


This October 22-24 there is a phyloinformatics workshop at the e-Science Institute in Edinburgh, Scotland, hosted in conjunction with the Isaac Newton Institute for Mathematical Sciences's Phylogenetics Programme.
As phylogenetics scales up to grapple with the tree of life, new informatics challenges have emerged. Some are essentially algorithmic - the underlying problem of inferring phylogeny is computationally very hard. Large trees not only pose computational problems, but can be hard to visualise and navigate efficiently. Methodological issues abound, such as what is the most efficient way to mine large databases for phylogenetic analysis, and is the "tree of life" the appropriate metaphor given evidence for extensive lateral gene transfer and hybridisation between different branches of the tree. Phylogenies themselves are intrinsically interesting, but their real utility to biologists comes when they are integrated with other data from genomics, geography, stratigraphy, ecology, and development. This poses informatics challenges, ranging from the more general problem of integrating diverse sources of biological data, to how best to store and query phylogenies. Can we express phylogenetic queries using existing database languages, or is it time for a phylogenetic query language? All these topics can be gathered together under the heading "phyloinformatics". This workshop brings together researchers with backgrounds in biology, computer science, databasing, and mathematics. The aim is to survey the state of the art, present new results, and explore more closely the connections between these topics. The 3 day workshop will consist of 10 talks from invited experts (45 minutes each), plus 3 group discussion sessions (45 mins - 1 hour each). A poster session will be held in the middle of the meeting for investigators who wish to present their results, and there will also be time set aside for additional discussion and interaction.

The invited speakers are:

For more details visit the web site.

Sunday, June 10, 2007

Making taxonomic literature available online

Based on my recent experience developing an OpenURL service (described here, here, and here), linking this to a reference parser and AJAX tool (see David Shorthouse's description of how he did this), and thoughts on XMP, maybe it's time to try and articulate how this could be put together to make taxonomic literature more accessible.

Details below, but basically I think we could make major progress by:

  1. Creating an OpenURL service that knows about as many articles as possible, and has URLs for freely available digital versions of those articles.

  2. Creating a database of articles that the OpenURL service can use.

  3. Creating tools to populate this database, such as bibliography parsers (http://bioguid.info/references is a simple start).

  4. Assigning GUIDs to references.



Background

Probably the single biggest impediment to basic taxonomic research is lack of access to the literature. Why is it that if I'm looking at a species in a database or a search engine result, I can't click through to the original description of that species? (See my earlier grumble about this). Large-scale efforts like the Biodiversity Heritage Library (BHL) will help, but the current focus on old (pre-1923) literature will severely limit the utility of this project. Furthermore, BHL doesn't seem to have dealt with the GUID issue, or the findability issue (how do I know that BHL has a particular paper?).

My own view is that most of this stuff is quite straightforward to deal with, using existing technology and standards, such as OpenURL and SICIs. The major limitation is availability of content, but there is a lot of stuff out there, if we know where to look.

GUIDs

Publications need GUIDs, globally unique identifiers that we can use to identify papers. There are several kinds of GUID already being used, such as DOIs and Handles. As a general GUID for articles, I've been advocating SICIs.

For example, the Nascimento et al. paper I discussed in an earlier post on frog names has the SICI 0365-4508(2005)63<297>2.0.CO;2-2. This SICI comprises the ISSN of the serial (in this example 0365-4508 is the ISSN for Arquivos do Museu Nacional, Rio de Janeiro), the year of publication (2005), volume (63), and the starting page (297), plus various other bits of administrivia such as check digits. For most articles this combination of four elements is enough to uniquely define an article.

SICIs can be generated easily, and are free (unlike DOIs). They don't have a resolution mechanism, but one could add support for them to an OpenURL resolver. For more details on SICIs I've some bookmarks on del.icio.us.
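As a sketch of how the core of such a SICI can be assembled from those four elements (the "2.0.CO;2" tail mirrors the example above; the trailing check character defined by ANSI/NISO Z39.56 is omitted here, and a full implementation would append "-" plus that character):

def make_sici(issn, year, volume, start_page):
    # e.g. make_sici("0365-4508", 2005, 63, 297) -> "0365-4508(2005)63<297>2.0.CO;2"
    return "%s(%d)%d<%d>2.0.CO;2" % (issn, year, volume, start_page)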

Publisher archiving

A number of scientific societies and museums already have literature online, some of which I've made use of (e.g., the AMNH's Bulletins and Novitates, and the journal Psyche). My OpenURL service knows about some 8000 articles, based on a few days' work. But my sense is that there is much more out there. All this needs is some simple web crawling and parsing to build up a database that an OpenURL service can use.

Personal and communal archiving

Another, complementary approach is for people to upload papers in their collection (for example, papers they have authored or have digitised). There are now sites that make this as easy as uploading photos. For example, Scribd is a Flickr-like site where you can upload, tag, and share documents. As a quick test, I uploaded Nascimento et al., which you can see here: http://www.scribd.com/doc/99567/Nascimento-p-297320. Scribd uses Macromedia Flashpaper to display documents.

Many academic institutions have their own archiving programs, such as ePrints (see for example the ePrints repository at my own institution).

The trick here is to link these to an OpenURL service. Perhaps the taxonomic community should think about a service very like Scribd, which would at the same time update the OpenURL service every time an article becomes available.

Summary

I suspect that none of this is terribly hard; most of the issues have already been solved, and it's just a case of gluing the bits together. I also think it's a case of keeping things simple, and resisting the temptation to make large-scale portals, etc. It's a matter of simple services that can be easily used by lots of people. In this way, I think the imaginative way David Shorthouse made my reference parser and OpenURL service trivial for just about anybody to use is a model for how we will make progress.

Saturday, June 09, 2007

Rethinking LSIDs versus HTTP URI

The TDWG-GUID mailing list for this month has a discussion of whether TDWG should commit to LSIDs as the GUID of choice. Since the first GUID workshop TDWG has pretty much been going down this route, despite a growing chorus of voices (including mine) that LSIDs are not first class citizens of the Web, and don't play well with the Semantic Web.

Leaving aside political considerations (this stuff needs to be implemented as soon as possible, concerns that if TDWG advocates HTTP URIs people will just treat them as URLs and miss the significance of persistence and RDF, worries that biodiversity will be ghettoised if it doesn't conform to what is going on elsewhere), I think there is a way to resolve this that may keep most people happy (or at least, they could live with it). My perspective is driven by trying to separate the needs of primary data providers from those of application developers, and issues of digital preservation.

I'll try and spell out the argument below, but to cut to the chase, I will argue

  1. A GUID system needs to provide a globally unique identifier for an object, and a means of retrieving information about that object.

  2. Any of the current technologies we've discussed (LSIDs, DOIs, Handles) do this (to varying degrees), hence any would do as a GUID.

  3. Most applications that use these GUIDs will use Semantic Web tools, and hence will use HTTP URIs.

  4. These HTTP URIs will be unique to the application; the GUIDs, however, will be shared.

  5. No third party application can serve an HTTP URI that doesn't belong to its domain.

  6. Digital preservation will rely on widely distributed copies of data, these cannot have the same HTTP URI.


From this I think that both parties to this debate are right, and we will end up using both LSIDs and HTTP URIs, and that's OK. Application developers will use HTTP URIs, but will use clients that can handle the various kinds of GUIDs. Data providers will use the GUID technology that is easiest for them to get up and running (for specimens this is likely to be LSIDs; for literature some providers may use Handles via DSpace, some may use URLs).

Individual objects get GUIDs

If individual objects get GUIDs, then this has implications for HTTP URIs. If the HTTP URI is the GUID, an object can only be served from one place. It may be cached elsewhere, but that cached copy can't have the same HTTP URI. Any database that makes use of the HTTP URI cannot serve that HTTP URI itself; it needs to refer to it in some way. This being the case, whether the GUID is an HTTP URI or not starts to look a lot less important, because there is only one place we can get the original data from -- the original data provider. Any application that builds on this data will need its own identifier if people are going to make use of that application's output.

Connotea as an example

As a concrete example, consider Connotea. This application uses dereferenceable GUIDs such as DOIs and Pubmed ids to retrieve publications. DOIs and Pubmed ids are not HTTP URIs, and hence aren't first class citizens of the Web. But Connotea serves its own records as HTTP URIs, and URIs with the prefix "rss" return RDF (like this) and hence can be used "as is" by Semantic Web tools such as Sparql.

If we look at some Connotea RDF, we see that it contains the original DOIs and Pubmed ids.


This means that if two Connotea users bookmark the same paper, we could deduce that they are the same paper by comparing the embedded GUIDs. In the same way, we could combine RDF from Connotea and another application (such as bioGUID) that has information on the same paper. Why not use the original GUIDs? Well, for starters there are two of them (info:pmid/17079492 and info:doi/10.1073/pnas.0605858103), so which to use? Secondly, they aren't HTTP URIs, and if they were we'd go straight to CrossRef or NCBI, not Connotea. Lastly, we lose the important information that the bookmarks are different -- they were made by two different people (or agents).

Applications will use HTTP URIs

We want to go to Connotea (and Connotea wants us to go to it) because it gives us additional information, such as the tags added by users. Likewise, bioGUID adds links to sequences referred to in the paper. Web applications that build on GUIDs want to add value, and need to add value partly because the quality of the original data may suck. For example, metadata provided by CrossRef is limited, DiGIR providers manage to mangle even basic things like dates, and in my experience many records provided by DiGIR sources that lack geocoordinates have, in fact, been georeferenced (based on reading papers about those specimens). The metadata associated with Handles is often appallingly bad, and don't get me started on what utter gibberish GenBank has in its specimen voucher fields.

Hence, applications will want to edit much of this data to correct and improve it, and to make that edited version available they will need their own identifiers, i.e. HTTP URIs. This ranges from social bookmarking tools like Connotea, to massive databases like FreeBase.

Digital durability

Digital preservation is also relevant. How do we ensure our digital records are durable? Well, we can't ensure this (see Clay Shirky's talk at LongNow), but one way to make them more durable is massive redundancy -- multiple copies in many places. Indeed, given the limited functionality of the current GBIF portal, I would argue that GBIF's main role at present is to make specimen data more durable. DiGIR providers are not online 24/7, but if their data are in GBIF those data are still available. Of course, GBIF could not use the same GUID as the URI for that data; like Connotea it would have to store the original GUID in the GBIF copy of the record.

In the same way, the taxonomic literature of ants is unlikely to disappear anytime soon, because a single paper can be in multiple places. For example, Engel et al.'s paper on ants in Cretaceous Amber is available in at least four places:

Which of the four HTTP URIs you can click on should be the GUID for this paper? -- none of them.

LSIDs and the Semantic Web

LSIDs don't play well with the Semantic Web. My feeling is that we should just accept this and move on. I suspect that most users will not interact directly with LSID servers, they will use applications and portals, and these will serve HTTP URIs which are ideal for Semantic Web applications. Efforts to make LSIDs compliant by inserting owl:sameAs statements and rewriting rdf:resource attributes using a HTTP proxy seem to me to be misguided, if for no other reason than one of the strengths of the LSID protocol (no single point of failure, other than the DNS) is massively compromised because if the HTTP proxy goes down (or if the domain name tdwg.org is sold) links between the LSID metadata records will break.

Having a service such as a HTTP proxy that can resolve LSIDs on the fly and rewrite the metadata to become HTTP-resolvable is fine, but to impose an ugly (and possibly short term) hack on the data providers strikes me as unwise. The only reason for attempting this is if we think the original LSID record will be used directly by Semantic web applications. I would argue that in reality, such applications may harvest these records, but they will make them available to others as part of a record with a HTTP URI (see Connotea example).

Conclusions


I think my concerns about LSIDs (and I was an early advocate of LSIDs, see doi:10.1186/1471-2105-6-48) stem from trying to marry them to the Semantic Web, which seems the obvious technology for constructing applications to query lots of distributed metadata. But I wonder if the mantra of "dereferenceable identifiers" can sometimes get in the way. ISBNs given to books are not, of themselves, dereferenceable, but serve very well as identifiers of books (same ISBN, same book), and there are tools that can retrieve metadata given an ISBN (e.g., LibraryThing).

In a world of multiple GUIDs for the same thing, and multiple applications wanting to talk about the same thing, I think clearly separating identifiers from HTTP URIs is useful. For an application such as Connotea, a data aggregator such as GBIF, a database like FreeBase, or a repository like the Internet Archive, HTTP URIs are the obvious choice (if I use a Connotea HTTP URI I want Connotea's data on a particular paper). For GUID providers, there may be other issues to consider.

Note that I'm not saying that we can't use HTTP URIs as GUIDs. In some, perhaps many cases they may well be the best option as they are easy to set up. It's just that I accept that not all GUIDs need be HTTP URIs. Given the arguments above, I think the key thing is to have stable identifiers for which we can retrieve associated metadata. Data providers can focus on providing those, application developers can focus on linking them and their associated metadata together, and repackaging the results for consumption by the cloud.