Friday, March 24, 2017

This is what phylodiversity looks like

Following on from earlier posts exploring how to map DNA barcodes and putting barcodes into GBIF it's time to think about taking advantage of what makes barcodes different from typical occurrence data. At present GBIF displays data as dots on a map (as do I in But barcodes come with a lot more information than that. I'm interested in exploring how we might measure and visualise biodiversity using just sequences.

Based on a talk by Zachary Tong (Going Organic - Genomic sequencing in Elasticsearch) I've started to play with n-gram searches on DNA barcodes using Elasticsearch, an open source search engine. The idea is that we break the DNA sequence into every possible "word" of length n (also called a k-mer or k-tuple, where k = n).

For example, for n = 5, the sequence GTATCGGTAACGAACTT would look like this:



The sequence GTATCGGTAACGAACTT comes from Hajibabaei and Singer (2009) who discussed "Googling" DNA sequences using search engines (see also Kuksa and Pavlovic, 2009). If we index sequences in this way then we can do BLAST-like searches very quickly using Elasticsearch. This means it's feasible to take a DNA barcode and ask "what sequences look like this?" and return an answer qucikly enoigh for a user not to get bored waiting.

Another nice feature of Elasticsearch is that it supports geospatial queries, so we can ask for, say, all the barcodes in a particular region. Having got such a list, what we really want is not a list of sequences but a phylogenetic tree. Traditionally this can be a time consuming operation, we have to take the sequences, align them, then input that alignment into a tree building algorithm. Or do we?

There's growing interest in "alignment-free" phylogenetics, a phrase I'd heard but not really followed up. Yang and Zhang (2008) described an approach where every sequences is encoded as a vector of all possible k-tuples. For DNA sequences k = 5 there are 45 = 1024 possible combinations of the bases A, C, G, and T, so a sequence is represented as a vector with 1024 elements, each one is the frequency of the corresponding 5-tuple. The "distance" between two sequences is the mathematical distance between these vectors for the two sequences. Hence we no longer need to align the sequences being comapred, we simply chunk them into all "words" of 5 bases in length, and compare the frequencies of the 1024 different possible "words".

In their study Yang and Zhang (2008) found that:

We compared tuples of different sizes and found that tuple size 5 combines both performance speed and accuracy; tuples of shorter lengths contain less information and include more randomness; tuples of longer lengths contain more information and less random- ness, but the vector size expands exponentially and gets too large and computationally inefficient.

So we can use the same word size for both Elasticsearch indexing and for computing the distance matrix. We still need to create a tree, for which we could use something quick like neighbour-joining (NJ). This method is sufficiently quick to be available in Javascript and hence can be computed by a web browser (e.g., biosustain/neighbor-joining).

Putting this all together, I've built a rough-and-ready demo that takes some DNA barcodes, puts them on a map, then enables you to draw a box on a map and the demo will retrieve the DNA barcodes in that area, compute a distance matrix using 5-tuples, then build a NJ tree, all on the fly in your web browser.

This is all very crude, and I need to explore scalability (at the moment I limit the results to the first 200 DNA sequences found), but it's encouraging. I like the idea that, in principle, we could go to any part of the globe, ask "what's there?" and get back a phylogenetic tree for the DNA barcodes in that area.

This also means that we could start exploring phylogenetic diversity using DNA barcodes, as Faith & Baker (2006) wanted a decade ago:

...PD has been advocated as a way to make the best-possible use of the wealth of new data expected from large-scale DNA “barcoding” programs. This prospect raises interesting bio-informatics issues (discussed below), including how to link multiple sources of evidence for phylogenetic inference, and how to create a web-based linking of PD assessments to the barcode–of-life database (BoLD).

The phylogenetic diversity of an area is essentially the length of the tree of DNA barcodes, so if we build a tree we have a measure of diversity. Note that this contrasts with other approaches, such as Miraldo et al.'s "An Anthropocene map of genetic diversity" which measured genetic diversity within species but not between (!).

Practical issues

There are a bunch of practical issues to work through, such as how scalable it is to compute phylogenies using Javascript on the fly. For example, could we do something like generate a one degree by one degree grid of the Earth, take all the barcodes in each cell and compute a phylogeny for each cell? Could we do this in CouchDB? What about sampling, should we be taking a finite, random sample of sequences so that we try and avoid sampling bias?

There are also data management issues. I'm exploring downloading DNA barcodes, creating a Darwin Core Archive file using the Global Genome Biodiversity Network (GGBN) data standard, then converting the Darwin Core Archive into JSON and sending that to Elasticsearch. The reason for the intermediate step of creating the archive is so that we can edit the data, add missing geospatial informations, etc. I envisage having a set of archives, hosted say on GitHub. These archives could also be directly imported into GBIF, ready for the time that GBIF can handle genomic data.


  • Faith, D. P., & Baker, A. M. (2006). Phylogenetic diversity (PD) and biodiversity conservation: some bioinformatics challenges. Evol Bioinform Online. 2006; 2: 121–128. PMC2674678
  • Hajibabaei, M., & Singer, G. A. (2009). Googling DNA sequences on the World Wide Web. BMC Bioinformatics. Springer Nature.
  • Kuksa, P., & Pavlovic, V. (2009). Efficient alignment-free DNA barcode analytics. BMC Bioinformatics. Springer Nature.
  • Miraldo, A., Li, S., Borregaard, M. K., Florez-Rodriguez, A., Gopalakrishnan, S., Rizvanovic, M., … Nogues-Bravo, D. (2016, September 29). An Anthropocene map of genetic diversity. Science. American Association for the Advancement of Science (AAAS).
  • Yang, K., & Zhang, L. (2008, January 10). Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Research. Oxford University Press (OUP).

Notes for WikiCite 2017: Wikispecies reference parsing

Wikispecies logo svg In preparation for WikiCite 2017 I'm looking more closely at extracting bibliographic information from Wikispecies. The WikiCite project "is a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects". One reason for doing this is so that each factual statement in WikiData can be linked to evidence for that statement. Practical efforts towards this goal include tools to add details of articles from CrossRef and PubMed straight into Wikidata, and tools to extract citations from Wikipedia (as these are likely to be sources of evidence for statements made in Wikipedia articles).

Wikispecies occupies a rather isoldated spot in the Wiikipedia landscape. Unlike other sites which are essentially comprehensive encyclopedias in different languages, Wikispecies focusses on one domain - taxonomy. In a sense, it's a prototype of Wikidata in that it provides basic facts (who described what species when, and what is the classification of those species) that in principle can be reused by any of the other wikis. However, in practice this doesn't seem to have happened much.

What Wikispecies has become, however, is a crowd-sourced database of the taxonomic literture. For someone like me who is desparately gathering up bibliographic data so that I can extract articles from the Biodiversity Heritage Library (BHL), this is a potential goldmine. But, there's a catch. Unlike, say, the English language Wikipedia which has a single widely-used template for describing a publication, Wikispecies has it's own method of representing articles. It uses a somewhat confusing mix of templates for author names, and then uses barely standardised formatting rules to mark out parts of a publication (such as journal, volume, issue, etc.). Instead of a single template to describe a publication, in Wikispecies a publication my itself be described by a unique template. This has some advantages, in that the same reference can be transcluded into multiple articles (in other words, you enter the bibliographic details once). But this leaves us with many individual templates with multiple, idiosyncratic styles of representing bibliographic data. Some have tried to get the Wikispecies community to adopt the same template as Wikipedia (see e.g., this discussion) but this proposal has met with a lot of resistance. From my perspective as a potential consumer of data, the current situation in Wikispecies is frustrating, but the reality is that the people who create the content get to decide how they structure that content. And understandably, they are less than impressed by requests that might help others (such as data miners) at the expense of making their own work more difficult.

In summary, if I want to make use of Wikispecies I am going to need to develop a set of parsers than can make a reasonable fist of parsing all the myriad citation formats used in Wikispecies (my first attempts are on GitHub). I'm looking at parsing the references and converting them to a more standard format in JSON (I've made some notes on various bibliographic formats in JSON such as BibJSON and CSL-JSON). One outcome of this work will be, I hope, more articles discovered in BHL and hence added to BioStor), and more links to identifiers, which could be fed back into Wikispecies. I also want to explore linking the authors of these papers to identifiers, as already sketched out in The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor.