Sunday, April 26, 2009

H1N1 Swine Flu TimeMap

Tweets from @attilacsordas and @stew alerted me to the Google Map of the H1N1 Swine Flu outbreak by niman.

Ryan Schenk commented: "It'd be a million times more useful if that map was hooked into a timeline so you could see the spread", which inspired me to knock together a timemap of swine flu. It takes the RSS feed from niman's map and displays it using Nick Rabinowitz's Timemap library.



Gotcha
Although in principle this should have been a trivial exercise (cutting and pasting from existing examples), it wasn't quite so straightforward. The Google Maps RSS feed is a GeoRSS feed, but initially I couldn't get Timemap to accept it. The contents of the <georss:point> tag in the Google Maps feed look like this:

<georss:point>
33.041477 -116.894531
</georss:point>

Turns out there's a minor bug in the file timemap.js, which I fixed by adding coords = TimeMap.trim(coords); before line 1369. The contents of the <georss:point> tag include leading whitespace, and because timemap.js splits the latitude and longitude using whitespace, Google's feed breaks the code.
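To see why the trim matters, here is a standalone illustration of the parsing problem (just the same split logic, not the actual timemap.js code):

// The coordinate string from Google's feed is wrapped in newlines, so
// splitting on whitespace without trimming yields an empty first element,
// which then gets treated as the latitude.
const raw = "\n33.041477 -116.894531\n";   // contents of <georss:point>

const broken = raw.split(/\s+/);           // ["", "33.041477", "-116.894531", ""]
const fixed  = raw.trim().split(/\s+/);    // ["33.041477", "-116.894531"]
                                           // (TimeMap.trim() does the equivalent)
console.log(broken[0], fixed[0]);          // "" versus "33.041477"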

Postscript
Nick Rabinowitz has fixed this bug.

Tuesday, April 21, 2009

GBIF and Handles: admitting that "distributed" begets "centralized"

The problem with this ... is that my personal and unfashionable observation is that “distributed” begets “centralized.” For every distributed service created, we’ve then had to create a centralized service to make it useable again (ICANN, Google, Pirate Bay, CrossRef, DOAJ, ticTocs, WorldCat, etc.).
--Geoffrey Bilder interviewed by Martin Fenner

Thinking about the GUID mess in biodiversity informatics, stumbling across some documents about the PILIN (Persistent Identifier Linking INfrastructure) project, and still smarting from problems getting hold of specimen data, I thought I'd try and articulate one solution.

Firstly, I think biodiversity informatics has made the same mistake as digital librarians in thinking that people care where they get information from. We don't, in the sense that I don't care whether I get the information from Google or my local library, I just want the information. In this context local is irrelevant. Nor do I care about individual collections. I care about particular taxa, or particular areas, but not collections (likewise, I may care about philosophy, but not philosophy books at Glasgow University Library). I think the concern for local has led to an emphasis on providing complex software to each data provider that supports operations (such as search) that don't scale (live federated search simply doesn't work), at the expense of focussing on simple solutions that are easy to use.

In a (no doubt unsuccessful) attempt to think beyond what I want, let's imagine we have several people/organisations with interests in this area. For example:

Imagine I am an occasional user. I see a specimen referred to, say a holotype, and I want to learn more about that specimen. Is there some identifier I can use to find out more? I'm used to using DOIs to retrieve papers, so what about specimens? So, I want:
  1. identifiers for specimens so I can retrieve more information

Imagine I am a publisher (which can be anything from a major commercial publisher to a blogger). I want to make my content more useful to my readers, and I've noticed that others are doing this, so I'd better get on board. But I don't want to clutter my content with fragile links -- and if a link breaks I want it fixed, or I want a cached copy (hence the use of WebCite by some publishers). If I want a link fixed I don't want to have to chase up individual providers, I want one place to go (as I do for references if a DOI breaks). So, I want:
  1. stable links with some guarantee of persistence
  2. somebody who will take responsibility to fix the broken ones

Imagine I am a data provider. I want to make my data available, but I want something simple to put in place (I have better things to do with my time, and my IT department keep a tight grip on the servers). I would also like to be able to show my masters that this is a good thing to do, for example by being able to present statistics on how many times my data has been accessed. I'd like identifiers that are meaningful to me (maybe carry some local "branding"). I might not be so keen on some central agency serving all my data as if it was theirs. So, I want:
  1. simplicity
  2. option to serve my own data with my own identifiers

Imagine I am a power user. I want lots of data, maybe grouped in ways that the data providers hadn't anticipated. I'm in a hurry, so I want to get this stuff quickly. So I want:
  1. convenient, fast APIs to fetch data
  2. flexible search interfaces would be nice, but I may just download the data because it's probably quicker if I do it myself

Imagine I am an aggregator. I want data providers to have a simple harvesting interface so that I can grab the data. I don't need a search interface to their data because I can do it much faster if I have the data locally (federated search sucks). So I want:
  1. the ability to harvest all the data ("all your data are belong to me")
  2. a simple way to update my copy of provider's data when it changes


It's too late in the evening for me to do this justice, but I think a reasonable solution is this:
  1. Individual data providers serve their data via URLs, ideally serving a combination of HTML and RDF (i.e., linked data), but XML would be OK
  2. Each record (e.g., specimen) has an identifier that is locally unique, and the identifier is resolvable (for example, by simply appending it to a URL)
  3. Each data provider is encouraged to reuse existing GUIDs wherever possible (e.g., for literature (DOIs) and taxonomic names) to make their data "meshable"
  4. Data providers can be harvested, either completely or just for records modified after a given date
  5. A central aggregator (e.g., GBIF) aggregates all specimen/observation data. It uses Handles (or DOIs) to create GUIDs, comprising a naming authority (one for each data provider), and an identifier (supplied by the data provider, may carry branding, e.g. "antweb:casent0100367"), so an example would be "hdl:1234567/antweb:casent0100367" or "doi:10.1234/antweb:casent0100367". Note that this avoids labeling these GUIDs as, say, http://gbif.org/1234567/antweb:casent0100367
  6. Handles resolve to the data provider's URL, but the aggregator's cached copy of the metadata may be used if the data provider is offline (see the sketch below)
  7. Publishers use "hdl:1234567/antweb:casent0100367" (i.e., authors use this when writing manuscripts), as they can harass the central aggregator if links break
  8. The central aggregator is responsible for generating reports to providers on how their data has been used, e.g. how many times it has been "cited" in the literature
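To make points 5-7 a little more concrete, here is a minimal sketch of how a client might build and resolve one of these GUIDs (JavaScript, Node 18+ for the global fetch). The naming authority "1234567", the aggregator cache URL and the fallback behaviour are made up for illustration; only the hdl.handle.net proxy is a real service.

// GUID = Handle naming authority (one per data provider) plus the
// provider's own, possibly branded, identifier.
const exampleGuid = "hdl:1234567/antweb:casent0100367";   // hypothetical example from the text

async function resolveSpecimen(guid) {
  const handle = guid.replace(/^hdl:/, "");
  try {
    // Primary route: the Handle proxy redirects to the provider's own record.
    const res = await fetch(`https://hdl.handle.net/${handle}`);
    if (res.ok) return await res.text();
  } catch (err) {
    // Provider (or proxy) unreachable -- fall through to the cached copy.
  }
  // Fallback: the central aggregator's cached metadata (placeholder URL).
  const cached = await fetch(`https://aggregator.example.org/cache/${encodeURIComponent(handle)}`);
  return cached.text();
}

// resolveSpecimen(exampleGuid).then(console.log);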

So, GBIF (or whoever steps up to the plate) would use handles (or DOIs). This gives them the tools to manage the identifiers, plus it tells the world that we are serious about this. Publishers can trust that the links to millions of specimen records won't disappear. Providers don't have complex software to install, removing one barrier to making more data available.

I think it's time we made a serious effort to address these issues.

CrossRef fail - at least we're not alone...

CrossRef has been having some issues with its OpenURL resolver over the weekend, which means that attempts to retrieve metadata from a DOI, or to find a DOI from metadata, have been thwarted. While annoying (see The dangers of the ‘free’ cloud: The Case of CrossRef), in one sense it's reassuring that it's not just biodiversity data providers that are having problems with service availability.

Monday, April 20, 2009

Connotea tags

For fun I quickly programmed a little tool for bioGUID that makes use of Connotea's web API. When an article is displayed, the page loads a Javascript script that makes a call to a simple web service that looks up a reference in Connotea and displays a tag cloud if the reference is found. For example, the paper announcing Zoobank (doi:10.1038/437477a) looks like this:


The reference has been bookmarked by 6 people, using 15 tags, some more popular than others. The tags and users are linked to Connotea.

This service can be accessed at http://bioguid.info/services/connotea.php?uri=<doi here>, for example http://bioguid.info/services/connotea.php?uri=doi:10.1038/437477a. By default it returns JSON (you can also set the name of the callback function by adding a &callback= parameter), but you can get HTML by adding &format=html. The HTML is also included in the JSON result, if you want to quickly display something rather than roll your own.

Basically the service takes the DOI you supply, converts it to an MD5 hash, then looks it up in Connotea. There were a few little "gotchas", such as the fact that a Connotea user may have bookmarked "doi:10.1038/437477a" or the proxied version "http://dx.doi.org/10.1038/437477a", and these have different MD5 hashes. My service tries both variations and merges the results.
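For the curious, the core of the lookup is only a few lines. Here is a minimal sketch in JavaScript (Node 18+; the real service is written differently). It assumes Connotea's /data/uri/<md5-of-URI> lookup and glosses over details such as authentication and parsing of the response.

// Look up a DOI in Connotea by hashing both of the forms a user might
// have bookmarked, then fetching the bookmarks for each hash.
const crypto = require("crypto");

const md5 = (s) => crypto.createHash("md5").update(s).digest("hex");

async function connoteaLookup(doi) {
  // "doi:10.1038/437477a" and "http://dx.doi.org/10.1038/437477a" are
  // different strings, so they hash to different values.
  const variants = [`doi:${doi}`, `http://dx.doi.org/${doi}`];
  const results = [];
  for (const uri of variants) {
    const res = await fetch(`http://www.connotea.org/data/uri/${md5(uri)}`);
    if (res.ok) results.push(await res.text());   // merge/parse as needed
  }
  return results;
}

// connoteaLookup("10.1038/437477a").then(console.log);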

Accessing specimens using TAPIR or, why do we make this so hard?

OK, second rant of the day. One of my favourite online specimen databases is AntWeb. For a while it hasn't been possible to harvest data from this database using the venerable DiGIR protocol, due to various issues at the California Academy of Sciences. Well, now it's back, and "accessible" using TAPIR (the TDWG Access Protocol for Information Retrieval). Accessible, that is, if you like horrifically over-engineered, poorly documented standards. OK, a lot of work has gone into TAPIR, there's lots of great code on SourceForge, and there's lots of documentation, but I've really struggled to get the most basic tasks done.

For example, let's imagine I want to retrieve the information on the ant specimen CASENT0100367 (note how trivial this is via a web browser: just append the specimen name to http://www.antweb.org/specimen.do?name=). After much clenching of teeth struggling with the TAPIR documentation and the TAPIR client software, I finally found an email by Markus Döring that gave me the clue. If I'm going to construct a URL to retrieve this specimen record, I need to include the URL of an XML document that serves as a template for the query. Since one doesn't exist, I have to create it and make it accessible to the TAPIR server (i.e., the AntWeb TAPIR server needs to access it, so I have to place this XML document on my web server). The template (shown below) lives at http://bioguid.info/tapir/dwc_catalog_number.xml:

<?xml version="1.0" encoding="UTF-8"?>
<searchTemplate xmlns="http://rs.tdwg.org/tapir/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xsi:schemaLocation="http://rs.tdwg.org/tapir/1.0
                        http://rs.tdwg.org/tapir/1.0/schema/tapir.xsd
                        http://www.w3.org/2001/XMLSchema
                        http://www.w3.org/2001/XMLSchema.xsd">
  <label>Scientific name in query</label>
  <documentation>Query for a Scientific Name. Based on http://rs.tdwg.org/tapir/cs/dwc/1.4/template/dwc_sci_name_range.xml, found in email by Markus Döring http://lists.tdwg.org/pipermail/tdwg-tapir/2008-April/000493.html</documentation>
  <externalOutputModel location="http://rs.tdwg.org/tapir/cs/dwc/1.4/model/dw_core_geo_cur.xml"/>
  <filter>
    <equals>
      <concept id="http://rs.tdwg.org/dwc/dwcore/CatalogNumber"/>
      <parameter name="name"/>
    </equals>
  </filter>
</searchTemplate>

Now I can write my query: http://www.antweb.org/tapirlink/www/tapir.php/antweb
op=search
&start=0
&limit=1
&template=http://bioguid.info/tapir/dwc_catalog_number.xml
&name=casent0100367

So, the AntWeb server is going to read this query, and call my web server to get the query template to figure out what I actually want. Am I the only person who thinks that this is insane? Can anybody imagine going through these hoops to access a GenBank record, or a PubMed record?
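For what it's worth, here is roughly what a client ends up doing, sketched in JavaScript (Node 18+ for fetch and URLSearchParams; the URLs are the ones given above, and parsing the TAPIR XML response is left out):

// Ask AntWeb's TAPIR endpoint for one specimen record. Note that the
// "template" parameter points at an XML file on *my* server, which the
// AntWeb server must fetch before it can interpret the query.
const endpoint = "http://www.antweb.org/tapirlink/www/tapir.php/antweb";

async function getSpecimen(catalogNumber) {
  const params = new URLSearchParams({
    op: "search",
    start: "0",
    limit: "1",
    template: "http://bioguid.info/tapir/dwc_catalog_number.xml",
    name: catalogNumber,
  });
  const res = await fetch(`${endpoint}?${params}`);
  return res.text();   // a TAPIR XML response wrapping the Darwin Core record
}

// getSpecimen("casent0100367").then(console.log);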

Perhaps it's me, and my obsession with linking individual data records (rather than harvesting lots of records, or federated search). But it strikes me that harvesting is a simple task and not many people will be doing it (at least, not on the scale of GBIF), and federated search is a non-starter as our community can't keep data providers online to save themselves.

In many ways I think TAPIR (and DiGIR before it) missed what for me is the most basic use case, namely: I have a specimen identifier and I want to get the record for that specimen. These services make it much harder than it needs to be. It's a symptom of our field's inability to deliver simple tools that do basic tasks well, rather than overly general and highly complex tools that are poorly documented. Of course, retrieving individual records would be easy if we had resolvable GUIDs for specimens, but we've singularly failed to deliver that, so we are stuck with very clunky tools. There's got to be a better way...

Semantic Publishing: towards real integration by linking

PLoS Computational Biology has recently published "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" (doi:10.1371/journal.pcbi.1000361) by David Shotton and colleagues. As a proof of concept, they took Reis et al. (doi:10.1371/journal.pntd.0000228) and "semantically enhanced" it:
These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps.
The enhanced article is here: doi:10.1371/journal.pntd.0000228.x001. For background on these enhancements, see also David's companion article "Semantic publishing: the coming revolution in scientific journal publishing" (doi:10.1087/2009202, PDF preprint available here). The process is summarised in the figure below (Fig. 10 from Shotton et al., doi:10.1371/journal.pcbi.1000361.g010).



While there is lots of cool stuff here (see also Elsevier's Article 2.0 Contest, and the Grand Challenge, for which David is one of the judges), I have a couple of reservations.

The unique role of the journal article?

Shotton et al. argue for a clear distinction between journal article and database, in contrast to the view articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging. I tend to favour the latter viewpoint. Indeed, as I argued in my Elsevier Challenge entry (doi:10.1038/npre.2008.2579.1), I think we should publish articles (and indeed data) as wikis, so that we can fix the inevitable errors. We can always roll back to the original version if we want to see the author's original paper.

Real linking

But my real concern is that the example presented is essentially "integration by linking", that is, the semantically enhanced version gives us lots of links to other information, but these are regular hyperlinks to web pages. So, essentially we've gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. example). I'm a tad underwhelmed.
What bothers me about this is:
  1. The links are to web pages, so it will be hard to do computation on these (unless the web page has easily retrievable metadata)
  2. There is no reciprocal linking -- the resource being linked to doesn't know it is the target of the link


Web pages are for humans

The first concern is that the marked-up article is largely intended for human readers. Yes, there are associated metadata files in RDF N3, but the core "added value" is really only of use to humans. For it to be of use to a computer, the links would have to go to resources that the computer can understand. A human clicking on many of the links will get a web page and they can interpret that, but computers are thick and they need a little help. For example, one hyperlinked term is Leptospira spirochete, linked to the uBio namebank record (click on the link to see it). The link resolves to a web page, so it's not much use to a computer (unless it has a scraper for uBio HTML). Ironically, uBio serves LSIDs, so we could retrieve RDF metadata for this name (urn:lsid:ubio.org:namebank:255659), but there's nothing in the uBio web page that tells the computer that.

Of course, Shotton et al. aren't responsible for the fact that most web pages aren't easily interpreted by computers, but simply embedding links to web pages isn't a big leap forward. What could they have done instead? One approach is to link to resources that are computer-readable. For example, instead of linking the term "Oswaldo Cruz Foundation" to that organisation's home page (http://www.fiocruz.br/cgi/cgilua.exe/sys/start.htm?tpl=home), why not use the DBpedia URI http://dbpedia.org/page/Instituto_Oswaldo_Cruz? Now we get both a human-readable page, and extensive RDF that a computer can use. In other words, if we crawl the semantically enhanced PLoS article with a program, I want to be able to have that crawler follow the links and still get useful information, not the dead end of an HTML web page. Quite a few of the institutions listed in the enhanced paper have DBpedia URIs:


Why does this matter? Well, if you use DBpedia URIs you get RDF, plus you get connections with the Linked Data crowd, who are rapidly linking diverse data sets together:


I think this is where we need to be headed, and with a little extra effort we can get there, once we move on from thinking solely about human readers.
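The practical difference is easy to show. A crawler can ask a DBpedia URI for RDF via content negotiation, something an ordinary institutional home page cannot offer. A minimal sketch (JavaScript, Node 18+; DBpedia's /resource/ URIs redirect to HTML or RDF depending on the Accept header, and fetch follows the redirect automatically):

// Follow a link the way a machine would: ask for RDF, not HTML.
async function fetchAsData(uri) {
  const res = await fetch(uri, {
    headers: { Accept: "application/rdf+xml" },   // a Linked Data client's preference
  });
  return res.text();   // RDF/XML describing the thing the URI identifies
}

// e.g. fetchAsData("http://dbpedia.org/resource/Instituto_Oswaldo_Cruz").then(console.log);
// Asking for text/html instead (as a browser does) gets the human-readable page.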

An alternative approach (and one that I played with in my Challenge entry, as well as my ongoing wiki efforts) is to create what Vandervalk et al. term a "semantic warehouse" (doi:10.1093/bib/bbn051). Information about each object of interest is stored locally, so that clicking on a link doesn't take you off-site into the world wide wilderness, but to information about that object. For example, the page for the paper Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio) lists the papers cited, clicking on one takes you to the page about that paper. There are limitations to this approach as well, but the key thing is that one could imagine doing computations over this (e.g., computing citation counts for DNA sequences, or geospatial queries across papers) that simple HTML hyperlinking won't get you.

Reciprocal links

The other big issue I have with the Shotton et al. "integration by linking" is that it is one-way. The semantically enhanced paper "knows" that it links to, say, the uBio record for Leptospira, but uBio doesn't know this. It would enhance the uBio record if it knew that doi:10.1371/journal.pntd.0000228.x001 linked to it.

Links are inherently reciprocal, in the sense that if paper 1 cites paper 2, then paper 2 is cited by paper 1.

Publishers understand this, and the web page of an article will often show lists of papers that cite the paper being displayed. How do we do this for data and other objects of interest? If we database everything, then it's straightforward. CrossRef stores citation metadata and offers a "forward linking" service, and some publishers (e.g., Elsevier and Highwire) offer their own versions of this. In the same way, this record for GenBank sequence AY322281 "knows" that it is cited by (at least) two papers because I've stored those links in a database. Knowing that you're being linked to dramatically enhances discoverability. If I'm browsing uBio I gain more from the experience if I know that the PLoS paper cites Leptospira.

Knowing when you're being linked to

If we database everything locally then reciprocal linking is easy. But, realistically, we can't database everything (OK, maybe that's not strictly true, we can think of Google as a database of everything). The enhanced PLoS paper "knows" that it cites the uBio record, but how can the uBio record "know" that it has been cited by the PLoS paper? What if the act of linking was reciprocal? How can we achieve this in a distributed world? Some possibilities:
  • we have an explicit API embedded in the link so that uBio can extract the source of the link (could be spoofed, need authentication?)
  • we use OpenURL-style links that embed the PLoS DOI, so that uBio knows the source of the link (OpenURL is a mess, but potentially very powerful)
  • uBio uses the HTTP referrer header to get the source of the link, then parses the PLoS HTML to extract metadata and the DOI (ugly screen scraping, but no work for PLoS; sketched below)
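The third option is the least demanding for the publisher. Here is a minimal sketch of what a provider like uBio could do, using Node's http module; recordIncomingLink() is a hypothetical helper standing in for the scraping and storage:

const http = require("http");

// Hypothetical helper: in a real service this would fetch the referring
// page, extract its DOI and metadata, and store the reciprocal link.
function recordIncomingLink(target, referrer) {
  console.log(`incoming link to ${target} from ${referrer}`);
}

http.createServer((req, res) => {
  const referrer = req.headers.referer;   // e.g. the PLoS article's URL
  if (referrer) recordIncomingLink(req.url, referrer);
  // ...then serve the namebank record as usual (placeholder response here).
  res.end("namebank record for " + req.url);
}).listen(8080);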

Obviously this needs a little more thought, but I think that real integration by linking requires that the resources being linked are both computer and human readable, and that both resources know about the link. This would create much more powerful "semantically enhanced" publications.

Friday, April 17, 2009

bioGUID manuscript

Finally submitted (two days late) a manuscript for the BMC Bioinformatics Special Issue on Biodiversity Informatics organised by Neil Sarkar and sponsored by EOL and CBOL. The manuscript, entitled "bioGUID: resolving, discovering, and minting identifiers for biodiversity informatics", describes my bioGUID project. If you are interested, I've made a pre-print available at Nature Precedings (hdl:10101/npre.2009.3079.1).

Thursday, April 16, 2009

LSIDs, disaster or opportunity

OK, really must stop avoiding what I'm supposed to be doing (writing a paper, already missed the deadline), but continuing the theme of LSIDs and short URLs, it occurs to me that LSIDs can be seen as a disaster (don't work in web browsers, nobody else uses them, hard to implement, etc.) or an opportunity. The URL shortening service bit.ly provides information on each shortened URL (e.g., http://bit.ly/3R6apo), such as a summary of the URL


statistics on how the URL is being accessed


conversations (who is talking about this URL)



and lastly metadata harvested from the URL using services such as Open Calais


Imagine we provided the same services for LSIDs. In other words, instead of being a simple HTTP proxy such as http://bioguid.info, the proxy would store information on how often the LSID is resolved (and by whom), where the LSID has been cited (web pages, papers, etc.), and what metadata it has (the last bit is easy, we don't need a tool like Open Calais). We could extend this to other identifiers, such as DOIs (for which we could do things like show whether the DOI has been bookmarked in Connotea, CiteULike, etc.).

Now, if one of our large projects (e.g., GBIF or EOL) showed a little bit of ambition and creativity and did something like this, we could have a cool tool. Plus, we'd be aggregating metadata along the way. I think this could lead to the first "killer app" for biodiversity informatics.

Short URLs

Short URLs have been a topic of discussion recently, perhaps sparked by the article URL Shorteners: Which Shortening Service Should You Use?. Many will have encountered short URLs in Twitter tweets. Leigh Dodds (@ldodds) asked
Remind me: why do we need short urls at all, rather than a better solution? Removing arbitrary limits (or better impl. thereof) seems better
I guess Leigh's talking about the need for short URLs in tweets, but I wonder about the more general question of why we need URL shorteners at all. Reading the Guardian (physical copy in a coffee shop) I keep coming across URLs in the text, such as bit.ly/seth52, which are short, have no "http://" prefix, and are labelled in a human-readable form (the short URLs in Seth Finkelstein's column are all of the form bit.ly/seth[n]).

It occurs to me that these URLs are almost like tags: the names have locally significant meaning, and are memorable. In a sense the URL shortening service acts as a new namespace. Imagine if you can't get a desired domain name, but can get a customised URL with that name. The tyranny of the DNS as the sole naming authority is weakened a little. In some ways this mirrors how many people use the web. Instead of typing in full domain names, they enter a search term into Google and go to the site they want (often the top hit). Imagine if Google provided a URL shortening service (in a sense their search engine is a slightly clunky one already).

The other reason I'm interested in this is ugly identifiers such as urn:lsid:zoobank.org:act:6FFAFC2C-D46B-4959-BA03-C38477B9DFF1. This version, bit.ly/polina, is a bit nicer. Plus, I get usage statistics on the short version (meaning I don't need to implement this myself). If we use the Guardian as an example, perhaps journal publishers using LSIDs such as urn:lsid:zoobank.org:act:6FFAFC2C-D46B-4959-BA03-C38477B9DFF1 would prefer to use custom, shortened URLs to make the text more readable, and collect usage statistics as well.

Wednesday, April 15, 2009

LSIDs, to proxy or not to proxy?

The LSID discussion rumbles on (see my earlier post). One issue that has re-emerged is the use of HTTP proxies in RDF documents. In a recent email Greg Whitbread wrote:

The existing TDWG recommendation that "5. All references to LSIDs within RDF documents should use the proxified form", basically states that LSID will never appear in any way other than bundled into an http URI - if we are also to publish data as RDF.

That sounds as if it means that those wanting to use LSID resolution will first have to extract the LSID part from the http URI which will now appear everywhere we would expect to find our unique identifier.

Donald [Hobern] has presented a strong case for unique identifiers conforming to the LSID specification but we have now an equally strong case that in its http form our identifier must behave as a dereferenceable URN per W3C linked data recommendations.
My own view is that the RDF should always contain a canonical, un-proxied version of an identifier (whether LSID or DOI), because:
  1. having only the proxied version assumes that there is only one suitable proxy (there may be multiple ones)
  2. it assumes that the specified proxy will always exist (our track record in durable HTTP services is poor)
  3. it assumes the specified proxy will always conform to current standards
  4. it imposes an overhead on clients that want the canonical identifier (i.e., they have to strip away the proxy)
I predict that for any meaningful, successful (read "actually used") identifier there will be multiple services capable of consuming that identifier, not just HTTP proxies. DOIs can be proxied (by several servers, including http://dx.doi.org/ and http://hdl.handle.net), resolved using OpenURL resolvers, etc.
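Point 4 is trivial but telling: any client that wants the canonical identifier has to undo the proxying first, something like this (the regular expression is mine, purely illustrative):

// Recover the canonical LSID from a proxied HTTP URI.
function canonicalLSID(uri) {
  const match = uri.match(/urn:lsid:[^\/?#]+/i);
  return match ? match[0] : null;
}

canonicalLSID("http://lsid.tdwg.org/urn:lsid:indexfungorum.org:names:21364");
// -> "urn:lsid:indexfungorum.org:names:21364"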

In order to play ball with Linked Data, there are several ways forward:
  1. always refer to LSIDs in their proxied form (see above for reasons why this might not be a good idea)
  2. ensure that at least one proxy exists which can resolve LSIDs in a linked data friendly way (see bioGUID as an example)
  3. use or develop linked data clients that understand LSIDs (e.g., http://linkeddata.uriburner.com/, see this view of urn:lsid:zoobank.org:pub:2C6BD020-B54A-4119-9693-3231C9FCEFA6)
2 and 3 already exist, so I'm not so keen on 1.

For me this is one of the biggest hurdles to using HTTP URIs as identifiers -- I have to choose one. As an analogy, I can identify a book using an ISBN (say, 0226644677). How do I represent this in RDF? Well, I could use an HTTP URI, say http://www.amazon.com/Tangled-Trees-Phylogeny-Cospeciation-Coevolution/dp/0226644677/, or maybe http://www.worldcat.org/isbn/0226644677. There are many, many I could choose from. However, so long as I know that the ISBN is 0226644677, I'm free to use whatever URI best suits my needs. So, what I really want is the ISBN by itself.

Imagine, for example, a publisher such as PLoS or Magnolia Press (publisher of Zootaxa), both of which have recently published taxonomic papers containing LSIDs (e.g., doi:10.1371/journal.pone.0001787). They might want to display LSIDs linked to their own LSID resolver that embellishes the metadata with information they have (e.g., they might wish to highlight links to other content that they host). In a sense this is much the same idea as supported by OpenURL COinS, where OpenURL-format metadata is embedded in an HTML document and the user chooses what resolver to use to resolve the links (including tools such as Zotero).

Having LSIDs prefixed with an HTTP proxy makes these tasks a little harder.

Saturday, April 11, 2009

LSIDs, HTTP URI, Linked Data, and bioGUID

The LSID discussion has flared up (again) on the TDWG mailing lists. This discussion keeps coming around (I've touched on it here and here); this time it was sparked by the LSID SourceForge site being broken (the part where you get the code is OK). Some of the issues being raised include:
  • Nobody uses LSIDs except the biodiversity informatics crowd, have we missed something?
  • LSIDs don't play nice with the Linked Data/Semantic Web world, which is much bigger than us
  • If we adopt HTTP URIs, will this send the wrong message to data providers (LSIDs imply a commitment to persistence, URLs don't)?
  • The community has invested a lot in LSIDs, it's too late to change course now
There are other issues as well, in many ways much harder, namely how to ensure adoption and long term persistence of whatever identifier technology the techies agree on.

I've been twittering (@rdmpage) about some of this, and Pierre Lindenbaum blogged about my earlier paper on testing LSIDs (doi:10.1186/1751-0473-3-2), so I decided to return to one of the original goals of my bioGUID project, namely providing a tool to resolve existing identifiers in a consistent way (see the now moribund bioGUID blog; I now blog about bioGUID here on iPhylo). One of the goals of bioGUID was to take an identifier and return RDF. I also had an underlying triple store that was populated with this RDF. After a hardware crash I took the opportunity to rebuild bioGUID from scratch, focussing on OpenURL access to literature. Now, I'm looking at LSIDs again.

The standard response to the concern that the rest of the world has gone down the HTTP URI route is to say that we can stick an HTTP proxy on the front of the LSID (e.g., http://lsid.tdwg.org/urn:lsid:indexfungorum.org:names:21364) and play ball with the Linked Data crowd, who are rapidly linking diverse data sets together:

However, sticking an HTTP proxy on an LSID isn't enough. As outlined in the document Cool URIs for the Semantic Web, we need a way of distinguishing between an HTTP URI that identifies real-world objects or concepts (such as a person or a car), and documents describing those things (put another way, if I put an HTTP URI for Angelina Jolie into a web browser, I expect to get a document describing her, not Ms Jolie herself). One solution (and the one that is gaining traction) is to use 303 redirects to make this explicit:

A client resolving a URI for a thing will get a 303 status code, telling them that the URI identifies an object. They can get the appropriate representation via content negotiation (a web browser wants HTML, a linked data browser wants RDF).
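In code, the behaviour such a proxy needs is quite small. Here is a minimal sketch using Node's http module; the document paths it redirects to are placeholders, and a real proxy (such as the bioGUID one described below) will of course differ in detail:

const http = require("http");

http.createServer((req, res) => {
  const path = req.url.slice(1);   // e.g. urn:lsid:indexfungorum.org:names:21364

  // Requests for the description documents themselves are answered directly.
  if (path.endsWith(".rdf") || path.endsWith(".html")) {
    res.end(`(serve the ${path.endsWith(".rdf") ? "RDF" : "HTML"} description here)`);
    return;
  }

  // Otherwise the URI names a thing, not a document: answer 303 See Other,
  // pointing at a description in the representation the client asked for.
  const wantsRDF = /application\/rdf\+xml/.test(req.headers.accept || "");
  res.writeHead(303, { Location: `/${path}.${wantsRDF ? "rdf" : "html"}` });
  res.end();
}).listen(8080);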

Data URIs
So, in order to get LSIDs to play ball with Linked Data we need an HTTP proxy that supports 303 redirects (as Roger Hyam pointed out). I've implemented a simple one as part of bioGUID. If you append an LSID to http://bioguid.info/ you get an HTTP URI that passes the Vapour Linked Data validator tests. For example, http://bioguid.info/urn:lsid:indexfungorum.org:names:21364 resolves to a web page in a browser, but clients that ask for RDF will get RDF. You can see the steps involved in resolving this Cool URI here. Vapour provides a nice graphical overview of the process:



The TDWG LSID proxy doesn't validate, so this is something that should be addressed.

In addition to resolving LSIDs, my service stores the resulting RDF in a triple store using ARC, and you can query this triple store using a SPARQL interface that makes use of Danny Ayers' Javascript SPARQL editor. I've a serious case of déjà vu as I've implemented this feature several times before using 3store3 (usually after much fun getting it to work). I got bored with triple stores as the bigger problem seemed to be the errors in the metadata I was harvesting, which seriously limited my ability to link different objects together (but that's another story).

Wednesday, April 08, 2009

Patenting biodiversity tools

The e-Biosphere online forum has a topic entitled Why open source code? Why not patented softwares?. This thread was started by Mauri Åhlberg, who notes that the EOL codebase is open source, and asks "why not patent software." Åhlberg has patented NatureGate (US Patent 7400295), which claims:
In the method of the invention objects can be identified on the basis of location and one or more characteristics in an improved way. The method is performed by means of a user device and a service product offering a service with which objects can be identified. In the method of the invention, the object to be identified is positioned and the position of the object is informed to the service to which the user device has connected. The user of the user device selects one or more characteristics presented by the service for an object to be identified. A message containing the position of the object to be identified and selected characteristic(s) is then sent from the user device to the service product. The service fetches information on the basis of the position of the object and the selected characteristic(s) from a database. The fetched information is presented for the user device in the form of alternative objects to be identified. The system of the invention is characterized by a...
The description seems very general, and I can't see anything that qualifies as novel, but a quick read suggests that anybody writing, say, an iPhone application to identify an organism based on where you are and what it looks like will be in trouble.

There are other patent applications in this area, such as Managing taxonomic information by the developers of uBio, which claims:
In a management of taxonomic information, a name that specifies an organism is identified. Based on the name and a database of organism names or classifications, another name that specifies the organism and that represents a link between pieces of biological identification information in the database, or a classification for the organism, is determined. Based on the other name or the classification, information associated with the organism is identified.


Now, I'm clearly no expert on software patents, and this is a contentious area, but this strikes me as a little worrying (worst case scenario, biodiversity informatics becomes a victim of patent trolls). One could argue that patents can be used defensively (i.e., to ensure that a technology is not claimed by another, say commercial, party who then limits access based on cost), but I'd like to see a little more discussion of these issues by the biodiversity community.

e-Biosphere Challenge


The e-Biosphere meeting in London June 1-3 has announced The e-Biosphere 09 Informatics Challenge:

Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:
  • Capturing data in private and public databases
  • Conducting quality assurance on the data by automated validation and/or peer review
  • Indexing, linking and/or automatically submitting the new data records to other relevant databases
  • Integrating the data with other databases and data streams
  • Making these data available to relevant audiences
  • Make the data and links to the data widely accessible and
  • Offering interfaces for users to query or explore the data.

The "real time" aspect of the challenge seems a bit forced. I think they originally wanted a "live demo", but now they seem to be happy with a demo that unfolds over the three day meeting, without necessarily literally taking three days (what the organisers term "cooking shows"). I also think cash prizes would have been a good idea (the web site simply says "there will be prizes"). It's not the cash itself that matters, it's the fact that it indicates that the organisers are serious about wanting to attract entries. Entrants are likely to invest more time than they'd recoup in cash.

In any event, given that challenges are a great way to focus the mind on a deadline, I'll be entering the wiki of taxonomic names that I've been working on.