Tuesday, July 31, 2007

RDF-ing molecular space

RDF might be the solution we are looking for to get a grip on the huge amount of information we are facing. Microformats and RDFa are just steps along the way, and Gleaning Resource Descriptions from Dialects of Languages (GRDDL) might be an important tool to get the web RDF-ied.

One important aspect of RDF is that any resource has a unique URI. These may look like a URL or even like urn:doi:10.1186/1471-2105-8-59. The recent blog posts by Pierre (URL +1, LSID -1) and Roderic (Rethinking LSIDs versus HTTP URI) illustrate the pros and cons of the different alternatives.

As usual, the bioinformaticians are less conservative than the chemists and ahead of them in trying new options, and several interesting websites have emerged. For example, bioGUID bridges the gap between a simple URI and a resolvable URL. And, importantly, it spits out RDF. This is the output for one article (DOI 10.1109/MIS.2006.62):
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href=""?>
<rdf:RDF xmlns:bioguid=""
<rdf:Description rdf:about="">
<rdf:type rdf:resource=""/>
<rdfs:comment>Generated by transforming XML returned by CrossRef's
OpenURL service.</rdfs:comment>
<dc:title>The Semantic Web Revisited</dc:title>

<dc:identifier rdf:resource="doi:10.1109/MIS.2006.62"/>
<rdfs:comment>info URI scheme</rdfs:comment>
<dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
<rdfs:comment>CrossRef resolver</rdfs:comment>
<prism:publicationName>IEEE Intelligent Systems</prism:publicationName>


(BTW, interesting is the use of XSLT to create HTML; it's doing the opposite of GRDDL! And this is probably the right way. Cheers Roderic!)
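
The bioGUID output above lists the same DOI under more than one URI scheme, and the urn:doi: form shown earlier is a third. A minimal Python sketch of how these forms relate, assuming only the three prefixes that appear in this post (the function names are made up for illustration):

```python
# Prefixes taken from the identifiers quoted in this post.
PREFIXES = ("urn:doi:", "info:doi/", "doi:")


def bare_doi(identifier: str) -> str:
    """Strip a known URI prefix and return the bare DOI."""
    for prefix in PREFIXES:
        if identifier.lower().startswith(prefix):
            return identifier[len(prefix):]
    raise ValueError(f"not a recognised DOI URI: {identifier!r}")


def doi_uris(doi: str) -> dict:
    """Return the same DOI written in each of the three URI styles."""
    return {
        "doi": f"doi:{doi}",
        "info": f"info:doi/{doi}",
        "urn": f"urn:doi:{doi}",
    }
```

Calling bare_doi() on any of the three forms gives back the same string, which is the point: they all name one resource, they just differ in which of them a given tool can resolve.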

I wanted something similar for molecules. The unique identifier is the InChI, of course. The InChI itself is not a proper URI, so I set up a webpage to work around that (if only I had realized this some time ago, I would have urged IUPAC to use the prefix 'inchi:' instead of 'InChI='). I do not use an XSLT for the result yet, but will do so shortly. The RDF looks like:



<pubchem:cid xmlns:pubchem="">297</pubchem:cid>
<pubchem:name xmlns:pubchem="">methane</pubchem:name>
<cb:discussedBy xmlns:cb=""></cb:discussedBy>
<cb:discussedBy xmlns:cb=""></cb:discussedBy>
<cb:discussedBy xmlns:cb=""></cb:discussedBy>
<cb:discussedBy xmlns:cb=""></cb:discussedBy>
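
One way to turn an InChI into something that works as a URI is to percent-encode it into the query string of a web page. The sketch below is in Python rather than the PHP the site uses, and the base URL and the parameter name "inchi" are hypothetical placeholders, not the actual address of the service:

```python
from urllib.parse import parse_qs, quote, urlsplit

# Hypothetical base URL; the real service address is not assumed here.
BASE = "http://example.org/rdf/"


def inchi_to_uri(inchi: str, base: str = BASE) -> str:
    """Wrap an InChI in a resolvable URI by percent-encoding it."""
    if not inchi.startswith("InChI="):
        raise ValueError("an InChI must start with 'InChI='")
    # safe="" also encodes '/', so the InChI layers cannot be
    # mistaken for path segments.
    return f"{base}?inchi={quote(inchi, safe='')}"


def uri_to_inchi(uri: str) -> str:
    """Recover the InChI from a URI built by inchi_to_uri()."""
    values = parse_qs(urlsplit(uri).query).get("inchi")
    if not values:
        raise ValueError("no 'inchi' parameter in URI")
    return values[0]
```

For methane, inchi_to_uri("InChI=1/CH4/h1H4") yields a URI ending in ?inchi=InChI%3D1%2FCH4%2Fh1H4, and uri_to_inchi() gives the original string back.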



The system uses PHP to create the output and has a basic plugin system: a plugin simply emits an RDF fragment for the given InChI. At this moment it only has a plugin for Cb, but I plan a few more. It needs some tuning, and any and all feedback is most welcome. Note that the actual URI might change a bit.
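
The plugin idea can be sketched roughly as follows. This is Python, not the PHP of the actual system, and everything named here is an assumption made for illustration: the cb namespace URI, the lookup table of posts, and the choice to put the link in an rdf:resource attribute.

```python
from xml.sax.saxutils import quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CB_NS = "http://example.org/cb#"  # hypothetical namespace URI

# Made-up data standing in for whatever a real plugin would query.
HYPOTHETICAL_POSTS = {
    "InChI=1/CH4/h1H4": [
        "http://example.org/blog/post-1",
        "http://example.org/blog/post-2",
    ],
}

PLUGINS = []


def plugin(func):
    """Register a function that maps an InChI to RDF fragments."""
    PLUGINS.append(func)
    return func


@plugin
def cb_plugin(inchi):
    """Emit one cb:discussedBy element per known blog post."""
    return [
        f"<cb:discussedBy xmlns:cb={quoteattr(CB_NS)} "
        f"rdf:resource={quoteattr(url)}/>"
        for url in HYPOTHETICAL_POSTS.get(inchi, [])
    ]


def render(inchi, about):
    """Collect the fragments of all plugins into one RDF document."""
    fragments = [frag for p in PLUGINS for frag in p(inchi)]
    body = "\n".join("    " + frag for frag in fragments)
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f"<rdf:RDF xmlns:rdf={quoteattr(RDF_NS)}>\n"
        f"  <rdf:Description rdf:about={quoteattr(about)}>\n"
        f"{body}\n"
        "  </rdf:Description>\n"
        "</rdf:RDF>\n"
    )
```

Adding a data source then means writing one more function decorated with @plugin; render() does not need to change.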

Update: synchronized URI with implementation.