Tuesday, July 31, 2007

RDF-ing molecular space

RDF might be the solution we are looking for to get a grip on the huge amount of information we are facing. Microformats and RDFa are just solutions along the way, and Gleaning Resource Descriptions from Dialects of Languages (GRDDL) might be an important tool to get the web RDF-ied.

One important aspect of RDF is that any resource has a unique URI. These may look like a URL or even like urn:doi:10.1186/1471-2105-8-59. The recent blogs by Pierre (URL +1, LSID -1) and Roderic (Rethinking LSIDs versus HTTP URI) illustrate the pros and cons of the different alternatives.

bioGUID
As usual, the bioinformaticians are less conservative than chemists and ahead of them in trying new options, and several interesting websites have emerged. For example, bioGUID makes the bridge between a simple URI and a resolvable URL. And, importantly, it spits out RDF. This is the output for http://bioguid.info/doi:10.1109/MIS.2006.62:
<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="http://bioguid.info/xsl/html.xsl"?>
<rdf:RDF xmlns:bioguid="http://bioguid.info/schema/0.1/"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:rss="http://purl.org/rss/1.0/"
xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
<rdf:type rdf:resource="http://bioguid.info/schema/0.1/Publication"/>
<rdfs:comment>Generated by transforming XML returned by CrossRef's
OpenURL service.</rdfs:comment>
<dc:creator>Shadbolt</dc:creator>
<dc:title>The Semantic Web Revisited</dc:title>
<dcterms:issued>2006</dcterms:issued>

<prism:publicationDate>2006</prism:publicationDate>
<dc:identifier rdf:resource="doi:10.1109/MIS.2006.62"/>
<rdfs:comment>info URI scheme</rdfs:comment>
<dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
<rdfs:comment>CrossRef resolver</rdfs:comment>
<rss:link>http://dx.doi.org/10.1109/MIS.2006.62</rss:link>
<prism:publicationName>IEEE Intelligent Systems</prism:publicationName>

<prism:volume>21</prism:volume>
<prism:number>3</prism:number>
<prism:startingPage>96</prism:startingPage>
<prism:issn>10947167</prism:issn>
</rdf:Description>
</rdf:RDF>

(BTW, the use of XSLT to create HTML is interesting: it does the opposite of GRDDL! And this is probably the right way. Cheers Roderic!)
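To show what a client can do with such output, here is a minimal Python sketch that reads an abbreviated copy of the bioGUID RDF shown above and lists it as subject/predicate/object triples. It uses only the standard library; the function name `triples` is made up for this illustration, and the code handles just the flat `rdf:Description` layout seen here, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Abbreviated copy of the bioGUID output shown above (a few properties only).
BIOGUID_RDF = """<rdf:RDF xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
    <dc:creator>Shadbolt</dc:creator>
    <dc:title>The Semantic Web Revisited</dc:title>
    <dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
    <prism:volume>21</prism:volume>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """Yield (subject, predicate, object) for each property of each rdf:Description.

    Hypothetical helper: it assumes the flat layout used above and is not a
    general RDF/XML parser.
    """
    root = ET.fromstring(rdf_xml)
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree stores tags as "{namespace}local"; dropping the braces
            # concatenates namespace and local name into the predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            # The object is either a resource reference or literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            yield subject, predicate, obj


for s, p, o in triples(BIOGUID_RDF):
    print(s, p, o)
```

For anything beyond this simple shape, a real RDF library is the safer choice.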

InChI
I wanted something similar for molecules. The unique identifier is the InChI, of course. The InChI itself is not a proper URI, so I set up a webpage to work around that (if only I had realized this some time ago, I would have urged IUPAC to use the prefix 'inchi:' instead of 'InChI='). The result currently looks like http://cb.openmolecules.net/rdf/rdf.php?InChI=1/CH4/h1H4. I do not use an XSLT stylesheet yet, but will do so shortly. The RDF looks like:
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:iupac="http://www.iupac.org/">

<rdf:Description
rdf:about="http://cb.openmolecules.net/rdf/?InChI=1/CH4/h1H4">

<iupac:inchi>InChI=1/CH4/h1H4</iupac:inchi>

<pubchem:cid xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">297</pubchem:cid>
<pubchem:name xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">methane</pubchem:name>
<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chemistrylabnotebook.blogspot.com/2007/04/space-final-frontier.html</cb:discussedBy>
<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=299</cb:discussedBy>
<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2006/12/smiles-cas-and-inchi-in-blogs.html</cb:discussedBy>
<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html</cb:discussedBy>

</rdf:Description>

</rdf:RDF>

The system uses PHP to create the output, and has a basic plugin system: a plugin basically spits out an RDF fragment for the given InChI. At this moment it only has a plugin for Cb, but I plan a few more. It needs some tuning, and any and all feedback is most welcome. Note that the actual URI might change a bit.
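The plugin idea can be sketched in a few lines. The real service is written in PHP and its code is not shown here, so the Python below is only an illustration under assumptions: the names `inchi_plugin`, `make_cb_plugin` and `render_rdf` are hypothetical, and the Cb plugin is backed by a plain dictionary instead of the actual Cb data. Each plugin takes an InChI and returns a list of RDF/XML fragments, which are then wrapped in one `rdf:Description`.

```python
from xml.sax.saxutils import escape, quoteattr

# URI pattern copied from the rdf:about attribute in the methane example above.
BASE_URI = "http://cb.openmolecules.net/rdf/?"


def inchi_plugin(inchi):
    """Hypothetical plugin: emit the InChI itself as an iupac:inchi property."""
    return [
        f'<iupac:inchi xmlns:iupac="http://www.iupac.org/">{escape(inchi)}</iupac:inchi>'
    ]


def make_cb_plugin(index):
    """Hypothetical Cb plugin; `index` maps an InChI to blog post URLs (assumed data)."""
    def plugin(inchi):
        return [
            '<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">'
            f"{escape(url)}</cb:discussedBy>"
            for url in index.get(inchi, [])
        ]
    return plugin


def render_rdf(inchi, plugins):
    """Ask every plugin for its fragments and wrap them in one rdf:Description."""
    fragments = [frag for plugin in plugins for frag in plugin(inchi)]
    body = "\n".join(fragments)
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">\n'
        f"<rdf:Description rdf:about={quoteattr(BASE_URI + inchi)}>\n"
        f"{body}\n"
        "</rdf:Description>\n"
        "</rdf:RDF>"
    )


if __name__ == "__main__":
    posts = {
        "InChI=1/CH4/h1H4": [
            "http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html",
        ]
    }
    print(render_rdf("InChI=1/CH4/h1H4", [inchi_plugin, make_cb_plugin(posts)]))
```

Because every fragment declares its own namespace, as the `pubchem:` and `cb:` elements do in the output above, a new plugin can be added without touching the root element.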

Update: synchronized URI with implementation.

8 comments:

  1. Egon,

    I really like what you do here.

    Just one question: What do you mean by 'RDF might be the solution' vs. 'uF and RDFa are just solutions along the way'?

    RDFa is, like RDF/XML, a serialisation of (or concrete syntax for) the RDF model. Could you pls. clarify this ...

    Cheers,
    Michael

  2. Michael,

    RDFa is indeed a way to represent triples, and close enough for machines to make the translation to pure RDF itself. I like the RDFa technology a lot, but like the combination of RDF+XSLT even more.

    But I have to agree that an RDF browser will just as easily process RDF as it will RDFa. Quite right.

  3. Egon,

    Thanks for your answer. Still I think we have a misunderstanding, here ;)

    When I hear/read RDF, I'm thinking of the model (http://www.w3.org/TR/rdf-concepts/#section-data-model), no concrete syntax involved, yet.

    So, to put it in other words: What are you referring to when you say 'pure RDF'?

    Cheers,
    Michael

  4. With pure I meant RDF-namespace only. This is not quite accurate, as we will likely not see many RDF-namespace-only documents, but the document structure is rooted in RDF and not XHTML.

    I am not sure how well RDF libraries, such as Jena and Sesame, will deal with RDFa embedded in XHTML. Nothing that cannot be easily solved, though.

    BTW, you stopped blogging in the RDFa devel corner?

  5. Hi Egon:

    Good to see you're getting some RDF religion. :)

    Need to pick you up on one point. InChI *does* have its own proper URI representation - as an "info:" URI namespace. See this post of mine from back in Feb.

    http://www.crossref.org/CrossTech/2007/02/at_last_uris_for_inchi.html

    Also, as an example of InChI URIs in use (here within an RSS feed) see this post from RSC:

    http://www.crossref.org/CrossTech/2007/05/rscs_project_prospect_v11.html

    Btw, would also like to add something about DOI and URI representation. There are two. One official and used in machine contexts is the "info:" URI namespace, e.g.

    info:doi/10.1000/1

    The other, unofficial but used in human contexts and in the CrossRef citation guidelines, is the native form

    doi:10.1000/1

    Note the latter was put up earlier for registration with IANA but was rejected at that time. It is intended to resubmit this when DOI has been through the ISO process. At this time there is no intention to register a URN namespace. So the form

    urn:doi:10.1000/1

    is *not* something that is used anywhere (and is also a non-conformant URN anyway).

    Carry on with all that RDF goodness.

    Cheers,

    Tony

  6. Tony, I am happy to hear that. Please try this:

    http://cb.openmolecules.net/rdf/?info:inchi/InChI=1/CH4/h1H4

    Works with the old URL too.

  7. The new, final (for a long time I hope) URI is:

    http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H

    being equivalent with:

    http://rdf.openmolecules.net/?InChI=1/CH4/h1H

  8. I created an SVG diagram of hydroxide which includes RDF in the SVG metadata tag: http://upload.wikimedia.org/wikipedia/commons/b/b0/Hydroxide_lone_pairs-2D.svg

    It parses nicely in the W3C's RDF validator http://www.w3.org/RDF/Validator.
    Your feedback would be appreciated. I am wondering if I should add direct links to iupac.org, pubchem and cb, like the example you have for methane, instead of dc:identifier.
