## Tuesday, July 31, 2007

### RDF-ing molecular space

RDF might be the solution we are looking for to get a grip on the huge amount of information we are facing. Microformats and RDFa are just solutions along the way, and Gleaning Resource Descriptions from Dialects of Languages (GRDDL) might be an important tool to get the web RDF-ied.

One important aspect of RDF is that any resource has a unique URI. These may look like a URL or even like urn:doi:10.1186/1471-2105-8-59. The recent blog posts by Pierre (URL +1, LSID -1) and Roderic (Rethinking LSIDs versus HTTP URI) illustrate the pros and cons of the different alternatives.

#### bioGUID

As usual, the bioinformaticians are less conservative than the chemists and ahead of them in trying new options, and several interesting websites have emerged. For example, bioGUID bridges the gap between a simple URI and a resolvable URL. And, importantly, it spits out RDF. This is the output for http://bioguid.info/doi:10.1109/MIS.2006.62:

    <?xml version="1.0" encoding="utf-8"?>
    <?xml-stylesheet type="text/xsl" href="http://bioguid.info/xsl/html.xsl"?>
    <rdf:RDF xmlns:bioguid="http://bioguid.info/schema/0.1/"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:rss="http://purl.org/rss/1.0/"
             xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
             xmlns:dcterms="http://purl.org/dc/terms/"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
        <rdf:type rdf:resource="http://bioguid.info/schema/0.1/Publication"/>
        <rdfs:comment>Generated by transforming XML returned by CrossRef's
          OpenURL service.</rdfs:comment>
        <dc:creator>Shadbolt</dc:creator>
        <dc:title>The Semantic Web Revisited</dc:title>
        <dcterms:issued>2006</dcterms:issued>
        <prism:publicationDate>2006</prism:publicationDate>
        <dc:identifier rdf:resource="doi:10.1109/MIS.2006.62"/>
        <rdfs:comment>info URI scheme</rdfs:comment>
        <dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
        <rdfs:comment>CrossRef resolver</rdfs:comment>
        <rss:link>http://dx.doi.org/10.1109/MIS.2006.62</rss:link>
        <prism:publicationName>IEEE Intelligent Systems</prism:publicationName>
        <prism:volume>21</prism:volume>
        <prism:number>3</prism:number>
        <prism:startingPage>96</prism:startingPage>
        <prism:issn>10947167</prism:issn>
      </rdf:Description>
    </rdf:RDF>

(BTW, interesting is the use of XSLT to create HTML; it's doing the opposite of GRDDL! And this is probably the right way. Cheers Roderic!)
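
To give an idea of how little is needed to consume output like bioGUID's, here is a minimal Python sketch that pulls the literal-valued properties out of such an RDF/XML document as (subject, predicate, value) triples. It is only an illustration: the sample is a trimmed copy of the output above, `literal_triples` is a name I made up, and a plain XML parser only copes with this flat `rdf:Description` layout. A real RDF library would be the safer choice for arbitrary RDF/XML.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the bioGUID output (a subset of the properties only).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/">
  <rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
    <dc:creator>Shadbolt</dc:creator>
    <dc:title>The Semantic Web Revisited</dc:title>
    <prism:publicationName>IEEE Intelligent Systems</prism:publicationName>
    <prism:volume>21</prism:volume>
  </rdf:Description>
</rdf:RDF>"""


def literal_triples(rdf_xml):
    """Return (subject, predicate URI, literal) for each text-valued property.

    Assumes the flat layout used above: rdf:Description elements directly
    under rdf:RDF, each property a direct child. Properties that point to
    a resource via rdf:resource carry no text and are skipped.
    """
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for prop in desc:
            text = (prop.text or "").strip()
            if not text:
                continue
            # ElementTree tags look like "{namespace}localname"; joining the
            # two parts gives the predicate URI.
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, text))
    return triples


if __name__ == "__main__":
    for triple in literal_triples(SAMPLE):
        print(triple)
```

The point is mostly that the subject comes from `rdf:about` and each child element contributes one predicate and one value, which is exactly the triple model RDF is about.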

#### InChI

I wanted something similar for molecules. The unique identifier is the InChI, of course. The InChI itself is not a proper URI, so I set up a webpage to work around that (if only I had realized this some time ago, I would have urged IUPAC to use the prefix 'inchi:' instead of 'InChI='). The result currently looks like http://cb.openmolecules.net/rdf/rdf.php?InChI=1/CH4/h1H4. I do not use an XSLT stylesheet yet, but will do so shortly. The RDF looks like:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:iupac="http://www.iupac.org/">
      <rdf:Description rdf:about="http://cb.openmolecules.net/rdf/?InChI=1/CH4/h1H4">
        <iupac:inchi>InChI=1/CH4/h1H4</iupac:inchi>
        <pubchem:cid xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">297</pubchem:cid>
        <pubchem:name xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">methane</pubchem:name>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chemistrylabnotebook.blogspot.com/2007/04/space-final-frontier.html</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=299</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2006/12/smiles-cas-and-inchi-in-blogs.html</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html</cb:discussedBy>
      </rdf:Description>
    </rdf:RDF>

The system uses PHP to create the output, and has a basic plugin system: a plugin basically spits out an RDF fragment for the given InChI. At this moment it only has a plugin for Cb, but I plan a few more. It needs some tuning, and any and all feedback is most welcome. Note that the actual URI might change a bit.
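
The plugin idea can be sketched in a few lines. Note that this is a Python illustration of the principle and not the actual PHP code behind the service: the function names, the `DISCUSSED_BY` lookup table, and the way fragments are glued together are all assumptions made for the example. Each plugin takes an InChI and returns zero or more RDF/XML property fragments, and the main script wraps whatever it gets in one `rdf:Description`.

```python
from xml.sax.saxutils import escape, quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical stand-in for the Cb data; the real service would query
# its own index of blog posts instead of a hard-coded table.
DISCUSSED_BY = {
    "InChI=1/CH4/h1H4": [
        "http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html",
    ],
}


def inchi_plugin(inchi):
    """Plugin that simply states the InChI itself."""
    return ['<iupac:inchi xmlns:iupac="http://www.iupac.org/">%s</iupac:inchi>'
            % escape(inchi)]


def cb_plugin(inchi):
    """Plugin that lists blog posts discussing the molecule (may be empty)."""
    return ['<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">%s</cb:discussedBy>'
            % escape(url) for url in DISCUSSED_BY.get(inchi, [])]


# Adding a data source means adding one function to this list.
PLUGINS = [inchi_plugin, cb_plugin]


def render_rdf(inchi, plugins=PLUGINS):
    """Collect the fragments of all plugins into one RDF/XML document."""
    about = "http://cb.openmolecules.net/rdf/?" + inchi
    lines = [
        '<rdf:RDF xmlns:rdf="%s">' % RDF_NS,
        "  <rdf:Description rdf:about=%s>" % quoteattr(about),
    ]
    for plugin in plugins:
        lines.extend("    " + fragment for fragment in plugin(inchi))
    lines += ["  </rdf:Description>", "</rdf:RDF>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_rdf("InChI=1/CH4/h1H4"))
```

Letting each fragment declare its own namespace, as the output above does, keeps the plugins independent of each other: the main script does not need to know which vocabularies a plugin uses.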

Update: synchronized URI with implementation.

1. Egon,

I really like what you do here.

Just one question: What do you mean by 'RDF might be the solution' vs. 'uF and RDFa are just solutions along the way'?

RDFa is, like RDF/XML, a serialisation of (or concrete syntax for) the RDF model. Could you pls. clarify this ...

Cheers,
Michael

2. Michael,

RDFa is indeed a way to represent triples, and close enough for machines to make the translation to pure RDF itself. I like the RDFa technology a lot, but like the combination of RDF+XSLT even more.

But I have to agree that an RDF browser will just as easily process RDF as it will RDFa. Quite right.

3. Egon,

Thanks for your answer. Still I think we have a misunderstanding, here ;)

When I hear/read RDF, I'm thinking of the model (http://www.w3.org/TR/rdf-concepts/#section-data-model), no concrete syntax involved, yet.

So, to put it in other words: What are you referring to when you say 'pure RDF'?

Cheers,
Michael

4. With pure I meant RDF-namespace only. This is not quite accurate, as we will likely not see many RDF-namespace-only documents, but the document structure is rooted in RDF and not XHTML.

I am not sure how well RDF libraries, such as Jena and Sesame, will deal with RDFa embedded in XHTML. Nothing that cannot be easily solved, though.

BTW, you stopped blogging in the RDFa devel corner?

5. Hi Egon:

Good to see you're getting some RDF religion. :)

Need to pick you up on one point. InChI *does* have its own proper URI representation - as an "info:" URI namespace. See this post of mine from back in Feb.

http://www.crossref.org/CrossTech/2007/02/at_last_uris_for_inchi.html

Also, as an example of InChI URIs in use (here within an RSS feed) see this post from RSC:

http://www.crossref.org/CrossTech/2007/05/rscs_project_prospect_v11.html

Btw, would also like to add something about DOI and URI representation. There are two. One official and used in machine contexts is the "info:" URI namespace, e.g.

info:doi/10.1000/1

The other, unofficial but used in human contexts and in the CrossRef citation guidelines, is the native form

doi:10.1000/1

Note the latter was put up earlier for registration with IANA but was rejected at that time. It is intended to resubmit this when DOI has been through the ISO process. At this time there is no intention to register a URN namespace. So the form

urn:doi:10.1000/1

is *not* something that is used anywhere (and is also a non-conformant URN anyway).

Carry on with all that RDF goodness.

Cheers,

Tony

6. Tony, I am happy to hear that. Please try this:

http://cb.openmolecules.net/rdf/?info:inchi/InChI=1/CH4/h1H4

Works with the old URL too.

7. The new, final (for a long time, I hope) URI is:

http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H

being equivalent with:

http://rdf.openmolecules.net/?InChI=1/CH4/h1H
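
A minimal Python sketch of how the two forms could be treated as equivalent: strip an optional `info:inchi/` prefix from the query string and keep the InChI. This is an assumption about the behaviour, not the server code, and `inchi_from_url` is a made-up name.

```python
from urllib.parse import unquote, urlsplit

INFO_PREFIX = "info:inchi/"


def inchi_from_url(url):
    """Return the InChI from either query form, or raise ValueError."""
    # The whole query string is the identifier; there is no key=value pair
    # to parse, so the raw query is used after percent-decoding.
    query = unquote(urlsplit(url).query)
    if query.startswith(INFO_PREFIX):
        query = query[len(INFO_PREFIX):]
    if not query.startswith("InChI="):
        raise ValueError("no InChI found in query: %r" % query)
    return query


if __name__ == "__main__":
    # URLs copied as written in the comment above.
    a = inchi_from_url("http://rdf.openmolecules.net/?info:inchi/InChI=1/CH4/h1H")
    b = inchi_from_url("http://rdf.openmolecules.net/?InChI=1/CH4/h1H")
    print(a == b)
```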

8. I created an SVG diagram of hydroxide which includes RDF in the SVG metadata tag: http://upload.wikimedia.org/wikipedia/commons/b/b0/Hydroxide_lone_pairs-2D.svg

It parses nicely in the W3C's RDF validator http://www.w3.org/RDF/Validator.
Your feedback would be appreciated. I am wondering if I should add direct links to iupac.org, pubchem and cb, like the example you have for methane, instead of dc:identifier.