
Sunday, December 10, 2006

Including SMILES, CML and InChI in blogs

The blogs ChemBark and KinasePro have been discussing the use of SMILES, CML and InChI in Chemical Blogspace (with 70 chemistry blogs now!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.
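To make "putting the CML code into the HTML" a bit more concrete, here is a minimal Python sketch that builds a small CML fragment for ethanol, which could then be pasted into the HTML source of a post. The function name, the atom ids and the choice to leave out hydrogens are my own assumptions for illustration; it is not the markup from the cited article.

```python
import xml.etree.ElementTree as ET

# The CML schema namespace; the "cml" prefix is just a readable choice.
CML_NS = "http://www.xml-cml.org/schema"


def ethanol_cml() -> str:
    """Return a CML <molecule> fragment for ethanol (heavy atoms only)."""
    ET.register_namespace("cml", CML_NS)
    mol = ET.Element(f"{{{CML_NS}}}molecule", {"id": "ethanol"})

    # Hypothetical atom ids a1..a3; hydrogens are omitted for brevity.
    atoms = ET.SubElement(mol, f"{{{CML_NS}}}atomArray")
    for atom_id, element in (("a1", "C"), ("a2", "C"), ("a3", "O")):
        ET.SubElement(atoms, f"{{{CML_NS}}}atom",
                      {"id": atom_id, "elementType": element})

    # Two single bonds: C-C and C-O.
    bonds = ET.SubElement(mol, f"{{{CML_NS}}}bondArray")
    for pair in ("a1 a2", "a2 a3"):
        ET.SubElement(bonds, f"{{{CML_NS}}}bond",
                      {"atomRefs2": pair, "order": "1"})

    return ET.tostring(mol, encoding="unicode")


if __name__ == "__main__":
    print(ethanol_cml())
```

Whether a given blog platform leaves such foreign XML untouched in the post body is something to check per platform.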

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and it has the advantage over InChI of being much more readable. Chris wondered on the KinasePro blog how to tag SMILES, while Paul asked the same on ChemBark about CAS numbers.

Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic to index the literature, websites and conferences they discuss. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of <span class="chemicalcompound">aspirin</span>.

And this is the way SMILES, CAS numbers and InChIs can be tagged on blogs. The <span> element is HTML code to mark a stretch of similar content, which can, among many other things, be formatted differently from the surrounding text. However, it can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
  • for SMILES: <span class="smiles">CCO</span>
  • for CAS registry numbers: <span class="casnumber">50-00-0</span>
  • for InChI: <span class="inchi">InChI=1/CH4/h1H4</span>
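To illustrate why this cheap markup is useful, here is a rough Python sketch of how an aggregator might pick these tagged spans out of a blog post. The class and function names are hypothetical, and the sketch assumes the tagged spans are not nested inside each other; it is not how PostGenomic or any existing aggregator actually does it.

```python
from html.parser import HTMLParser

# The three class names suggested above.
CHEM_CLASSES = {"smiles", "casnumber", "inchi"}


class ChemSpanParser(HTMLParser):
    """Collect (class name, text) pairs from <span class="..."> elements."""

    def __init__(self):
        super().__init__()
        self.found = []    # results, in document order
        self._stack = []   # one entry per open <span>: matched class or None
        self._buffer = []  # text pieces of the chemistry span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in CHEM_CLASSES), None)
        self._stack.append(match)
        if match:
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that sits directly inside a chemistry span.
        if self._stack and self._stack[-1]:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        match = self._stack.pop()
        if match:
            self.found.append((match, "".join(self._buffer).strip()))
            self._buffer = []


def extract_chem_spans(html_text):
    """Return a list of (class name, text) for all tagged spans."""
    parser = ChemSpanParser()
    parser.feed(html_text)
    parser.close()
    return parser.found
```

Feeding it a post that contains the three example spans above would give back the SMILES, CAS number and InChI together with the class that tagged each of them.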

The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
  • for SMILES: <span class="chem:smiles">CCO</span>
  • for CAS registry numbers: <span class="chem:casnumber">50-00-0</span>
  • for InChI: <span class="chem:inchi">InChI=1/CH4/h1H4</span>

These require you to declare the namespace xmlns:chem="http://www.blueobelisk.org/chemistryblogs/" somewhere in the page, though. Formally, the URN for this namespace still needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.
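What the namespace buys you can be sketched in a few lines of Python: a reader that sees the xmlns:chem declaration can expand a prefixed class such as chem:smiles into a full, unambiguous identifier. This is only an illustration of the idea under my own simplifications (prefix declarations are treated as page-wide, tagged spans are assumed not to nest, and the names are hypothetical); it is not a conforming RDFa processor.

```python
from html.parser import HTMLParser


class PrefixedClassParser(HTMLParser):
    """Expand prefixed span classes (e.g. chem:smiles) using xmlns:* attributes."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}    # prefix -> namespace URI
        self.terms = []       # (expanded identifier, text) in document order
        self._current = None  # expanded identifier of the span being read
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Remember every xmlns:prefix="uri" declaration we come across.
        for name, value in attrs.items():
            if name.startswith("xmlns:") and value:
                self.prefixes[name[len("xmlns:"):]] = value
        if tag != "span":
            return
        for cls in (attrs.get("class") or "").split():
            prefix, sep, local = cls.partition(":")
            if sep and prefix in self.prefixes:
                self._current = self.prefixes[prefix] + local
                self._buffer = []
                break

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self.terms.append((self._current, "".join(self._buffer).strip()))
            self._current = None
            self._buffer = []


def expand_terms(html_text):
    """Return (expanded identifier, text) pairs for all prefixed spans."""
    parser = PrefixedClassParser()
    parser.feed(html_text)
    parser.close()
    return parser.terms
```

With the namespace above declared on an enclosing element, the span for CCO would come out labelled http://www.blueobelisk.org/chemistryblogs/smiles, which cannot be confused with someone else's unqualified 'smiles'.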

9 comments:

  1. Hi,

    Tagged SMILES added as you suggested, many thanks

    http://homepage.mac.com/swain/Sites/CMC/News/files/ab724be749104d0ef80af00d3db8cc63-8.html

  2. Hi Egon,

    This is all very interesting. :)

    I know nothing about chemistry, but I think I can see enough about what you are trying to do to say that RDFa is way out in front for your needs...contrary to your view that:

    "[T]his [using RDFa] is more advanced, and currently does not have practical advantages over the use of microformats."

    The first advantage is that terms are namespace qualified. There are only two situations I can think of where this would be of no use to you, and both of them seem very unlikely to me.

    The first is that your blogs, and your colleagues' blogs, only ever talk about chemistry, i.e., you won't ever want to use terms from other disciplines. The second, is that no-one else will ever use your terms.

    The former seems very unlikely to me, since your terms would surely be useful in blogs about medicine and physics at the very least? On the latter, do you really think that no other discipline or organisation will use the unqualified term 'smiles'? :) If they do, then your search results will go back to being as bad as they probably are now.

    (Like I said, I know nothing of chemistry, but I'm assuming that a Google search for "acetic acid" is pointless for you guys...is that right?)

    So if you were going to, how should you use namespaces? Your suggestion when illustrating RDFa is that you have one prefix, and then use @class to set 'types'

    "* SMILES: <span class="chem:smiles">CCO</span>
    * for CAS registry numbers: <span class="chem:casnumber">50-00-0</span>
    * for InChI: <span class="chem:inchi">InChI=1/CH4/h1H4</span>"

    However, in the world of RDF, you generally want each organisation to 'own' its own terms, or its own taxonomies. So I would have thought you'd want to have terms like this:

    smiles:cco
    cas:50-00-0
    cid:176

    SMILES, CAS and CID would have namespace URLs of their own, and each organisation would define its terms.

    Using the last one as an example ('cid:176'), in RDFa you would be able to use such a term like this:

    I had run out of <span content="cid:176">acetic acid</span>, but luckily Egon had some <span content="cid:176">Natriumacetat</span>.

    (I told you I know nothing of chemistry...I just perused the NCBI site to make up some examples! But I hope the point being made is clear.)

    Note that I've used the attribute @content, which comes from <meta> but is made more widely available by RDFa. This allows you to give a precise term for whatever is in the element, and the classic example we usually give is something like:

    Today the <span content="people:TonyBlair">Prime Minister</span> flew to the <span content="country:usa">US</span> for talks.

    But whilst RDFa can be very simple (as these examples show), it also opens up the possibility of providing far more metadata. For example, the NCBI site could embed RDFa in its pages, and then an RDFa processor could pluck out the information and make use of it:

    <div about="cid:176">Acetic acid has the same parent compound as <a rel="ncbi:parent" href="[cid:11954357]">Cupric acetate</a>.</div>

    (I have no idea whether that's completely non-sensical. :) I just tried to find some relationships between items to illustrate the point.)

    Regards,

    Mark

  3. Hi Mark,

    Thanx for your comments. I am now actively promoting RDFa to do this semantic chemistry and have extended an aggregator to detect those semantics:

    http://wiki.cubic.uni-koeln.de/cb/inchis.php

  4. Hi Egon,

    Very nice work. What if I don't want the InChI to actually be visible on the page, but just indexed by Google, CB, etc.?

  5. Richard,

    some people put the InChI in the @alt attribute when giving an image, but this lacks semantics.

    To get it indexed by Google, you might want to try to put it in a keyword in the header:

    <META NAME="keywords" CONTENT="InChI=1/bla, InChI=1/bla2, etc">

    Never tried that myself yet, though...

    PS. your blogger profile is not public, so I can't see to which blog the name 'Richard' belongs...

  6. The XMLNS isn't working anymore (error 404)

  7. <span class="chemicalcompound">aspirin</span>
    isn't such a strange idea,
    but putting it in line with the others gives something like:
    <span class="chem:compound">aspirin</span>.

  8. Ramoonus: the NS never actually resolved to a webpage. Namespaces are URIs, not URLs.

    About chemicalcompound: yes, better to use chem:compound there, which would be namespaced.

    I'm posting some more examples in my blog right now.
