Thursday, May 03, 2007

Cb comments for InChI's

About a year ago Pedro wrote a Greasemonkey script to add blog comments to the tables of contents of scientific journals. Noel extended it with support for Chemical blogspace (see also this earlier item). Now, the latter website is maintained by me, and I extended the aggregator software with molecule support, for example to show hot molecules on the front page. (At some point my patches will be backported into mainstream. Euan, why not invite me to London HQ in, say, June?)

So, if we can show comments from the blogosphere for journal articles, why can't we do that for molecules too? Sure we can. It just needs some hacking. Right, and I did that today. The script works for PubChem:

It works for any <a href> element with a URL to PubChem like [InChI]. BTW, while the URL is not very readable, this might actually be a good way to hide InChI's, though I am sure Google will not index this InChI either.
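As a rough sketch of the link handling (not the actual user script): the function below takes an href, decodes it, and looks for something that starts with "InChI=". The function name, the example URL and the character class in the regular expression are assumptions made for illustration; the real PubChem URL layout is not reproduced here, and the sketch does not check the host of the link, which a real script would.

```javascript
// Approximate pattern for an InChI once the URL has been decoded.
// Assumption: version prefix "1" (optionally "1S") and a conservative
// whitelist of characters; this is not a full InChI grammar.
const INCHI_IN_HREF = /InChI=1S?\/[A-Za-z0-9.,;()+\-*?\/]+/;

// Return the first InChI-looking substring in an href, or null.
function inchiFromHref(href) {
  let decoded;
  try {
    // InChI's in URLs are normally percent-encoded ("/" becomes "%2F").
    decoded = decodeURIComponent(href);
  } catch (e) {
    return null; // malformed percent-encoding, nothing to extract
  }
  const match = decoded.match(INCHI_IN_HREF);
  return match ? match[0] : null;
}
```

In a user script this would be called with the href of every <a> element on the page, and only links that yield an InChI would get a comment lookup.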

And it also works for semantically marked up InChI's (using either microformats or RDFa):

You'll notice here that it is friendly with my Sechemtic script to make links to Google and PubChem.
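Picking up marked-up InChI's is simpler than parsing links, because the markup already says where they are. A minimal sketch, assuming the InChI is the text content of an element with class "inchi" (microformat style) or with property="chem:inchi" (RDFa style); both names and the function name are assumptions for illustration, not necessarily what the script looks for:

```javascript
// Collect InChI's from semantically marked-up elements below `root`.
// `root` is anything with querySelectorAll, e.g. `document` in a user script.
// Assumption: the class name and RDFa property below are hypothetical.
function collectMarkedUpInchis(root) {
  const selector = '.inchi, [property="chem:inchi"]';
  const found = [];
  for (const el of root.querySelectorAll(selector)) {
    const text = (el.textContent || "").trim();
    // Keep only real-looking values and skip duplicates.
    if (text.startsWith("InChI=") && !found.includes(text)) {
      found.push(text);
    }
  }
  return found;
}
```

Because the function only reads the page, it should not interfere with other user scripts that decorate the same elements.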

The tools to make this happen involve a new Greasemonkey script (based on Noel's code), and a few patches to the software. The user script can be downloaded here. An entry on the Blue Obelisk userscript page will follow; check that page for more goodies.


  1. (1) Would it be possible to pattern match against an InChI so that it wouldn't need to be marked up?

    (2) I note that Google searches just ignore punctuation, apparently replacing any punctuation in your search by a space. This suggests that Google-friendly InChIs would need to replace the '/' by an '_' or something.

  2. ad 1: yes, but I do not think it would be useful. The first version I wrote worked like this, but InChI's are rarely put as full text in HTML. Really rarely. It turned out that this gave too many false positives.

    ad 2: on the InChI mailing list there has been some discussion on making InChI's more friendly to other technologies, e.g. by agreeing on a well-defined way to put them on the web. No convergence whatsoever, unfortunately.

  3. Nice. Also, consider yourself invited (let me know what dates you're in the UK?).

  4. (1) I would be surprised at high false positives for InChIs. I suppose this depends on how specific you can make the regular expression; e.g. if you include the word "InChI:" itself, there should be no false positives, right? I certainly do get false positives for PDB codes and am considering something like searching the page for the word "PDB", "protein", or "enzyme", etc., before adding a link to First Glance In Jmol.

    (2) It would be really nice to have an image of the molecule pop up. Don't you have these images at CB?

  5. Noel, it's really just a problem with line breaking the InChI's in one of the layers, which results in part-InChI's being grepped.

    About those images... I'll work on that.

  6. What are you talking about?? Google never indexes InChI. Even if your links are right and use "a href", that is no guarantee of indexing. Google is too finicky, dude :)
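The pattern-matching idea from comments 1, 2, 4 and 5 can be sketched as follows. This is a minimal illustration, not the first version of the script mentioned above: it anchors on the literal "InChI=" prefix, uses an approximate character whitelist, and strips trailing sentence punctuation. The function name and the regular expression are assumptions. The second call shows the line-breaking problem: when the InChI is wrapped in the middle of a layer, only a part-InChI is matched.

```javascript
// Approximate InChI pattern for plain text; not a full InChI grammar.
const INCHI_IN_TEXT = /InChI=1S?\/[A-Za-z0-9.,;()+\-*?\/]+/g;

// Return all InChI-looking substrings in a piece of text.
function findInchisInText(text) {
  const matches = text.match(INCHI_IN_TEXT) || [];
  // Heuristic: a trailing ".", "," or ";" is more likely sentence
  // punctuation than part of the InChI.
  return matches.map((m) => m.replace(/[.,;]+$/, ""));
}

// Ethanol written on one line: the full InChI is found.
const whole = findInchisInText("Ethanol is InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3.");

// The same InChI broken across two lines: only the first part is found.
const broken = findInchisInText("InChI=1/C2H6O/c1-2-3/\nh3H,2H2,1H3");
```

Here `whole` contains the complete ethanol InChI, while `broken` contains only "InChI=1/C2H6O/c1-2-3/", which would look up the wrong (or no) molecule.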