## Wednesday, January 30, 2008

### Why chemistry-rich RSS feeds matter...

Peter wrote up an item on the RSS feed of Nick's CrystalEye, and I have been enthusiastic about chemistry-enriched RSS feeds for some time. CMLRSS puts the chemical data inline in the RSS; see DOI:10.1021/ci034244p, the use of CMLRSS in Chemical blogspace described here and here, and the CMLRSS support in Bioclipse.

Nick's RSS feed does not put the chemistry inline, but does link to the raw CML file:

    <entry>
      <title>No title supplied</title>
      <link rel="enclosure" href="http://wwmm.ch.cam.ac.uk/crystaleye/summary//acs/inocaj/2008/3/data/ic702497x/ic702497xsup1_THP4-SINC-publ/ic702497xsup1_THP4-SINC-publ.complete.cml.xml" hreflang="en" />
      <!-- much more, that I skipped for brevity -->
    </entry>

The example shown by Peter was nicely chosen: something is wrong with it. It uncovers a bug in the pipeline that a simple agent monitoring the RSS feed could have caught. That is why this technology is important! It allows pipelining of information between services.
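As a rough idea of the first step of such an agent, here is a minimal Python sketch that pulls the enclosure links out of an Atom feed. It assumes the feed uses the standard Atom namespace; the sample feed and its `example.org` URLs are made up for illustration and are not CrystalEye's actual output.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard Atom namespace; assumed here, not checked against the real feed.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def enclosure_links(feed_xml: str) -> List[str]:
    """Return the href of every <link rel="enclosure"> found in the entries."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if link.get("rel") == "enclosure" and href:
                links.append(href)
    return links


# Hypothetical feed with one entry, shaped like the snippet above.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>No title supplied</title>
    <link rel="enclosure" href="http://example.org/data/entry1.complete.cml.xml" hreflang="en" />
    <link rel="alternate" href="http://example.org/data/entry1.html" />
  </entry>
</feed>"""

if __name__ == "__main__":
    for href in enclosure_links(SAMPLE_FEED):
        print(href)
```

Each returned link points at a CML file that the agent could then download and run checks on.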

Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)).

Done? Checked it? You saw the problem, right? Good.

I have scanned the CIF source, but that does not seem to contain the problem. The problem nicely shows a general limitation of commonly used chemoinformatics tools: the lack of proper atom typing (a problem I have been looking into for the Chemistry Development Kit; see Atom Typing in the CDK and Evidence of Aromaticity).

You will have noted that the 2D diagram in Peter's blog is charged. I checked the complete CML source for the CrystalEye entry, and that contains the charges on the two oxygens bound to the copper too. However, the copper is not charged. That leads to a rather unlikely situation: such a crystal would attract about the whole laboratory to itself in the blink of an eye, because there is nothing to balance the double negative charge! It is conveniently summarized in this bit of the CML:

    <formula formalCharge="-2" concise="C 10 H 6 Cu 1 N 4 O 4 -2">
      <atomArray elementType="C H Cu N O" count="10.0 6.0 1.0 4.0 4.0"/>
    </formula>

Now, I also checked the raw CML; that seems to be unaffected too. So, the bug must be somewhere in the software that converts the raw CML into complete CML, and it must occur before the InChI calculation, because the InChI is wrong too. An agent scanning the RSS feed would have detected this. Someone interested in writing up a grant proposal on this?
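The check itself could be as simple as the following Python sketch, which reads the `formalCharge` attribute of every `formula` element and flags any non-zero value for a human to look at. It matches elements by local name, so it does not care which CML namespace (if any) the file declares. The `<molecule>` wrapper in the sample is an assumption made to get a well-formed document, and a non-zero charge is only a hint, not proof of a bug.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def formula_charges(cml_xml: str) -> List[int]:
    """Collect the formalCharge of every <formula> element that declares one."""
    root = ET.fromstring(cml_xml)
    charges = []
    for elem in root.iter():
        charge = elem.get("formalCharge")
        if local_name(elem.tag) == "formula" and charge is not None:
            # Accept both "-2" and "-2.0" style values.
            charges.append(int(float(charge)))
    return charges


def is_suspicious(cml_xml: str) -> bool:
    """True if any formula carries a net charge that nothing balances."""
    return any(charge != 0 for charge in formula_charges(cml_xml))


# The formula from the CrystalEye entry, wrapped in a hypothetical root element.
SAMPLE_CML = """<molecule>
  <formula formalCharge="-2" concise="C 10 H 6 Cu 1 N 4 O 4 -2">
    <atomArray elementType="C H Cu N O" count="10.0 6.0 1.0 4.0 4.0"/>
  </formula>
</molecule>"""

if __name__ == "__main__":
    print(formula_charges(SAMPLE_CML), is_suspicious(SAMPLE_CML))
```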

BTW, the system is not awfully wrong: the negative charge on the acidic carboxyl groups is to be expected. But if the bonds between the oxygens and the copper had been coordinating, not covalent, and the copper had been +2, then it would have been fine. Because many chemoinformatics tools do not really support dative bonds, a covalent bond could be drawn, but then the oxygens should be uncharged... right, not? :)
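To spell out the bookkeeping, here is a small Python sketch that sums the formal charges for the three ways of writing the complex mentioned above. Only the copper and the two coordinating oxygens are listed; all other atoms are taken to be neutral, and the labels `O_a` and `O_b` are made up for this illustration.

```python
from typing import Dict

# Formal charges on the copper and the two carboxylate oxygens bound to it.
# All remaining atoms are assumed neutral and therefore left out.
MODELS: Dict[str, Dict[str, int]] = {
    "as found in the complete CML": {"Cu": 0, "O_a": -1, "O_b": -1},
    "dative Cu-O bonds, copper +2": {"Cu": +2, "O_a": -1, "O_b": -1},
    "covalent Cu-O bonds, neutral oxygens": {"Cu": 0, "O_a": 0, "O_b": 0},
}


def net_charge(charges: Dict[str, int]) -> int:
    """Sum the formal charges of all listed atoms."""
    return sum(charges.values())


if __name__ == "__main__":
    for name, charges in MODELS.items():
        print(f"{name}: net charge {net_charge(charges)}")
```

Only the first model leaves a net charge, which is exactly the formalCharge="-2" seen in the formula.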

Oh, and surely, one can do much, much more with those feeds. I blogged about that earlier in Automatic Classification of thousands of Crystal Structures.

1. Hi Egon,

I think we'll probably discuss how to fix this problem over at PMR's blog, so I won't say too much here. Though I will point out that CrystalEye creates both RSS (with enclosures linking to the CML for each entry; this is only available in the Atom versions) and CMLRSS, where the CML is inline. If you click through from here:

http://wwmm.ch.cam.ac.uk/crystaleye/feed/index.html

you'll see that each feed comes in six different versions, with there being three versions each for RSS and CMLRSS. There are almost 60,000 feeds maintained by CrystalEye at the moment!

cheers,
Nick

2. Ah, you do have CMLRSS? Excellent. Sorry for having overlooked that!

3. Egon, I've been looking at the CrystalEye Open Data in order to try and connect to it from ChemSpider, but have run into a whole series of problems that I outlined on my blog tonight (http://www.chemspider.com/blog/struggling-to-scrape-crystaleye.html). They include confusion between SMILES and InChI in stereochemistry, and InChIs generated from unit cells rather than from structures. I will look into it further in the future, but you might want to see whether the SMILES/InChI issue comes from the CDK.

4. ChemSpiderMan... nice post. Not sure if CrystalEye has a bug tracker, but this would be the time to set one up :)

Regarding SMILES/InChI in relation to stereochemistry and the CDK: according to the JavaDoc, the SMILES parser does not process stereo information [1], which could explain the information loss.

[1] http://cheminfo.informatics.indiana.edu/~rguha/code/java/nightly/api/org/openscience/cdk/smiles/SmilesParser.html

5. Just FYI, we use Bugzilla for tracking bugs for ChemSpider. It was very easy to roll out, and a slight adjustment in work procedures allowed us to fit it in. It works very well for us. It is not public. It may be released to the public in some way in the future, but with such limited resources we cannot hope to manage all the questions that would come from people wading through that system too.