Wednesday, February 09, 2011

Chemical data curation: yes, it is that bad.

The readers of Antony's blog know enough about the problem. And many in the QSAR community know it too (and many others do not). Chemical structure data is noisy. I haven't recently created a new local data set for analysis, so I have not taken time to blog about it much, but the ambiguity in chemical databases is enormous. Just yesterday, Antony and I had a good discussion about tautomers and in particular how things are linked together.

If we are in the field of property prediction, knowing what tautomer to calculate descriptors for is crucial. Not that we actually have easy access to experimental data showing what the important tautomer is for our end-point (predicted property), but at least we can track what tautomer we modeled with. Has anyone ever asked you to add units to experimental values? Like "the temperature was 279 degrees; Celsius or Kelvin??" Well, this is the exact same thing. If your QSAR model training report does not include that information, you are doing it wrong. (/me ducks)
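A minimal sketch of what that tracking could look like, assuming a plain Python training set. The class and field names (`TrainingRecord`, `modeled_structure`, and so on) are made up for illustration and do not come from any particular QSAR package:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRecord:
    """One row of a QSAR training set, with the modeled tautomer made explicit."""
    compound_id: str        # database identifier (hypothetical field name)
    modeled_structure: str  # the exact tautomer used for descriptors, e.g. as SMILES
    endpoint_value: float   # experimental value of the modeled property
    endpoint_unit: str      # unit of that value, e.g. "K" or "degC"


def missing_provenance(records):
    """Return the IDs of records that lack a modeled tautomer or a unit."""
    return [
        r.compound_id
        for r in records
        if not r.modeled_structure.strip() or not r.endpoint_unit.strip()
    ]
```

Running `missing_provenance` before training gives a list of entries to fix first, in the same spirit as refusing a temperature without a unit.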

So, why does it in fact matter? It matters simply because the calculated properties are different. Backing up to the ChemSpider example in my question about InChIs with the fixed-hydrogen layer: I noted that (like in many other databases) the synonyms seem to include IUPAC names for at least two tautomers. However, while the ChemSpider entry is, in fact, for the tautomer-independent structure (using the InChI mobile hydrogen layer; and keep in mind that the InChI uses only a limited number of heuristic rules for identifying tautomers, so it does not detect all 40 tautomers of warfarin), the 2D diagram, the 3D model, and the calculated properties reflect only one tautomer.
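To make the mobile versus fixed hydrogen distinction a bit more concrete, here is a rough sketch that inspects an InChI string with nothing but string handling. It relies on the InChI layer syntax as I understand it: layers are separated by `/`, the fixed-hydrogen layer starts with `f`, and mobile hydrogen groups show up as `(H...)` inside the `/h` layer. The function names are my own, and this is no replacement for a real InChI parser:

```python
def inchi_layers(inchi: str):
    """Split an InChI into its layers, skipping the version prefix and the formula."""
    return inchi.split("/")[2:]


def is_standard_inchi(inchi: str) -> bool:
    """Standard InChIs carry the 'S' flag and never have a fixed-hydrogen layer."""
    return inchi.startswith("InChI=1S/")


def has_fixed_h_layer(inchi: str) -> bool:
    """True if the string contains a fixed-hydrogen (/f) layer."""
    return any(layer.startswith("f") for layer in inchi_layers(inchi))


def has_mobile_h_group(inchi: str) -> bool:
    """True if the hydrogen (/h) layer lists a mobile hydrogen group such as (H,6,7)."""
    return any(
        layer.startswith("h") and "(H" in layer for layer in inchi_layers(inchi)
    )
```

An identifier with a mobile hydrogen group but no `/f` layer tells you the record is tautomer-independent, which is exactly when the depicted structure and the identifier can silently disagree.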

And calculated properties are exactly the input in QSAR's statistical modeling. It is interesting to realize that the differences in calculated molecular descriptors can be minimal, or absent altogether, but also drastic. Very drastic, in fact. The recent paper by Porter (doi:10.1007/s10822-010-9335-7) shows the 40 warfarin tautomers, and discusses a few properties, such as the pKa. The experimental pKa of warfarin is around 5. Now, the paper reports calculated pKa values for a variety of software products (AMBIT is unfortunately missing). First of all, it shows that the various tools differ, which is to be expected. But that variance is negligible compared to the effect of picking the wrong tautomer. I was impressed by the range of predicted values for the various tautomers. It ranged from about 5 to 12, across all tools. That means warfarin is predicted to be mildly acidic (some tools predict pKa's down to 2.5) to very basic! No way your statistical modeling will understand that!
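The comparison in that paragraph (variation between tools versus variation between tautomers) is simple arithmetic. A small sketch, assuming the predictions sit in a nested dictionary `predictions[tool][tautomer]`; that layout and the function names are my own choice, and no values from the paper are reproduced here:

```python
def spread(values):
    """Difference between the largest and smallest value."""
    values = list(values)
    return max(values) - min(values)


def compare_spreads(predictions):
    """Compare tool-to-tool variation against tautomer-to-tautomer variation.

    predictions[tool][tautomer] holds a predicted pKa. Only tautomers
    predicted by every tool are compared; at least one tool is required.
    Returns (largest spread between tools for a single tautomer,
             largest spread between tautomers for a single tool).
    """
    tools = list(predictions)
    tautomers = set.intersection(*(set(predictions[t]) for t in tools))
    between_tools = max(
        spread(predictions[t][x] for t in tools) for x in tautomers
    )
    between_tautomers = max(
        spread(predictions[t][x] for x in tautomers) for t in tools
    )
    return between_tools, between_tautomers
```

If the second number dwarfs the first, the choice of tautomer matters more than the choice of software, which is the point made above.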

And this is why Open Data is so important in chemistry. So, the next time Joe (Organic) Chemist bitches about computers and cheminformatics, tell him it is his own fault: he should have released his data out in the Open.

Anyway. Tautomerism was a curation issue in the first(!!!) entry I was curating. The sixth had a better-known problem, I think. I may be blind, but I would say this drug has a stereocenter:

But none of the databases I checked so far (including ChemSpider) defines the stereochemistry! I thought we settled that some decades ago? Stereochemistry of drugs matters. What is going on here? I guess I have to browse some primary literature and access some experimental data today then. If I can afford it.

Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7


  1. I think the drug is on the market as a racemate so okay drawn as is? I commented on the blog...

  2. I can deal with ambiguity. Far worse are errors such as wrong isomer, wrong structure and wrong experimental data that are common in some--to remain nameless--commercial databases.

  3. @ChemSpiderman: yes, a racemate, that's what others told me too. None of the databases told me this.

    @Rich, indeed. *I* can deal with this kind of ambiguity too. Any scientist can.

    However, my 'workflow' can't. My software is stupid: it doesn't know how to resolve that ambiguity. My QSAR training model doesn't know that it needs to create the two stereoisomers, both of which it needs to correlate to the associated activities.

    And that's the whole point here. While this may not always be clear, I tend to think from this simple algorithmic perspective.

  4. @Egon: Me too. :-)

    The problem with tautomers--as you noted--is that there isn't software that generates all of them and that the data doesn't necessarily map to a single tautomer, or even a distribution of tautomers, responsible for the reported property. This is where the trouble with tautomers starts to overlap with what to me is the broader and more nefarious problem with the data: it's too often incorrect or, stated another way, the structure reported is not responsible for the data.