Sunday, June 20, 2010

Looking at your statistical models...

I do not think I have ever blogged about the paper that played an important role in my thesis (doi:10.1021/ci990038z); the research for one of the papers in my thesis started with the hypothesis proposed therein. The paper had a really good idea, but, unfortunately, it did not contain the data to support the hypothesis. That brings me to one important lesson I learned: a QSAR data set of fewer than 100 molecules is not enough to build untargeted statistical models.

The paper reads quite nicely, and the results seem clear: by combining spectral types, the RMSEP goes down. Good! Lower prediction errors; that's what we all want. So, an M.Sc. student of mine set off, but after about half a year he was still unable to build statistically sound models. He used bootstrapping to 'prove' it was not his fault: there was not enough data for the method to learn the underlying patterns. Hence the lesson above. My student went on with larger data sets, and laid the foundation for what later became the paper on using NMR spectra in QSPR modeling (doi:10.1021/ci050282s). Now you understand why QSAR is missing from that title.
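To give an idea of what that bootstrapping exercise looks like, here is a minimal sketch in Python. Everything in it is synthetic: the data, the sample size, and the one-descriptor linear model are all stand-ins, not the descriptors or models from the paper. The point is only the mechanics of resampling a small data set and watching the out-of-bag prediction error.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Synthetic stand-in: a weak linear trend plus noise, only 40 "compounds".
n = 40
x = rng.uniform(0, 1, n)
y = 0.5 * x + rng.normal(0, 0.5, n)

# Bootstrap: refit a one-descriptor least-squares model on resampled
# training sets, then predict the out-of-bag compounds each round.
scores = []
for _ in range(500):
    idx = rng.integers(0, n, n)               # sample with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # out-of-bag test compounds
    if oob.size == 0:
        continue
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    scores.append(rmsep(y[oob], slope * x[oob] + intercept))

scores = np.array(scores)
print(f"RMSEP over bootstraps: mean {scores.mean():.3f}, spread {scores.std():.3f}")
```

With this few compounds and this much noise, the bootstrap RMSEP values scatter widely, which is exactly the kind of evidence my student used: the method cannot pin down the underlying pattern.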

So, if those results are so clear, then why does it not work? As said, the data set was too small for pattern recognition methods to see what was going on. The RMSEP numbers just came out nicely; however, had we only made the plot below, it would have warned us. But I failed to do that at the time. Lesson learned: do not just look at the data, but also look at the model. And 'look' really means looking with your eyes at graphical representations of that model. The plot:

The numbers in this plot are hidden in tables in the paper. The RMSEP values mentioned earlier are calculated from those. From the plot, you can see that the test set consisted of 5 compounds and the training set of 37; all are congeneric, and do not span a high diversity. The plot shows five models: black is CoMFA; orange is based on experimental IR spectra; red, green, and blue are models where two types of representations are combined. From the RMSEP values it can be seen that combining representations improves the RMSEP. That's what you want, and it sort of makes sense.

Now, I did not make this plot until I started writing up the paper and tried to figure out why the QSAR data set did not work. My eyes opened wide when I saw the orange dots! Anti-correlation! WTF?!?! I mean, we are looking at a plot of predicted versus experimental activity... Actually, the others are not really convincing either, are they? Looking at the predictions for compounds with experimental values around 1.0-1.5 (if you really want to know the unit, read the article), the pattern is pretty much anti-correlated too. Thinking about it, it seems the RMSEP mainly reflects the error of the leftmost compound, the one with an experimental activity of about 0.3.
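The trap is easy to reproduce with made-up numbers. The five activities below are invented, chosen only to mimic the situation in the orange model (they are not the values from the paper): the predictions run squarely against the experimental trend, yet the RMSEP still looks tolerable.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical test set of five compounds: predictions that are
# anti-correlated with experiment, but numerically not far off.
experimental = np.array([0.3, 1.0, 1.2, 1.4, 1.5])
predicted    = np.array([1.4, 1.3, 1.2, 1.1, 1.0])

r = np.corrcoef(experimental, predicted)[0, 1]
print(f"RMSEP = {rmsep(experimental, predicted):.2f}, Pearson r = {r:.2f}")
```

The correlation coefficient comes out strongly negative while the RMSEP stays modest; a single scatter plot of predicted versus experimental (e.g. with matplotlib's `plt.scatter`) exposes it instantly, which a table of RMSEP values never will.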

Clearly, the orange model is hopeless, but the others are not really better. Now, the paper actually makes statements comparing the various combinations of representations, but, in retrospect and looking at this plot, I wonder whether the green model is really different from the blue or red models.

Since then, I always make these kinds of plots, just to see what my model looks like. Since then, I distrust papers that only show RMSEP, Q2, or other quality statistics. Now, the tricky part is that you need those statistics if you want to automate model selection; but the variance on those model quality statistics is so high (see also my other post today) that you must carefully validate that model selection too, visually of course.
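That variance is easy to demonstrate. The sketch below is again fully synthetic (random data, a simple linear model, and one common convention for Q2, computed against the training-set mean; other conventions exist), but it reuses the 37-train / 5-test split sizes from the paper to show how much Q2 jumps around purely through the choice of test compounds.

```python
import numpy as np

def q2(y_true, y_pred, y_train_mean):
    """Predictive Q2 = 1 - PRESS / SS, with SS taken around the
    training-set mean (one common convention among several)."""
    press = np.sum((y_true - y_pred) ** 2)
    ss = np.sum((y_true - y_train_mean) ** 2)
    return 1.0 - press / ss

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 42)
y = 0.5 * x + rng.normal(0, 0.4, 42)

# Repeatedly split into 37 training and 5 test compounds and
# recompute Q2 for each split.
values = []
for _ in range(200):
    perm = rng.permutation(42)
    tr, te = perm[:37], perm[37:]
    slope, intercept = np.polyfit(x[tr], y[tr], 1)
    values.append(q2(y[te], slope * x[te] + intercept, y[tr].mean()))

values = np.array(values)
print(f"Q2 ranges from {values.min():.2f} to {values.max():.2f}")
```

With only five test compounds, the spread in Q2 between splits is enormous; any single number from one lucky split says very little on its own.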

I have long been thinking about what to do with these observations. I did not dare publish them in my thesis; I did not dare write a letter to the editor. Perhaps I should. But even writing this blog post makes me feel uncomfortable. Besides the fact that I might be wrong, I do not like pointing out mistakes, particularly when they are published in a respectable journal. I was fooled by the statistics too (and I was already well trained), so I cannot blame the authors for overlooking the issue. Or the reviewers! Or the community at large. Also, I do not know what the fate of this paper should be. The idea is quite interesting, even though the published results do not support it. Not shown here, but the bootstrapping results show that the apparent slight improvement is merely a numerical artifact, happening by chance through a lucky selection of the test compounds; the data set is simply too small to draw any conclusion.

Comparative Spectra Analysis (CoSA): Spectra as Three-Dimensional Molecular Descriptors for the Prediction of Biological Activities. Journal of Chemical Information and Computer Sciences, 1999, 39 (5), 861-867. DOI: 10.1021/ci990038z
Willighagen, E., Denissen, H., Wehrens, R., & Buydens, L. (2006). On the Use of 1H and 13C 1D NMR Spectra as QSPR Descriptors. Journal of Chemical Information and Modeling, 46 (2), 487-494. DOI: 10.1021/ci050282s


  1. I assume you know the article, Beware of Q2
    Alex told me that he has met quite a few angry people because he used their articles as examples, so I can see why you are a bit hesitant. I think he now picks his examples a bit more carefully :-)


  2. This is science, not happy-face. If there is an error, then it should be pointed out. If you aren't sure that there is an error, then you should send a note to the authors.

    You don't have to trumpet this from the rooftops and a nice word leading into the criticism will go miles toward making the original authors receptive about the critique, but it is really important to let people know that there seems to be a problem.

    If the authors have made a serious error, then it is incumbent on them to let the journal know. I wouldn't go on a crusade to get them to do that, however, even though few people are that principled.

  3. @Peter: sure, a very good read indeed.

    @Ted: I do not think the paper is actually wrong. The results, however, do not support (nor disprove) the conclusions. They 'hint' at the observed and proposed trend.

    Moreover, the problem merely got overlooked. Like it did by me, and by the reviewers too. Overlooking is very easy; it happens to many of us.

    The authors put forward a really interesting idea, and proposed a method to show it. As such, the paper is still very much worth reading.

    *In retrospect*, the paper could have been done better. But that is just science.

    I actually did speak with one of the co-authors about the problems I found. And, honestly, something like a letter to the editor would be overdone. There is no mistake in the paper, just a very optimistic conclusion that is not really supported by the data.

    The point of the blog is primarily what the title says: look at your model in as many ways as you can, and do not stop at the numerical statistics measures.

  4. So, what I now finally did was record my reservations.

    I hope I did not make big mistakes in my publications, but invite everyone here to read them, and if there are problems, to point them out.

    The underlying, general issue is that science has become so specialized, and we have so much knowledge now, that without setting up collaborations with the right people you end up doing work in an area where you are not an expert.

    And most scientists do not show they care. We are starting to see changes in publishing, but there are some huge gaps still. How do we record that a paper has limitations? Citations are not yet labeled (check out Shotton's CiTO!), and there is no central place to socially bookmark such information anyway.

    There are good papers; there are bad papers. We know about journals with good reviewers, and those with bad ones. Still, protein structures published in Nature and Science are of (relatively) bad quality.

    We have the perfect technical means of reducing the number of mistakes; we could even simply set up a piece of software that validates QSAR models. But this is not happening.

    Open Data, Open Source, Open Standards: these solve many problems. Such a QSAR validation service could pick out limited QSAR models easily, if the models' data were Open Data and used Open Standards.

    No rocket science, really. But too difficult for the scientific community. Sad, if you think about it.

    Then again, there are so many sad things about science. Why is it that we force all the smart scientists not to do science, but to do copy-editing work?

  5. You speak from my heart. In 1974 I studied a reaction that turned out to be a self-initiated polymerization. For the kinetic calculations I used one of the first HP computers. After repeatedly removing points from the beginning and the end of the data set, I used pen and paper to make a plot. In the end the kinetics were of uneven order, and all points fit the curves.

    Visualization is extremely important, a few more examples:

    Only visualization revealed that many data points were incorrect due to insufficient solubility of the compounds.

    Comparison of two versions of a program revealed a tremendous jump in prediction accuracy, or, you could also say, misplaced confidence in the previous version.

    Alexander Kos