Saturday, April 07, 2012

A typical QSAR study (cito:citesAsAuthority)

I use CiTO to keep track of how the CDK is cited and used, and just looked at a typical QSAR paper. Here are my comments on "Study of indole derivative inhibitors of Cytosolic phospholipase A2α based on Quantitative Structure Activity Relationship", by Lu et al. (doi:10.1016/j.chemolab.2011.11.011). Normally, these reviews, which I publish via the CDK Google+ page, are fairly short and only describe what CDK functionality is being used. But this time the post became a more substantial review, so I decided to put it here too, and to use ResearchBlogging, which I haven't done in a while.

The paper by Lu et al. is a typical QSAR paper, with fewer than 50 compounds, hundreds of descriptors, and some machine learning. They cite the CDK as a free tool to calculate descriptors, but use something else. The article compares PLS, ANN, and SVM in the typical bad way: it does not separate the effect of the kernel (RBF) from that of the regression model, which makes the comparison pretty uninformative.

If I scanned the paper correctly, they use a single test set, with LOO cross-validation to tune the parameters of the modeling methods. The test set compounds are picked from the outer ends of the endpoint range, and no information is given on the variance of the R2 and Q2 statistics. BTW, these two statistics are surprisingly close to each other (for each method separately). I wonder if that holds for all possible test sets; some bootstrapping seems in order here.

Also, stepwise MLR was used for descriptor selection, thus prior to statistical modeling, and it seems to me PLS, ANN, and SVR were performed on this subset! Well, that makes the comparison even less relevant, as PLS does not require such prior selection. Moreover, it is known that stepwise MLR easily gets stuck in local minima rather than finding the optimal combination of descriptors.

ResearchBlogging.org: Lu, X., Ji, D., Chen, J., Zhou, X., & Shi, H. (2012). Study of indole derivative inhibitors of Cytosolic phospholipase A2α based on Quantitative Structure Activity Relationship. Chemometrics and Intelligent Laboratory Systems. DOI: 10.1016/j.chemolab.2011.11.011


  1. For me, the question is open:
    1. What to do with statistically badly written papers?
    2. What to do when a paper actually repeats work already done 1-2-5-10 years ago, and doesn't have any new results (better Q2, etc)?

    I once sent a letter to the editor about a really badly written paper, but did not receive any comments.

    1. Vladimir, you could have a look at the 'forensic bioinformatics' work by K. Baggerly:

  2. With so many QSAR papers out there, could you point us to some citations that you would categorize as "good" QSAR papers? I'm a medicinal chemist in the middle of trying to build QSAR models, and a solid example would be quite helpful.

    1. Anonymous, a very fair point. There are a few good papers in this area, and I will soon aggregate those and summarize them in a blog post. Validation is particularly important, and the most important thing to focus on. "Beware of Q^2!" is a good read, for example. Make sure to always visualize your predictions, e.g. in a y_pred versus y_measured scatter plot.

      For the paper discussed in the blog post, I can recommend the following. With fewer than 100 compounds, there is a lot of freedom for your statistical method to come up with a model. Make sure to use a good independent test set, and to use cross-validation on the training set to find good parameter settings. You can use y-randomization and bootstrapping to estimate the predictive power that a random numerical model would give (that is, a model with no cause-effect relationship).

      Another aspect of this paper was that they compared two methods, but confounded the regression method (SVR or PLS) with the kernel used to make a non-linear space linear. Keep in mind that SVR is a linear regression method, and that non-linear kernels can be applied to PLS too (not to be confused with kernel-PLS, which is a method where the PLS algorithm is reformulated to be more efficient). I just got alerted this morning about this book chapter by the Zell group, which you will probably like, if you want to learn more about PLS versus SVR:

    2. Thanks for the insight, and a few places to get started. I look forward to a future post. In my models, I've been getting good R^2 values (>0.9), but Q^2 is burning them to the ground (approx. 0.04). Frustrating.

    3. Anonymous, that sounds like you use too many variables with respect to the number of objects? Given enough variables, you can fit anything. A commonly cited ratio is 4 to 5 objects per variable.

    4. @Anonymous, since you mentioned "good" QSAR models, read this:

      QSAR: All models are wrong, but some are useful.
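
The pattern discussed in the comments above, a high fitted R^2 next to a poor leave-one-out Q^2 when there are too many variables per object, can be reproduced with a small sketch. The code below is only an illustration under stated assumptions: the "activity" is pure noise, the descriptors are random numbers unrelated to it, and 20 objects with 4 versus 15 descriptors are arbitrary choices. The helpers (`solve`, `fit_mlr`, `predict`, `r2_and_q2`) are hypothetical names and use plain MLR, not any method from the paper.

```python
import random

def solve(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit_mlr(X, y):
    """Least-squares MLR with intercept, via the normal equations."""
    Xa = [[1.0] + row for row in X]
    p = len(Xa[0])
    xtx = [[sum(r[i] * r[j] for r in Xa) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(Xa, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(w, row):
    """Prediction for one object: intercept plus weighted descriptors."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))

def r2_and_q2(X, y):
    """Fitted R^2 and leave-one-out Q^2 for the same data."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    w = fit_mlr(X, y)
    rss = sum((yi - predict(w, xi)) ** 2 for xi, yi in zip(X, y))
    press = 0.0
    for i in range(len(y)):
        # Refit without object i, then predict the left-out object.
        w_i = fit_mlr(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(w_i, X[i])) ** 2
    return 1.0 - rss / ss_tot, 1.0 - press / ss_tot

rng = random.Random(1)
n_objects = 20
y = [rng.gauss(0, 1) for _ in range(n_objects)]  # pure-noise "activity"

for n_vars in (4, 15):
    X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_objects)]
    r2, q2 = r2_and_q2(X, y)
    print(f"{n_vars:2d} descriptors: R2={r2:.2f}  Q2={q2:.2f}")
```

Because there is nothing to learn here, any R^2 the fit reports is pure overfitting, and Q^2 can never exceed R^2 in this setup. Comparing the two printed lines shows how adding descriptors inflates the fit without improving prediction.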