## Tuesday, November 08, 2005

### When to stop including QSAR model variables...

Yesterday I reviewed an article that published a QSPR model looking something like:

y = 151 + 50p1 - 12p2 - 0.006p3

with quite OK prediction results (R=0.9880). But I was not comfortable with the coefficient for the p3 variable. The article did not calculate significances for the coefficients, so it was not obvious from the article whether it was useful to include them. I then looked at the range of p3, which was 110-150; so the maximal influence this variable can have is 150*0.006 = 0.9. Now, the experimental values given in the article were rounded to integers, indicating that the maximal effect of the p3 variable is smaller than the experimental error! It's even worse when you consider the difference between the min and max values (40): then the influence is at most 40*0.006 = 0.24 (assuming that most modeling methods would put the mean temperature effect in the offset, 151 in this case).
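The arithmetic above can be written down as a small check. This is only a sketch using the numbers quoted in this post; the variable names are mine, and treating integer rounding as a resolution of 1 unit (an error of up to 0.5) is an assumption about the experimental data, not something the article states.

```python
# Coefficient and observed range of p3, as quoted in the post.
coef_p3 = -0.006
p3_min, p3_max = 110.0, 150.0

# Largest absolute contribution the p3 term can make to y.
max_abs_contribution = abs(coef_p3) * p3_max

# Change in y when p3 moves across its whole observed range.
span_contribution = abs(coef_p3) * (p3_max - p3_min)

# Assumption: y values rounded to integers, so the resolution is 1 unit
# and the rounding error alone can be as large as 0.5.
resolution = 1.0
max_rounding_error = resolution / 2

print(f"max |coef * p3|       = {max_abs_contribution:.2f}")
print(f"effect over the range = {span_contribution:.2f}")
print(f"below resolution:       {max_abs_contribution < resolution}")
print(f"range effect below rounding error: {span_contribution < max_rounding_error}")
```

The second quantity is the more relevant one when the mean contribution of p3 has been absorbed into the offset.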

Today, I reread an article with a similar issue. The model was something like:

y = -0.81 + 0.03*p1 + 0.009*p2

Here, max(p2)-min(p2) is smaller than 100, so the maximal effect of the variable would be on the order of 0.9, which is of the same order as the root mean square error of prediction (RMSEP) for this model. Indeed, the article already states that the coefficient is only significant at the 95% level, and not at the 99% level. But, without having calculated the RMSEP for a model without the p2 variable, I would guess that leaving it out would give equally good prediction results.
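The comparison I did not run could look roughly like the sketch below: fit the model with and without p2 and compare a leave-one-out RMSE, used here as a stand-in for RMSEP. The data are synthetic and purely illustrative. The range of p1, the noise level of 0.9, the sample size and all function names are my assumptions, not values from the article, so the printed numbers say nothing about the published model.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [offset, b1, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


def loo_rmse(rows, y):
    """Leave-one-out root mean square error of the refitted model."""
    squared = []
    for i in range(len(y)):
        coefs = fit_ols(rows[:i] + rows[i + 1:], y[:i] + y[i + 1:])
        squared.append((y[i] - predict(coefs, rows[i])) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Synthetic data (assumption): p2 spans 0-100 as in the post, the p1 range
# and the noise standard deviation of 0.9 are invented for illustration.
rng = random.Random(0)
n = 40
p1 = [rng.uniform(0.0, 200.0) for _ in range(n)]
p2 = [rng.uniform(0.0, 100.0) for _ in range(n)]
y = [-0.81 + 0.03 * a + 0.009 * b + rng.gauss(0.0, 0.9) for a, b in zip(p1, p2)]

rmse_full = loo_rmse([[a, b] for a, b in zip(p1, p2)], y)
rmse_reduced = loo_rmse([[a] for a in p1], y)
print(f"leave-one-out RMSE with p2:    {rmse_full:.3f}")
print(f"leave-one-out RMSE without p2: {rmse_reduced:.3f}")
```

If the two numbers are about the same on the real data, that would support dropping p2; a clearly lower error for the full model would argue for keeping it.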

Concluding, I would say that the p2 variable does not add relevant information.

Do you think it is reasonable to include the p2 variable in the second model?

#### 1 comment:

1. I agree with the observation that without values of the t-statistic and corresponding p-values, it's difficult to say whether p3 (or p2 in the second case) really has any effect on the model.

In my opinion, lack of these statistics makes the model meaningless - yes, you could evaluate the range of the variable and look at the maximal influence as you have done. I don't think that should be required when a statistical model is presented.

OK, enough of the rant!

One question I have: were the input variables scaled? If not, that would explain the order-of-magnitude differences in the coefficients. And in such a case, it would not be wise to discount p3 (or p2 in the second case), since it is possible that these variables explain some of the variance, but due to the lack of scaling this would not be apparent.

On the other hand if the data was scaled, then yes, I would agree that p2 in the second model could probably be dropped.
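To illustrate the scaling point: one rough way to put coefficients on a comparable footing is to multiply each by the standard deviation of its variable, which gives the change in y per standard deviation of the descriptor. The sketch below uses the coefficients of the second model, but the descriptor values are hypothetical, made up only to show the mechanics.

```python
import statistics

# Coefficients of the second model, as quoted in the post.
coefs = {"p1": 0.03, "p2": 0.009}

# Hypothetical descriptor values (assumption): not taken from either article.
data = {
    "p1": [1.2, 3.4, 2.2, 5.1, 4.0, 2.9],
    "p2": [12.0, 95.0, 40.0, 71.0, 55.0, 23.0],
}

# Effect on y of a one-standard-deviation change in each descriptor.
scaled = {
    name: coefs[name] * statistics.stdev(values)
    for name, values in data.items()
}

for name, effect in scaled.items():
    print(f"{name}: raw coefficient {coefs[name]:g}, "
          f"effect per standard deviation {effect:.3f}")
```

With these made-up values p2 ends up with the larger per-standard-deviation effect despite its smaller raw coefficient, which is exactly why raw coefficients of unscaled variables should not be compared directly.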

Apart from the use of statistical tests, a quick way to check that the model is not overfit (i.e. that p3 or p2 are not extraneous variables) is to run a PLS using the descriptors in the models. If overfitting is not occurring, all 3 components (or 2 in the case of the second model) will be validated. If not, then you know there's something wrong!

But in the end, as I said above, reporting a regression model without supporting statistics makes the use of the model pretty shaky.