Sunday, June 07, 2015

CDK Literature #7

CC-BY-SA from WikiMedia.
Despite evidence that it does not make sense to aim for something, I did it again: I aimed at discussion some five CDK-citing papers each week. That was three weeks ago, and I don't really have time today either. But let me cover a few, so that I do not get even further behind.

Subset selection in QSAR modeling
We (intuitively) know that negative data is important for statistical pattern recognition and modelling. We also know that literature is not quite riddled with such data. This paper, however, studies the effect of using sets of inactive compounds in modelling and particularly the part about selecting which compounds should go into the training set. Like with the positive compounds, in the results of this paper too, the selection method matters. The CDK is used to calculate fingerprints.

Smusz, S., Kurczab, R., Bojarski, A. J., Apr. 2013. The influence of the inactives subset generation on the performance of machine learning methods. Journal of Cheminformatics 5 (1), 17+. URL

Using fingerprints to create clustering trees

I need to read this paper by Palacios-Bejarano et al. in more detail, because it seems quite interesting. The use fingerprints to make clustering trees, which, if I understand it correctly, be used to calculate similarities between molecules. That is used in QSAR modeling of the CLogP, and the results suggest that while MCS works better, this approach is more robust. This paper too uses the CDK for fingerprint calculation.

Palacios-Bejarano, B., Cerruela Garcia, G., Luque Ruiz, I., Gómez-Nieto, M., Jun. 2013. An algorithm for pattern extraction in fingerprints. Chemometrics and Intelligent Laboratory Systems 125, 87-100. URL