Friday, December 23, 2005

Subset selection: mind the complexity

In a recent JCIM article, Schuffenhauer compares a few subset selection methods, and notes that some of them reduce the average complexity of the selected molecules. The article relates this to other research stating that lead compounds with high complexity have higher activities. Recommended reading material for the holidays.


  1. I know it's a bit late for this comment, but don't you think that at least some of his findings are an artifact? Given that the usual corporate database is highly biased towards larger molecules, I would expect the smaller ones to be sort of 'outliers' and therefore selected more often, at least with OptiSim-like selection methods.

  2. Hi hik,

    If a data set is indeed skewed, that is, not normally distributed in its molecular parameters, then subset selection methods will introduce a bias. If the median is at a higher complexity or size, then this bias is likely towards lower complexity and size, unless it is corrected for.

    Depending on your point of view, this skewness can be seen as an artifact, but I prefer to see it as a property of the data. The article indicates that the benchmarked selection methods do have this bias and that this bias might be relevant to actual studies.

    Now, I don't have much hands-on experience with commercial compound databases, nor with drug design, so I can't say much about that based on intuition, but I do think the article nicely showed that data mining and modelling are not as easy as pushing a button.

    I am very interested in knowing which findings you think are most likely to be caused by artifacts, and in discussing how that could be proven.

  3. Sorry Nik... typo. Too bad I cannot modify the comments :(
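
The bias discussed in the comments above can be illustrated with a small sketch. It is not the method from the article, nor OptiSim itself: it assumes a plain greedy MaxMin picker, a synthetic one-dimensional "complexity" score, and a made-up population in which most values sit near the top with a thin tail of low values. The function name `maxmin_select` and all numbers are hypothetical.

```python
import random
from statistics import mean


def maxmin_select(values, k):
    """Greedy MaxMin picking on 1D values (illustrative, not OptiSim)."""
    # Start from the first item; real tools often start from a random
    # or most-central compound instead.
    selected = [values[0]]
    # Distance from every candidate to its nearest selected value.
    min_dist = [abs(v - selected[0]) for v in values]
    while len(selected) < k:
        # Pick the candidate that is farthest from everything selected so far.
        idx = max(range(len(values)), key=min_dist.__getitem__)
        selected.append(values[idx])
        min_dist = [min(d, abs(v - values[idx])) for d, v in zip(min_dist, values)]
    return selected


random.seed(0)
# Synthetic, assumed population: most "molecules" have a complexity near 100,
# with a thin tail of simpler ones (a skewed distribution).
population = [100.0 - random.expovariate(1 / 10.0) for _ in range(1000)]
subset = maxmin_select(population, 20)

population_mean = mean(population)
subset_mean = mean(subset)
print(f"population mean: {population_mean:.1f}, subset mean: {subset_mean:.1f}")
```

Because the picker spreads its choices evenly over the whole range, the rare low-complexity values are over-represented in the subset, so its mean complexity ends up below that of the full population. Whether one calls that an artifact or a data property is exactly the question raised in the comments.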