Comments on chem-bla-ics: Subset selection: mind the complexity

Sorry Nik... typo. Too bad I cannot modify the com...

2006-01-27T15:22:00.000+01:00

Sorry Nik... typo. Too bad I cannot modify the comments :(

Hi hik,if a data set is indeed skewed, that is not...

2006-01-27T15:21:00.000+01:00

Hi hik,

if a data set is indeed skewed, that is, not normally distributed in its molecular parameters, then subset selection methods will introduce a bias. If the median lies at a higher complexity or size, then this bias is likely to be towards lower complexity and size, unless there is a correction for it.
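To make this concrete, here is a minimal sketch of the effect. It assumes a made-up one-dimensional "size" descriptor and a plain greedy max-min dissimilarity picker. That picker is a stand-in I chose for illustration; it is not OptiSim and not any of the methods benchmarked in the article. The data set (`sizes`) and the function name (`maxmin_pick`) are hypothetical.

```python
from statistics import median


def maxmin_pick(values, k):
    """Greedy max-min selection on a 1-D descriptor (illustrative only).

    Starts from the value closest to the data median, then repeatedly adds
    the value whose smallest distance to the already picked ones is largest.
    """
    med = median(values)
    picked = [min(values, key=lambda v: abs(v - med))]
    while len(picked) < k:
        # Already picked values have distance 0 to themselves, so they
        # cannot win while unpicked values remain.
        best = max(values, key=lambda v: min(abs(v - p) for p in picked))
        picked.append(best)
    return picked


# Hypothetical skewed data set: 3 small molecules, 100 large ones.
sizes = [100, 200, 300] + list(range(400, 500))

picked = maxmin_pick(sizes, 5)
print("median of full set:", median(sizes))
print("picked subset:     ", sorted(picked))
print("median of subset:  ", median(picked))
```

In this toy case all three small values end up in the subset of five, although they make up only 3 of the 103 entries, so the subset median falls well below the median of the full set. Whether real descriptor spaces behave the same way is exactly what would need to be tested.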

Depending on your point of view, this skewness can be seen as an artifact, but I prefer to see it as a property of the data. The article indicates that the benchmarked selection methods do have this bias, and that this bias might be relevant to actual studies.

Now, I don't have much hands-on experience with commercial compound databases, nor with drug design, so I can't say much on that based on intuition, but I do think the article nicely shows that data mining and modelling are not as easy as pushing a button.

I am very interested in knowing which findings you think are most likely to be caused by artifacts, and in discussing how that could be proven.

Know it's a bit late for this comment, but don't y...

2006-01-26T12:27:00.000+01:00

Know it's a bit late for this comment, but don't you think that at least some of his findings are an artifact? Given that the usual corporate database is highly biased towards larger molecules, I would expect the smaller ones to be sort of 'outliers' and therefore selected more often ... at least with OptiSim-like selection methods ...