Dear Advisory Board,
Ola Spjuth has recently been working on an extensive QSAR environment in Bioclipse, in which molecular descriptors are provided by remote services as well as by the CDK. The CDK has a relatively large collection of QSAR descriptors, but certainly not the full list discussed in the Handbook of Molecular Descriptors.
I'm sure everyone would appreciate a few more descriptors, and I am wondering which ones you would assign priority to. So:
which QSAR descriptors would you like to see implemented in the CDK?
Looking forward to hearing from you, preferably as a comment on this blog, or otherwise via email to the cdk-user mailing list or directly to me. Make sure to include a full reference to the paper that describes the algorithm.
Kind regards,
Egon
Hi Egon,
I would opt for more 3D or 4D descriptors. For what is available right now, see the CDK Descriptor GUI (thanks to Rajarshi):
http://rguha.net/code/java/cdkdesc.html
There are (n=124) 3D sensitive descriptors and (n=164) CDK 0D, 1D and 2D effective descriptors.
But 2D descriptors are available for free from MOLD2 (not open source), some from Marvin (free for academics), or from BlueDesc.
Some more important issues:
A) The CDK's speed is still mediocre, which is why I opted for Dragon or ISIDA/SMF here. Calculating descriptors for some 100k or a few million molecules just takes too long. It has improved, but it is still slow. Parallelism could be better implemented on the molecule-handling side: fire up 8 threads on 8 CPUs and handle 8 molecules at once.
B) Handling of hydrogens, aromaticity, nitro and nitroso groups, salts, etc.: these are important issues that have to be tested and validated on diverse sets of molecules. I remember that some of the CDK values differed from Dragon's; I used the NCI2000 test set (pretty diverse).
C) Are there JUnit tests for the descriptors? I don't know how to code them, but I know the concept. They are important for ensuring high quality.
I would use the NCI2000 test set as SDF; the SMILES are a bit messed up here:
http://cactus.nci.nih.gov/ncidb2/download.html
http://cactus.nci.nih.gov/DownLoad/NCI_aug00_SMI.sdz
Or maybe a boiled down version or selection of bad molecules, including a wide variety of functional groups and
D) I would (if I could) start with computationally cheap descriptors, ANY descriptor, 2D or preferably 3D or 4D. The CDK already covers a wide range of 2D descriptors. Basically, go for the low-hanging fruit, one by one.
E) What about the AMBIT descriptors? Those are LGPL too, so they could be included as is, putting everything in one package.
F) What about the JOELIB descriptors, they are in LPG license, are they all covered?
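The molecule-level parallelism suggested in point A could be sketched roughly as follows. This is only an illustration in Python, not CDK code: `toy_descriptor`, `calculate_all`, and the representation of a molecule as a dict of element counts are all hypothetical stand-ins for a real descriptor and a real molecule object.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_descriptor(molecule):
    """Hypothetical stand-in for a real descriptor: counts non-hydrogen atoms.

    `molecule` is assumed to be a dict mapping element symbols to atom counts.
    """
    return sum(count for element, count in molecule.items() if element != "H")


def calculate_all(molecules, descriptor, workers=8, executor_cls=ThreadPoolExecutor):
    """Apply `descriptor` to every molecule, using up to `workers` workers at once.

    Results come back in the same order as the input molecules.
    """
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(descriptor, molecules))


if __name__ == "__main__":
    ethanol = {"C": 2, "H": 6, "O": 1}
    benzene = {"C": 6, "H": 6}
    print(calculate_all([ethanol, benzene], toy_descriptor))
```

The point of the design is that each molecule is an independent unit of work, so no descriptor code has to be rewritten to benefit. In CPython, threads do not speed up CPU-bound pure-Python code, so for real calculations one would pass `concurrent.futures.ProcessPoolExecutor` as `executor_cls`; in Java a fixed-size thread pool would play the same role.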
In conclusion, a valid test set, high-speed calculation, and the ability to handle diverse molecules with accurate and correct results are most important to me.
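The descriptor tests asked about in point C could look something like the sketch below, written with Python's `unittest` as an analogue of JUnit. Everything in it is an assumption made for illustration: `heavy_atom_count` is a toy descriptor, molecules are plain dicts of element counts, and the small `REFERENCE` table stands in for a validated test set with functional groups such as nitro groups and salts.

```python
import unittest


def heavy_atom_count(molecule):
    """Toy descriptor (hypothetical): number of non-hydrogen atoms."""
    return sum(count for element, count in molecule.items() if element != "H")


# Hand-checked reference values: (name, element counts, expected descriptor value).
REFERENCE = [
    ("ethanol", {"C": 2, "H": 6, "O": 1}, 3),
    ("nitromethane", {"C": 1, "H": 3, "N": 1, "O": 2}, 4),
    ("sodium acetate", {"C": 2, "H": 3, "Na": 1, "O": 2}, 5),
]


class HeavyAtomCountTest(unittest.TestCase):
    def test_against_reference_values(self):
        # subTest reports each failing molecule separately instead of
        # stopping at the first mismatch.
        for name, molecule, expected in REFERENCE:
            with self.subTest(molecule=name):
                self.assertEqual(heavy_atom_count(molecule), expected)


if __name__ == "__main__":
    unittest.main()
```

For real-valued descriptors, `assertAlmostEqual` with an explicit tolerance would replace `assertEqual`, so that harmless rounding differences do not show up as failures.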
Cheers
Tobias
-------------------------------
CDK 3D sensitive descriptors (n=124):
GRAV-4, GRAV-1, GRAVH-1, MOMI-X, MOMI-Y, MOMI-Z, DPSA-2, PNSA-2, WNSA-2, PPSA-2, WPSA-2, TPSA, PNSA-1, WNSA-1, PPSA-1, WPSA-1, DPSA-1, DPSA-3, GRAV-5, WV.unity, THSA, PNSA-3, WNSA-3, GRAV-2, GRAVH-2, PPSA-3, WPSA-3, WA.unity, GRAV-6, GRAV-3, GRAVH-3, WT.unity, MOMI-XZ, MOMI-YZ, Wlambda3.unity, MOMI-R, Wlambda2.unity, Wlambda1.unity, RPCS, RNCS, LOBMIN, LOBMAX, geomShape, WD.unity, WK.unity, Weta2.unity, Weta1.unity, Wnu2.unity, FPSA-2, FNSA-2, MOMI-XY, Wnu1.unity, Weta3.unity, RHSA, RPSA, FPSA-1, FNSA-1, FNSA-3, FPSA-3, ATSp5, ATSp4, ATSp1, ATSp2, MW, AMR, fragC, ATSm3, WTPT-3, WTPT-4, ATSm4, Kier1, WTPT-1, ALogp2, BCUTw-1l, BCUTw-1h, apol, ATSm1, SP-0, VP-0, SPC-5, SPC-6, MDEC-33, TopoPSA, MDEO-12, bpol, SP-2, SP-5, Kier2, MDEC-23, SP-6, XLogP, BCUTp-1l, VP-5, VPC-5, MDEO-11, VAdjMat, SP-7, WTPT-2, BCUTc-1l, ATSc1, SC-3, VP-4, VPC-6, Kier3, SC-5, VP-6, VPC-4, MDEC-22, MDEC-24, MDEO-22, ATSc3, ATSc4, SCH-6, VC-3, RNCG, VC-5, SC-6, ATSc5, SCH-5, VCH-7, VCH-5, RPCG, SC-4, VC-4
CDK 1D and 2D effective descriptors (n=164) Kier&Hall compressed:
ALogP, BCUTc-1h, BCUTp-1h, Wgamma1.unity, Wgamma2.unity, Wgamma3.unity, WG.unity, nA, nR, nN, nD, nC, nF, nQ, nE, nG, nH, nI, nP, nL, nK, nM, nS, nT, nY, nV, nW, naAromAtom, nAromBond, nAtom, ATSc2, ATSm2, ATSm5, ATSp3, nB, C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2SP3, C3SP3, C4SP3, SCH-3, SCH-4, SCH-7, VCH-3, VCH-4, VCH-6, VC-6, SP-1, SP-3, SP-4, VP-1, VP-2, VP-3, VP-7, SPC-4, ECCEN, nHBDon, nHBAcc, Kier and Hall Series , nAtomLC, nAtomP, LipinskiFailures, nAtomLAC, MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-34, MDEC-44, MDEN-11, MDEN-12, MDEN-13, MDEN-22, MDEN-23, MDEN-33, PetitjeanNumber, topoShape, nRotB, WTPT-5, WPATH, WPOL, Zagreb
Hi,
A correction to the lists above.
3D and 2D: Excel and OpenOffice calculate the deviation of 64 identical values (64 * -3.27699999999999) to be
OpenOffice (avedev)
0.0000000000000288657986402541
EXCEL (stdev)
0.000000000000002685599106721010
instead of zero.
So please ignore the numbers given for
CDK 3D sensitive descriptors (n=124) and
CDK 1D and 2D effective descriptors (n=164)
Keywords: IEEE 754, accuracy issue, sucrose stereoisomers
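A minimal sketch of how such a check could be made robust against rounding noise: instead of testing whether the deviation is exactly zero, compare it against a small tolerance relative to the size of the values. The function names `spread` and `is_sensitive` and the tolerance of 1e-9 are assumptions chosen for illustration, not taken from any of the tools mentioned.

```python
import math


def spread(values):
    """Population standard deviation, computed in two passes with plain floats."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def is_sensitive(values, rel_tol=1e-9):
    """Return True only if the values vary by more than rounding noise.

    The threshold scales with the largest magnitude among the values (but is
    never smaller than rel_tol itself), so a spread on the order of 1e-14 for
    values around 3 is treated as zero.
    """
    scale = max(abs(v) for v in values)
    return spread(values) > rel_tol * max(scale, 1.0)


if __name__ == "__main__":
    identical = [-3.27699999999999] * 64
    print(spread(identical), is_sensitive(identical))
    print(is_sensitive([1.0, 1.5, 2.0]))
```

With a check like this, a descriptor whose value is the same for all 64 stereoisomers would be classified as not 3D sensitive, even if the computed deviation is a tiny nonzero number.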
Cheers
Tobias