Wednesday, August 05, 2009

Dear Advisory Board: which QSAR descriptors would you like to see implemented in the CDK?

Dear Advisory Board,

Ola Spjuth has recently been working on an extensive QSAR environment in Bioclipse, in which molecular descriptors are provided by remote services but also by the CDK. The CDK has a relatively large collection of QSAR descriptors, but certainly not the full list discussed in the Handbook of Molecular Descriptors.

I'm sure everyone would appreciate a few more descriptors, and I am wondering which ones you would assign priority to. So:

which QSAR descriptors would you like to see implemented in the CDK?

Looking forward to hearing from you, preferably as a comment on this blog, or otherwise via email to the cdk-user mailing list or directly to me. Make sure to include a full reference to the paper that describes the algorithm.

Kind regards,



  1. Hi Egon,
    I would opt for more 3D or 4D descriptors. For what is available right now, see the CDK Descriptor GUI (thanks to Rajarshi).

    There are 3D-sensitive descriptors (n=124) and CDK 0D, 1D and 2D effective descriptors (n=164).

    2D descriptors, however, you can get for free from MOLD2 (not open source), and some from Marvin (free for academics) or BlueDesc.

    Some more important issues:

    A) The CDK speed is still mediocre, which is why I opted for Dragon or ISIDA/SMF. Calculating descriptors for some 100k molecules, or a few million, just takes too long. It has improved, but it's still slow. Parallelism could be better implemented on the molecule handling side: fire 8 threads for 8 CPUs, handling 8 molecules at once.
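
    The per-molecule parallelism suggested in A) could look roughly like the sketch below. It does not use the CDK API at all: `ParallelDescriptorSketch`, `computeAll` and the crude `carbonCount` stand-in descriptor are hypothetical names invented for illustration, and molecules are represented as plain SMILES strings. It only shows the idea of a fixed pool of worker threads, each handling one molecule at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: none of these names come from the CDK.
final class ParallelDescriptorSketch {

    // Crude stand-in for a real descriptor: counts upper-case 'C' characters
    // in a SMILES string (so aromatic 'c' is ignored and "Cl" is miscounted).
    static double carbonCount(String smiles) {
        double count = 0;
        for (char ch : smiles.toCharArray()) {
            if (ch == 'C') {
                count++;
            }
        }
        return count;
    }

    // Submits one task per molecule to a fixed-size thread pool and collects
    // the results in input order.
    static double[] computeAll(List<String> molecules,
                               ToDoubleFunction<String> descriptor,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String molecule : molecules) {
                Callable<Double> task = () -> descriptor.applyAsDouble(molecule);
                futures.add(pool.submit(task));
            }
            double[] results = new double[molecules.size()];
            for (int i = 0; i < results.length; i++) {
                results[i] = futures.get(i).get();
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("descriptor calculation failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> smiles = Arrays.asList("CCO", "c1ccccc1", "CC(=O)O");
        // One worker per available CPU, e.g. 8 threads on an 8-core machine.
        int threads = Runtime.getRuntime().availableProcessors();
        double[] values = computeAll(smiles, ParallelDescriptorSketch::carbonCount, threads);
        for (int i = 0; i < values.length; i++) {
            System.out.println(smiles.get(i) + " -> " + values[i]);
        }
    }
}
```

    Whether this pays off in practice depends on the descriptor code being thread-safe, which this sketch simply assumes.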

    B) Handling of hydrogens, aromaticity, nitro and nitroso groups, salts, etc. These are important issues which have to be tested and validated on diverse sets of molecules. I remember some of the CDK values being different from Dragon's when I used the NCI2000 test set (pretty diverse).
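
    A comparison like the one described in B) could be automated along the lines of the sketch below. It is plain Java with no CDK or Dragon calls; `DescriptorComparisonSketch` and `findMismatches` are hypothetical names, and the molecule identifiers and numbers in `main` are invented purely for illustration, not real CDK or Dragon output. The idea is to compare computed values against reference values with a tolerance and to report every molecule that is missing, NaN, or off.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compares descriptor values keyed by molecule id.
final class DescriptorComparisonSketch {

    // Returns the ids (in reference order) whose computed value is missing,
    // NaN, or differs from the reference by more than the tolerance.
    static List<String> findMismatches(Map<String, Double> computed,
                                       Map<String, Double> reference,
                                       double relTol) {
        List<String> mismatches = new ArrayList<>();
        for (Map.Entry<String, Double> entry : reference.entrySet()) {
            String id = entry.getKey();
            double expected = entry.getValue();
            Double actual = computed.get(id);
            if (actual == null || Double.isNaN(actual)) {
                mismatches.add(id);
                continue;
            }
            // Relative tolerance, falling back to absolute near zero.
            double scale = Math.max(1.0, Math.abs(expected));
            if (Math.abs(actual - expected) > relTol * scale) {
                mismatches.add(id);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Invented illustrative values only.
        Map<String, Double> reference = new LinkedHashMap<>();
        reference.put("mol-1", 46.07);
        reference.put("mol-2", 78.11);
        Map<String, Double> computed = new LinkedHashMap<>();
        computed.put("mol-1", 46.07);
        computed.put("mol-2", 79.00);
        System.out.println("Mismatches: " + findMismatches(computed, reference, 1e-6));
    }
}
```

    The same check could be wrapped in a unit test per descriptor, so that a change in hydrogen or aromaticity handling shows up as a failing test rather than as a silent change in values.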

    C) Are there JUnit tests for the descriptors? I don't know how to code them, but I know the concept, and they are important for ensuring high quality.
    I would use the NCI2000 test set as SDF, because the SMILES are a bit messed up here:

    Or maybe a boiled down version or selection of bad molecules, including a wide variety of functional groups and

    D) I would (if I could) start with computationally cheap descriptors: ANY descriptor, 2D or preferably 3D or 4D. The CDK already covers a wide range of 2D descriptors. Basically, go for the low-hanging fruit, one by one.

    E) What about the AMBIT descriptors? Those are LGPL too, so they could be included as is and end up in one package.

    F) What about the JOELIB descriptors, they are in LPG license, are they all covered?

    In conclusion, a validated test set, high-speed calculation, and the ability to handle diverse molecules with accurate and correct results are most important to me.

    CDK 3D sensitive descriptors (n=124):
    GRAV-4, GRAV-1, GRAVH-1, MOMI-X, MOMI-Y, MOMI-Z, DPSA-2, PNSA-2, WNSA-2, PPSA-2, WPSA-2, TPSA, PNSA-1, WNSA-1, PPSA-1, WPSA-1, DPSA-1, DPSA-3, GRAV-5, WV.unity, THSA, PNSA-3, WNSA-3, GRAV-2, GRAVH-2, PPSA-3, WPSA-3, WA.unity, GRAV-6, GRAV-3, GRAVH-3, WT.unity, MOMI-XZ, MOMI-YZ, Wlambda3.unity, MOMI-R, Wlambda2.unity, Wlambda1.unity, RPCS, RNCS, LOBMIN, LOBMAX, geomShape, WD.unity, WK.unity, Weta2.unity, Weta1.unity, Wnu2.unity, FPSA-2, FNSA-2, MOMI-XY, Wnu1.unity, Weta3.unity, RHSA, RPSA, FPSA-1, FNSA-1, FNSA-3, FPSA-3, ATSp5, ATSp4, ATSp1, ATSp2, MW, AMR, fragC, ATSm3, WTPT-3, WTPT-4, ATSm4, Kier1, WTPT-1, ALogp2, BCUTw-1l, BCUTw-1h, apol, ATSm1, SP-0, VP-0, SPC-5, SPC-6, MDEC-33, TopoPSA, MDEO-12, bpol, SP-2, SP-5, Kier2, MDEC-23, SP-6, XLogP, BCUTp-1l, VP-5, VPC-5, MDEO-11, VAdjMat, SP-7, WTPT-2, BCUTc-1l, ATSc1, SC-3, VP-4, VPC-6, Kier3, SC-5, VP-6, VPC-4, MDEC-22, MDEC-24, MDEO-22, ATSc3, ATSc4, SCH-6, VC-3, RNCG, VC-5, SC-6, ATSc5, SCH-5, VCH-7, VCH-5, RPCG, SC-4, VC-4

    CDK 1D and 2D effective descriptors (n=164) Kier&Hall compressed:
    ALogP, BCUTc-1h, BCUTp-1h, Wgamma1.unity, Wgamma2.unity, Wgamma3.unity, WG.unity, nA, nR, nN, nD, nC, nF, nQ, nE, nG, nH, nI, nP, nL, nK, nM, nS, nT, nY, nV, nW, naAromAtom, nAromBond, nAtom, ATSc2, ATSm2, ATSm5, ATSp3, nB, C1SP1, C2SP1, C1SP2, C2SP2, C3SP2, C1SP3, C2SP3, C3SP3, C4SP3, SCH-3, SCH-4, SCH-7, VCH-3, VCH-4, VCH-6, VC-6, SP-1, SP-3, SP-4, VP-1, VP-2, VP-3, VP-7, SPC-4, ECCEN, nHBDon, nHBAcc, Kier and Hall Series , nAtomLC, nAtomP, LipinskiFailures, nAtomLAC, MDEC-11, MDEC-12, MDEC-13, MDEC-14, MDEC-34, MDEC-44, MDEN-11, MDEN-12, MDEN-13, MDEN-22, MDEN-23, MDEN-33, PetitjeanNumber, topoShape, nRotB, WTPT-5, WPATH, WPOL, Zagreb

  2. Hi,
    correction to the list above
    3D and 2D: EXCEL and OpenOffice calculate the stdev of 64 identical values of -3.27699999999999 to be
    OpenOffice (avedev)
    EXCEL (stdev)

    instead of zero.

    So please ignore the numbers
    CDK 3D sensitive descriptors (n=124) and
    CDK 1D and 2D effective descriptors (n=164)

    Keywords: IEEE 754, accuracy issue, sucrose stereoisomers
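
    A minimal sketch of how such a check can be made robust against floating-point rounding. It assumes the lists above were derived by testing whether each descriptor's standard deviation over a set of stereoisomers was exactly zero, which is an assumption and not stated in the comment. `ConstantColumnSketch` and its methods are hypothetical names. Instead of comparing a computed standard deviation with zero, it compares the range (max minus min) against a small tolerance.

```java
import java.util.Arrays;

// Hypothetical sketch: decides whether a column of descriptor values is constant.
final class ConstantColumnSketch {

    // Two-pass sample standard deviation. The mean is itself a rounded
    // floating-point value, so the result is not guaranteed to be exactly zero
    // for identical inputs.
    static double standardDeviation(double[] values) {
        if (values.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double squares = 0.0;
        for (double v : values) {
            double d = v - mean;
            squares += d * d;
        }
        return Math.sqrt(squares / (values.length - 1));
    }

    // Treats the column as constant when max - min is within a tolerance
    // scaled by the magnitude of the values.
    static boolean isConstant(double[] values, double relTol) {
        if (values.length == 0) {
            throw new IllegalArgumentException("need at least one value");
        }
        double min = values[0];
        double max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double scale = Math.max(1.0, Math.max(Math.abs(min), Math.abs(max)));
        return (max - min) <= relTol * scale;
    }

    public static void main(String[] args) {
        double[] values = new double[64];
        Arrays.fill(values, -3.27699999999999);
        System.out.println("stdev    = " + standardDeviation(values));
        System.out.println("constant = " + isConstant(values, 1e-12));
    }
}
```

    With a tolerance-based test, a descriptor that is identical for all 64 structures is classified as insensitive even if a spreadsheet reports a tiny non-zero standard deviation for it.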