Thursday, December 20, 2007

The molecular QSAR descriptors in the CDK

Pending the release of Bioclipse 1.2.0, Ola asked me to implement some additional functionality for the QSAR feature, such as using the filenames as labels in the descriptor matrix. See also these earlier items: (How much more open can notebook science get?)

But I ran into some trouble when both JOElib and CDK descriptors were selected, or rather, Ola did. There is not much I plan to do about the JOElib code, but at least I could investigate the CDK code.

The QSAR descriptor framework has been published in the paper Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics (DOI:10.2174/138161206777585274). However, while most molecular descriptors had JUnit tests for at least the calculate() method, full and proper module testing was not set up. Such testing involves rough coverage testing and test methods for all methods in the classes.

So, I set up a new CDK module called qsarmolecular, and added the coverage test class QsarmolecularCoverageTest. This class is really short and basically only requires a module to be set up, as reflected by the line:
private final static String CLASS_LIST = "qsarmolecular.javafiles";
The actual functionality is inherited from CoverageTest. Unlike tools such as Emma, for which Nightly generates reports, this coverage testing relies on a certain naming scheme (explained in Development Tools. 1. Unit testing in CDK News 2.2).
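To make the idea of naming-scheme-based coverage testing concrete, here is a minimal sketch. It is not the CDK CoverageTest implementation: the class NamingCoverageSketch, the toy classes, and the simplified convention "method foo() is covered by a test method testFoo()" are all assumptions for illustration. The actual CDK scheme is the one described in CDK News 2.2.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; NOT the CDK CoverageTest class.
class NamingCoverageSketch {

    // Assumed, simplified convention: foo() is covered by testFoo().
    static String expectedTestName(String methodName) {
        return "test" + Character.toUpperCase(methodName.charAt(0))
                + methodName.substring(1);
    }

    // Returns the public methods declared by 'source' that have no
    // matching test method declared in 'tests', sorted by name.
    static List<String> missingTests(Class<?> source, Class<?> tests) {
        Set<String> testNames = new HashSet<>();
        for (Method m : tests.getDeclaredMethods()) {
            testNames.add(m.getName());
        }
        List<String> missing = new ArrayList<>();
        for (Method m : source.getDeclaredMethods()) {
            if (!Modifier.isPublic(m.getModifiers()) || m.isSynthetic()) {
                continue; // only public API methods need a test
            }
            if (!testNames.contains(expectedTestName(m.getName()))) {
                missing.add(m.getName());
            }
        }
        Collections.sort(missing);
        return missing;
    }

    // Toy class under test (hypothetical, not a CDK descriptor).
    static class ToyDescriptor {
        public double calculate() { return 1.0; }
        public String[] getParameterNames() { return new String[0]; }
    }

    // Toy test class that covers only calculate().
    static class ToyDescriptorTest {
        public void testCalculate() { }
    }

    public static void main(String[] args) {
        // prints [getParameterNames]
        System.out.println(missingTests(ToyDescriptor.class, ToyDescriptorTest.class));
    }
}
```

A check like this only verifies that a test method with the expected name exists. It says nothing about how thoroughly that test exercises the code, which is where line-coverage tools such as Emma complement it.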

Now, the testing for a lot of the methods in the IMolecularDescriptor and IDescriptor interfaces is actually identical for all descriptors. Therefore, I wrote a MolecularDescriptorTest and made all JUnit test classes for the molecular descriptors extend this new class. This means that by writing only 10 new tests, with 29 assert statements, for the 45 molecular descriptor classes, 450 new unit tests (10 tests times 45 classes) are run without special effort, making the total number of unit tests run each night by Nightly for trunk/ pass 4500.
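The inheritance pattern described here can be sketched as follows. This is an illustration only, not the actual MolecularDescriptorTest: the interface ToyDescriptor, its methods valueNames() and calculate(String), the toy descriptors, and the small reflection-based runner (used instead of JUnit to keep the example self-contained) are all hypothetical.

```java
import java.lang.reflect.Method;

// Hypothetical sketch of shared tests in an abstract base class; names are not CDK's.
class SharedDescriptorTestSketch {

    // Stand-in for a descriptor interface (far smaller than IMolecularDescriptor).
    interface ToyDescriptor {
        String[] valueNames();
        double[] calculate(String input);
    }

    static class LetterCount implements ToyDescriptor {
        public String[] valueNames() { return new String[] {"nLetters"}; }
        public double[] calculate(String input) {
            int n = 0;
            for (char c : input.toCharArray()) if (Character.isLetter(c)) n++;
            return new double[] {n};
        }
    }

    static class UpperLowerCount implements ToyDescriptor {
        public String[] valueNames() { return new String[] {"nUpper", "nLower"}; }
        public double[] calculate(String input) {
            int up = 0, low = 0;
            for (char c : input.toCharArray()) {
                if (Character.isUpperCase(c)) up++;
                else if (Character.isLowerCase(c)) low++;
            }
            return new double[] {up, low};
        }
    }

    // Shared tests are written once; subclasses only supply the descriptor.
    static abstract class ToyDescriptorTestBase {
        abstract ToyDescriptor descriptor();

        public void testValueNames() {
            String[] names = descriptor().valueNames();
            if (names == null || names.length == 0) throw new AssertionError("no value names");
        }

        // The declared number of values must match the number actually calculated.
        public void testCalculateMatchesNames() {
            int declared = descriptor().valueNames().length;
            int found = descriptor().calculate("CCO").length;
            if (declared != found) throw new AssertionError(declared + " declared, " + found + " found");
        }
    }

    static class LetterCountTest extends ToyDescriptorTestBase {
        ToyDescriptor descriptor() { return new LetterCount(); }
        // Descriptor-specific test, in addition to the two inherited ones.
        public void testCalculate() {
            if (descriptor().calculate("CCO")[0] != 3.0) throw new AssertionError("expected 3 letters");
        }
    }

    static class UpperLowerCountTest extends ToyDescriptorTestBase {
        ToyDescriptor descriptor() { return new UpperLowerCount(); }
    }

    // Runs every public no-argument test* method, inherited or not; returns how many ran.
    static int runTests(Object testInstance) {
        int run = 0;
        for (Method m : testInstance.getClass().getMethods()) {
            if (m.getName().startsWith("test") && m.getParameterCount() == 0) {
                try {
                    m.invoke(testInstance);
                } catch (ReflectiveOperationException e) {
                    throw new AssertionError(m.getName() + " failed", e);
                }
                run++;
            }
        }
        return run;
    }

    public static void main(String[] args) {
        int total = runTests(new LetterCountTest()) + runTests(new UpperLowerCountTest());
        System.out.println(total + " tests run"); // 3 + 2 = 5
    }
}
```

Each concrete test class inherits the two shared tests, so two test classes already give four shared test runs from two written methods. The same multiplication is what turns 10 shared tests into 450 runs across 45 descriptor classes.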

Now, this turned out to be necessary. I count 52 new failing tests, which should hit Nightly in the next 24 hours.


  1. What are the problems you are mentioning?

  2. Some inconsistency between the expected number of calculated values (the summed vector lengths) and the number actually found.

  3. Can you give me an example? The things I tried (using your interface) worked nicely.

    Please CC to my private mail.

    Also, JOELib has some feature and SMARTS filter classes that accept threshold values. Do you see any value in those? If yes, how might they be used within Bioclipse? Could molecule files be flagged using those filters? What about efficiency? Which classes would be responsible for caching content?

    Here is an example for converting molecular file formats.

  4. How to do the descriptor calculations with joelib2 from the browser?

  5. Anonymous... where does the browser come in? What did you have in mind? Web services?

  6. Actually, I want to make a web server which can do the descriptor calculation. Is it possible to do such a thing with JOELib?

    By the way, I am Vikash.

    Thanks for your reply.

  7. Vikash, surely that is possible. Many approaches are possible: SOAP, REST, HTML using JSP. I will shortly write up some experiences with JSP.

  8. Thanks for the reply.
    Actually, I already have an Apache server. I also have a Tomcat server for running JSP. Now I want to do the descriptor calculation with JOELib on these servers.
    So what can I do now?