Thursday, December 20, 2007

The molecular QSAR descriptors in the CDK

Pending the release of Bioclipse 1.2.0, Ola asked me to implement some additional functionality for the QSAR feature, such as using the filenames as labels in the descriptor matrix. See also these earlier items: (How much more open notebook science can you get?)

But I ran into some trouble when both JOElib and CDK descriptors were selected, or rather, Ola did. Now, I do not plan to do much about the JOElib code, but at least I could investigate the CDK code.

The QSAR descriptor framework has been published in the paper "Recent developments of the chemistry development kit (CDK) - an open-source java library for chemo- and bioinformatics" (DOI:10.2174/138161206777585274). However, while most molecular descriptors had JUnit tests for at least the calculate() method, full and proper module testing had not been set up. That involves a rough coverage test and test methods for all methods in the classes.

So, I set up a new CDK module called qsarmolecular, and added the coverage test class QsarmolecularCoverageTest. This class is really short and basically only requires a module to be set up, as reflected by the line:
private final static String CLASS_LIST = "qsarmolecular.javafiles";
The actual functionality is inherited from CoverageTest. Unlike tools like Emma, for which reports are generated by Nightly, this coverage testing requires a certain naming scheme (explained in "Development Tools. 1. Unit testing" in CDK News 2.2).
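
To give an idea of what a naming-scheme-based coverage check does, here is a minimal sketch in plain Java. It is not the CDK CoverageTest code. It assumes a simplified scheme in which a public method foo() is expected to be covered by a test method called testFoo(); the real scheme is the one described in CDK News 2.2. All class names below (ToyDescriptor, ToyDescriptorTest, NamingSchemeCoverage) are made up for the illustration.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical class under test (not a CDK class).
class ToyDescriptor {
    public String[] getDescriptorNames() { return new String[] {"toy"}; }
    public int calculate(String molecule) { return molecule.length(); }
}

// Hypothetical test class: it covers calculate() but not getDescriptorNames().
class ToyDescriptorTest {
    public void testCalculate() {
        if (new ToyDescriptor().calculate("CCO") != 3) throw new AssertionError();
    }
}

class NamingSchemeCoverage {

    // Returns the names of the test methods that the assumed scheme expects
    // but that are missing from the test class.
    static List<String> missingTests(Class<?> tested, Class<?> tests) {
        Set<String> testNames = new HashSet<>();
        for (Method m : tests.getMethods()) {
            testNames.add(m.getName());
        }
        List<String> missing = new ArrayList<>();
        for (Method m : tested.getDeclaredMethods()) {
            // Only public, hand-written methods need a test.
            if (!Modifier.isPublic(m.getModifiers()) || m.isSynthetic()) continue;
            String name = m.getName();
            String expected = "test" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            if (!testNames.contains(expected)) missing.add(expected);
        }
        Collections.sort(missing);
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingTests(ToyDescriptor.class, ToyDescriptorTest.class));
    }
}
```

The point of such a check is that it only looks at names, so it is cheap to run but says nothing about whether the test actually exercises the method; that is what tools like Emma are for.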

Now, the tests for a lot of the methods in the IMolecularDescriptor and IDescriptor interfaces are actually identical for all descriptors. Therefore, I wrote a MolecularDescriptorTest and made all JUnit test classes for the molecular descriptors extend this new class. This means that by writing only 10 new tests, with 29 assert statements, for the 45 molecular descriptor classes, 450 new unit tests are run without special effort, bringing the total number of unit tests run each night by Nightly for trunk/ past 4500.
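
The pattern of sharing tests through a common superclass can be sketched as follows. This is a simplified, JUnit-free illustration and not the actual MolecularDescriptorTest: the SimpleDescriptor interface, the two toy descriptors and the two checks are all hypothetical stand-ins. Each concrete test class only says which descriptor to create, and inherits every shared check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a descriptor interface; this is NOT the CDK API.
interface SimpleDescriptor {
    String[] getDescriptorNames();
    double[] calculate(String molecule);
}

class CharCountDescriptor implements SimpleDescriptor {
    public String[] getDescriptorNames() { return new String[] {"nChars"}; }
    public double[] calculate(String molecule) {
        return new double[] {molecule.length()};
    }
}

class CarbonCountDescriptor implements SimpleDescriptor {
    public String[] getDescriptorNames() { return new String[] {"nC"}; }
    public double[] calculate(String molecule) {
        long n = molecule.chars().filter(c -> c == 'C').count();
        return new double[] {(double) n};
    }
}

// The shared checks are written once, against the interface only.
abstract class SharedDescriptorTest {
    protected abstract SimpleDescriptor createDescriptor();

    void testNamesNotEmpty() {
        String[] names = createDescriptor().getDescriptorNames();
        if (names == null || names.length == 0) throw new AssertionError("no descriptor names");
    }

    void testOneValuePerName() {
        SimpleDescriptor d = createDescriptor();
        if (d.calculate("CCO").length != d.getDescriptorNames().length) {
            throw new AssertionError("value count does not match name count");
        }
    }

    // Runs all shared checks and returns how many were run.
    int runSharedChecks() {
        testNamesNotEmpty();
        testOneValuePerName();
        return 2;
    }
}

// Each concrete test class inherits every shared check for free.
class CharCountDescriptorTest extends SharedDescriptorTest {
    protected SimpleDescriptor createDescriptor() { return new CharCountDescriptor(); }
}

class CarbonCountDescriptorTest extends SharedDescriptorTest {
    protected SimpleDescriptor createDescriptor() { return new CarbonCountDescriptor(); }
}

class SharedTestsDemo {
    static int totalChecksRun() {
        List<SharedDescriptorTest> tests =
                Arrays.asList(new CharCountDescriptorTest(), new CarbonCountDescriptorTest());
        int total = 0;
        for (SharedDescriptorTest t : tests) total += t.runSharedChecks();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalChecksRun() + " checks run");
    }
}
```

Here 2 shared checks times 2 descriptor test classes give 4 checks run; the same multiplication is what turns 10 shared tests and 45 descriptor classes into 450 unit tests.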

Now, this turned out to be necessary. I count 52 new failing tests, which should hit Nightly in the next 24 hours.