Saturday, August 13, 2011

Rough guide to writing CDK descriptors

Some time ago Andrew asked me how to write molecular descriptor implementations for the CDK. I have no such chapter in my book yet, so at the time I wrote up a quick overview of the general steps. In the near future I will elaborate on them, but to make them easier to find again, here are those steps as I replied on FriendFeed at the time:

1. get yourself a working CDK development environment in NetBeans or Eclipse (I only have experience with the latter)
2. create a new class implementing the interface IMolecularDescriptor (or IAtomDescriptor, IBondDescriptor)
3. register your descriptor in the descriptor ontology (a Blue Obelisk project)
3a. open the file descriptor-algorithms.owl and look for an existing <Descriptor>, e.g. atomCount
3b. note the resource ID, which must be unique, and represents a full URI like
3c. that URI you will use in the .getSpecification() method of your descriptor implementation class, see e.g. in the
4. decide if your descriptor has parameters, which it does not have to
if it does not, .getParameters() should return a zero-length Object[] array
5. .getDescriptorNames() should return a label for each value you return: some descriptor algorithms return multiple values, like the BCUTDescriptor, while others return a single value, like the XLogP descriptor
oh, and I guess there is a step 0: decide which version you would like to develop against. The cdk-1.4.x branch is recommended at this moment, but master is good too. If you must, cdk-1.2.x is possible too, which means developing against the current stable release.
6. decide what your descriptor will return, as also touched on in step 5. A single value? A double, or an integer? A boolean perhaps? Or an array of integers? This is what the method .getDescriptorResultType() is about.
Check the implementations of BooleanResultType, DoubleArrayResultType, etc.
if you return an array of values, make sure the length of getDescriptorNames() is the same!
7. implement .calculate() -> that's where your code will do its thing
this method will return a DescriptorValue, which wraps provenance and the actual value
the actual value is an implementation of IDescriptorResult, but make sure to use the result class that matches the declared result type.
That is, if your descriptor returns something of BooleanResultType, the actual value is a BooleanResult
Check existing implementations, and don't be afraid of making mistakes in initial versions... we all do, like this bad code in the BCUT descriptor I just discovered :) ->
8. in case your descriptor calculation cannot be completed (e.g. it needs 3D coordinates, but there are none in the input), it must return NaNs, ensuring that the resulting descriptor matrix stays rectangular
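
To make steps 4 to 8 a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use the CDK classes: SketchDescriptor and ExtentSketchDescriptor are hypothetical stand-ins that only mirror the shape of the contract (a specification URI, parameters, names, and a calculate method), and the example.org URI is made up. A real implementation would implement IMolecularDescriptor and wrap its result in a DescriptorValue, so treat this as an illustration of the rules, not as CDK code.

```java
// Simplified stand-in for the descriptor contract. This is NOT the CDK API;
// all names here are hypothetical and exist only for this illustration.
interface SketchDescriptor {
    String getSpecificationUri();             // cf. step 3c: the unique ontology URI
    Object[] getParameters();                 // cf. step 4: zero-length if there are none
    String[] getDescriptorNames();            // cf. step 5: one label per returned value
    double[] calculate(double[][] coords3d);  // cf. step 7: the actual computation
}

// Toy descriptor: the extent of the molecule along the x, y and z axes.
class ExtentSketchDescriptor implements SketchDescriptor {

    public String getSpecificationUri() {
        // Made-up URI; a real descriptor uses the ID registered in the ontology.
        return "http://example.org/descriptors#extentSketch";
    }

    public Object[] getParameters() {
        return new Object[0]; // no parameters: empty array, never null
    }

    public String[] getDescriptorNames() {
        return new String[] {"extentX", "extentY", "extentZ"};
    }

    public double[] calculate(double[][] coords3d) {
        // The result always has as many entries as there are names.
        double[] result = new double[getDescriptorNames().length];
        if (coords3d == null || coords3d.length == 0) {
            // cf. step 8: no 3D coordinates, so return NaNs of the right length
            java.util.Arrays.fill(result, Double.NaN);
            return result;
        }
        for (int axis = 0; axis < result.length; axis++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] atom : coords3d) {
                min = Math.min(min, atom[axis]);
                max = Math.max(max, atom[axis]);
            }
            result[axis] = max - min;
        }
        return result;
    }
}
```

The point of the NaN branch is that a failed calculation still yields a row of the same width as a successful one, which is what keeps the descriptor matrix rectangular.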
That's more or less it... questions?
Thanks Egon - I'm sure I will have questions. :) - Andrew Lang
Make sure to check the PaDEL project, which already has a number of descriptors implemented against the CDK API, but not yet ported to the CDK itself: We started an attempt here →
Oh, and add a unit test like this: (it will automatically detect inconsistency problems, and helps you get your implementation right)
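
The kind of inconsistency such a test can catch can be sketched in plain Java. The class and method below (DescriptorConsistencySketch.findProblems) are hypothetical and not part of the CDK test suite; they only assume the rules stated in the steps above: a specification URI must be present, the parameters must be an array (possibly empty) and not null, and the number of names must equal the number of returned values.

```java
// Hypothetical helper, not CDK code: collects violations of the rules
// from the steps above for one descriptor calculation.
class DescriptorConsistencySketch {

    static java.util.List<String> findProblems(String specificationUri,
                                               Object[] parameters,
                                               String[] names,
                                               double[] values) {
        java.util.List<String> problems = new java.util.ArrayList<>();
        if (specificationUri == null || specificationUri.isEmpty()) {
            problems.add("missing specification URI");
        }
        if (parameters == null) {
            problems.add("parameters must be a zero-length array, not null");
        }
        if (names == null || values == null) {
            problems.add("names and values must not be null");
        } else if (names.length != values.length) {
            problems.add("got " + names.length + " names for "
                    + values.length + " values");
        }
        return problems;
    }
}
```

A unit test would then simply assert that the returned list is empty for your descriptor, both for a normal input and for an input where the calculation cannot be completed.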