Saturday, August 20, 2011

Speeding up the CDK: Morgan numbers

Thorsten Flügel found a nice speed up for the CDK as part of the work in Dortmund on Scaffold Hunter: calculation of Morgan numbers. He has actually written a set of patches, and analyzed several bottlenecks. I expect more of that work to enter the CDK. Below is my observation of the speed up:

The patch for this has been pushed to cdk-1.4.x now.

Calculation of Morgan numbers is used (canonical) SMILES generation, but also in the isomorphism checker, so the performance boost is probably going to show up at many places. Got numbers? Blog them!


  1. atomContainer.getConnectedBonds() and atomContainer.getConnectedBondsCount() are bottlenecks in the atomcontainer ( I explored it long back while developing SMSD.)

    That's one of the reasons I use to speed up the look up part.

    Another option which I tried was which works on the Adjacency theory and hence look ups are faster.

    Now present Atomcontainer is memory efficient but Adjacency based atomcontainer has better speed.

    Again it all depends on the classical comp sc question.... memory vs speed...

  2. And, this is exactly why we have the interface and their implementations split apart, and why I am working towards making all CDK modules not depend on the data module. That way, we can have two implementations, one optimized for speed, one optimized for memory usage.

    Now, I did play in the past with alternatives for getConnectedBonds() too, but never found a solution at the time that demonstrated clear performance boosts, though I have may have simply been testing on the wrong data :)

  3. "interface" REALLY HELPS!

    By the way getConnectedAtoms() too suffers from this lag.

    Yes it would be nice to have two ways to deal with it rather than the IMolecule and IAtomcontainer which is still confusing for me (connectivity checker is not implemented) ...rather they can support two different optimised for graphs and other for general use (basic) :-)

  4. IMolecule(/Set) is going to disappear from the master branch soon.

    But that would be a bad place to differentiate between implementations. That should be at the implementation side, not defined by interfaces; not in this case, IMHO.