Thursday, August 14, 2008

Profiling the CDK atom typer

I was doing some profiling (YourKit and Eclipse 3.4) of the CDK atom typer, and it turns out that most of the time is spent on the perception of nitrogen atom types, which seems to be caused by the loadClassInternal() method of the JVM (java-1.5.0-sun- on Ubuntu Hardy):


  1. Might it be the case that there are more N atom types than other atom types?

  2. Yeah, should have copied those numbers in the screenshot too... Sorry about that.

    174 nitrogens (~77% of the time), 528 carbons (~13% of the time), etc...

    It's really out of proportion. The screenshot does show that ring detection, which is a required step, is the problem, but it really worries me that loadClassInternal() accounts for so much of this time...

    I've spoken with the people from Classpath, and they recommended longer profiling runs: this one took just 15 seconds (1.5 seconds without profiling), so it might simply be a JVM startup issue (though the RingSet loading happens long after all other IChemObject classes are loaded). I'll rerun the profiling against the 1000 ZINC structures as soon as possible.

  3. Sounds familiar, and I am not surprised. JOELib2 uses a lot of reflection patterns for flexibility, but the run time is not optimal, since this is done even at the atom and bond level.

    I guess in the long term the only workaround that is both flexible and fast would be a compiler-compiler solution instead of reflection, especially with the goal of caching as much as possible and minimizing expensive object or reflection-object creation by pushing such things to a higher level or changing data structures.
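
The caching idea in the last comment can be sketched roughly as follows. This is only an illustration and not code from JOELib2 or the CDK: the `Atom` class, the `symbolOf()` helper, and the cache layout are all hypothetical. The point is just that the reflective lookup happens once, at a higher level, instead of once per atom.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReflectionCacheSketch {

    // Hypothetical stand-in for a per-atom object; not a CDK or JOELib2 class.
    public static class Atom {
        private final String symbol;
        public Atom(String symbol) { this.symbol = symbol; }
        public String getSymbol() { return symbol; }
    }

    // Resolved Method objects, keyed by "className#methodName".
    private static final Map<String, Method> CACHE = new ConcurrentHashMap<>();

    // Counts how often the (comparatively expensive) reflective lookup ran.
    // A plain int is enough for this single-threaded sketch.
    private static int lookups = 0;

    // Returns a cached Method for a no-argument method, resolving it on first use.
    static Method cachedMethod(Class<?> type, String name) {
        String key = type.getName() + "#" + name;
        Method method = CACHE.get(key);
        if (method == null) {
            try {
                method = type.getMethod(name);
            } catch (NoSuchMethodException e) {
                throw new IllegalStateException("No such method: " + key, e);
            }
            lookups++;
            CACHE.put(key, method);
        }
        return method;
    }

    // Per-atom call: only the invoke() is paid each time, not the lookup.
    static String symbolOf(Atom atom) {
        try {
            return (String) cachedMethod(Atom.class, "getSymbol").invoke(atom);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static int lookupCount() {
        return lookups;
    }

    public static void main(String[] args) {
        Atom[] atoms = { new Atom("N"), new Atom("C"), new Atom("O") };
        for (Atom atom : atoms) {
            System.out.println(symbolOf(atom));
        }
        System.out.println("reflective lookups: " + lookupCount());
    }
}
```

Whether this helps in a given code base depends on where the time actually goes, so it is worth confirming with a longer profiling run before restructuring anything.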