Wednesday, December 28, 2005

The good, the bad and the ugly molecules

Derek Lowe is the author of the blog In the Pipeline, which is really fun to read. Derek works in the pharmaceutical industry and gives great insight into how things work in that field of the molecular sciences. Yesterday he blogged about What Makes an Ugly Molecule?, touching on the Rule-of-Five, the hydrochloric acid bath (aka the stomach), and other reasons that make molecules ugly.

But there are many other interesting posts and, something my blog still lacks, comments by many users discussing the ideas he posts, which makes his blog even nicer.

Tuesday, December 27, 2005

Knoppix saves the day...

After the three obligatory days of Christmas holidays (fun, especially with two children, but very exhausting), it is time to get back to business again. I'm still at my father-in-law's place with only XP installed, so I booted the Knoppix 4.0.2 DVD I burned last Friday. Eclipse is not working, but being able to use KMail to read my email again is just what you need as an internet junkie. A computer is just not complete without a nice KDE session hanging around.

Anyway, I started Eclipse on my computer at work and tunneled the window over SSH. Not overly fast, but it seems to run fine. (If only I knew how to set up NX on that Kubuntu Breezy system!) Let's see if I can get the CDK bug count somewhat lower.

Friday, December 23, 2005

Subset selection: mind the complexity

In a recent JCIM article, Schuffenhauer compares a few subset selection methods, and notes that some of them reduce the average complexity of the molecules. The article puts this in relation to other research stating that lead compounds with high complexity have higher activities. Recommended reading material for the holidays.

Sunday, December 18, 2005

StatCVS on CDK

One of the Classpath developers pointed me to their CVS statistics when I asked them how actively their project is currently developed, i.e. the number of active developers.

The pages are generated with StatCVS, so I ran it on the CDK too:

I knew I did a lot of work on the CDK, but never realized that 62.7% of the commits were mine! Keep in mind, though, that a lot of these commits are for code maintenance! Next in line are steinbeck and rajarshi. In total 28 people committed patches to CVS, though other people contributed patches too, which were committed by a developer with write access. There is a jump in the commit messages somewhere this summer, which I think is the move of the data directory from cdk/data to cdk/src/data.

The full analysis results can be found here. It was generated with the StatCVS version in sid, and I will rerun it soon with a more recent StatCVS version.

Friday, December 16, 2005

CDK Debug classes and fixing the ModelBuilder3D bug

For some weeks now I have been thinking about bug 1309731: "ModelBuilder3D overwrites Atom IDs". The ModelBuilder3D is a complex piece of source code, reusing many other parts of the CDK, including atom type perception.

Somewhere in October, however, I found that Taverna could not create 3D models and convert these into reasonable CML because the Atom IDs were messed up. So the question is, where did the ModelBuilder3D do this? Did it do this itself, or is it done by one of the other pieces of CDK that it uses? Due to the complex nature of this algorithm, it quickly became clear that looking at the code was not going to solve it; there was too much code to look at.

The solution was clear to me: use the new data interfaces. To identify where the IDs were messed up, I only needed to write a DebugAtom class with a method that looked like:

public void setID(String identifier) {
    logger.debug("Setting ID: ", identifier);
    super.setID(identifier);
}

And I would immediately see at what stage the ID was overwritten.
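
As a minimal, self-contained sketch of this idea: the names SimpleAtom and LoggingAtom below are made up for illustration and are not CDK classes, and an in-memory list stands in for the CDK logger. The debug subclass records each call and then hands over to the normal implementation:

```java
import java.util.ArrayList;
import java.util.List;

class LoggingAtomDemo {
    public static void main(String[] args) {
        SimpleAtom atom = new LoggingAtom();
        atom.setID("carbon1");
        atom.setID("C"); // an unwanted overwrite shows up as a second log line
        for (String line : LoggingAtom.LOG) {
            System.out.println(line);
        }
    }
}

// Hypothetical stand-in for a data class; only the ID handling is modelled.
class SimpleAtom {
    private String id;

    public void setID(String identifier) {
        this.id = identifier;
    }

    public String getID() {
        return id;
    }
}

// Debug variant: records every ID change, then delegates to the parent class.
class LoggingAtom extends SimpleAtom {
    // In-memory log, an assumption made so the sketch needs no logging framework.
    static final List<String> LOG = new ArrayList<String>();

    @Override
    public void setID(String identifier) {
        LOG.add("Setting ID: " + identifier);
        super.setID(identifier);
    }
}
```

Because the subclass only adds the log line and calls super.setID(), code that works with the plain class keeps working when handed the debug class.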

So I started this week to implement the DebugAtom and related classes. By extending Atom, I could just add debugging stuff and reuse the code in that class. However, the DebugAtom cannot then extend DebugAtomType too. And this is a pity, because all methods that the Atom interface inherits from the AtomType, Isotope, Element and ChemObject interfaces cannot be inherited from the DebugAtomType class. Instead, the debug classes now have to duplicate those bits of code.

This is not a clean solution, as duplicate code is a known cause of bugs. So, the next step was to write JUnit tests for the new debug classes. And for this I wanted to reuse, i.e. extend, the tests for the default data classes. This required, however, changes to those test classes.

The first thing that needed to be changed was that instantiation of data classes in the tests would now have to depend on the data classes being tested. A simple

Atom atom = new Atom("C");

only makes sense when a specific Atom class is important. Fortunately, the new interfaces provide a solution for this: the ChemObjectBuilder implementations. These allow the following syntax to replace the hard-coded instantiation:

Atom atom = builder.newAtom("C");

Therefore, I added a protected field to the AtomTest, which was instantiated in the setUp():

protected ChemObjectBuilder builder;

public void setUp() {
    builder = DefaultChemObjectBuilder.getInstance();
}

and use this builder to instantiate all test objects, as shown for the atom above.

And then I can simply reuse this JUnit test by defining the DebugAtomTest like:

public class DebugAtomTest extends AtomTest {

    public DebugAtomTest(String name) {
        super(name);
    }

    public void setUp() {
        super.builder = DebugChemObjectBuilder.getInstance();
    }

    public static Test suite() {
        return new TestSuite(DebugAtomTest.class);
    }
}

The sources for these debug data classes tests are found in the new cdk.test.debug package.
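
The reuse pattern can be shown without JUnit or the CDK. The following is only a sketch: Particle, ParticleBuilder and the test classes are hypothetical names, and the test method simply returns a boolean instead of using JUnit assertions. The point is that the subclass changes nothing but the builder, and every inherited test then runs against the other implementation:

```java
class BuilderReuseDemo {
    public static void main(String[] args) {
        ParticleTest[] tests = { new ParticleTest(), new DebugParticleTest() };
        for (ParticleTest test : tests) {
            test.setUp();
            System.out.println(test.getClass().getSimpleName()
                + " passed: " + test.testGetSymbol());
        }
    }
}

// Hypothetical miniature of the data interfaces; none of these names are CDK API.
interface Particle {
    String getSymbol();
}

interface ParticleBuilder {
    Particle newParticle(String symbol);
}

class DefaultParticle implements Particle {
    private final String symbol;

    DefaultParticle(String symbol) {
        this.symbol = symbol;
    }

    public String getSymbol() {
        return symbol;
    }
}

class DebugParticle extends DefaultParticle {
    static int created = 0; // counts how many debug instances were built

    DebugParticle(String symbol) {
        super(symbol);
        created++;
    }
}

// The base test never names a concrete class; it only asks the builder.
class ParticleTest {
    protected ParticleBuilder builder;

    public void setUp() {
        builder = DefaultParticle::new;
    }

    public boolean testGetSymbol() {
        Particle particle = builder.newParticle("C");
        return "C".equals(particle.getSymbol());
    }
}

// The debug test inherits all test methods and only swaps the builder.
class DebugParticleTest extends ParticleTest {
    @Override
    public void setUp() {
        builder = DebugParticle::new;
    }
}
```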

The number of JUnit tests for the CDK jumped from around 1250 to over 1500 tests right now. And if you think these new tests only test old code, because of all the super.bla() calls in the debug classes, you're way off. I found bugs in the new debug classes, but also many class cast bugs and several other problems in the real data classes!

Anyway. Does this help fix the ModelBuilder3D bug? Yes, it does:

$ grep "Setting ID" reports/result.modeling.builder3d.ModelBuilder3dTest.txt
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: carbon1
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: oxygen1
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: C
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: O
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HO

This shows me where the Atom ID is overwritten to be something other than "carbon1"! I can now look at the rest of the result.modeling.builder3d.ModelBuilder3dTest.txt file to see what the ModelBuilder3D was doing at the time, and which CDK class made the setID() call.
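
I found the caller by reading the rest of the log file. A different option, which is not what the debug classes do, is to let the setter itself record who called it by inspecting the current stack trace. A minimal sketch, with TracedAtom and TypeAssigner as made-up names:

```java
class CallerTraceDemo {
    public static void main(String[] args) {
        TracedAtom atom = new TracedAtom();
        TypeAssigner.assign(atom);
        System.out.println(atom.getID() + " set by " + atom.getLastCaller());
    }
}

// Hypothetical stand-in for a debug atom; not a CDK class.
class TracedAtom {
    private String id;
    private String lastCaller = "unknown";

    public void setID(String identifier) {
        // Frame 0 is this setID() call, frame 1 is the method that invoked it.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        if (stack.length > 1) {
            lastCaller = stack[1].getClassName() + "." + stack[1].getMethodName();
        }
        id = identifier;
    }

    public String getID() {
        return id;
    }

    public String getLastCaller() {
        return lastCaller;
    }
}

// Hypothetical stand-in for a class that renames atoms as a side effect.
class TypeAssigner {
    static void assign(TracedAtom atom) {
        atom.setID("C");
    }
}
```

Creating a Throwable for every call is slow, so this only makes sense in debugging sessions.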

I only needed to change this line in the JUnit test for the bug to generate the above debug lines:

Molecule methanol = new Molecule();

into

Molecule methanol = new DebugMolecule();

Tuesday, December 13, 2005

Math libraries for Java?

I drop in on the #classpath channel on the IRC network where the #cdk channel runs too. The #classpath channel is for the Classpath project, which is developing the free Java libraries used by most open source virtual machines.

An item titled "Java Is So 90s" was mentioned. It led to a funny discussion about what that would make C/C++ and Fortran. A more serious question was brought up: where are the efficient and super fast Java linear algebra and complex number libraries?

There is Weka, but it is more aimed at data analysis. I believe it has support for principal component analysis, so it must have singular value decomposition. There is also a book called Java Number Cruncher: The Java Programmer's Guide to Numerical Computing by Ronald Mak, 2003, Prentice Hall.

After some further asking about it on the channel, they mentioned the Apache commons math project, which seems promising. The website mentions complex numbers, linear algebra, statistics and numerical analysis, but I have not looked at the full API, so I am not sure how well populated these areas are.

Anyone with experience in the area of numerical computing and Java?

Saturday, December 10, 2005

Jumbo 5.0 and the CDK

I reported earlier that the CDK has been updated in CVS to use CML from the new Jumbo 5.0. The transition actually involved a lot of changes in the CDK, some of which I would like to address in the following comments. One thing is that CML write support (not reading!) uses the new Jumbo library, which requires Java 1.5. Thus, if Java 1.5 is not available, then CML writing should not be compiled. This is how this is done.

The JavaDoc

The CDK makes extensive use of JavaDoc taglets. CDK uses tags of type @cdk.SOMETAG. An important tag in this case is the @cdk.require tag, because it allows us to make the CDK build system aware that the class requires Java 5.0 to be compiled. Thus, we have for example this code in CVS, of which bits are:

/**
 * Serializes a SetOfMolecules or a Molecule object to CML 2 code.
 * Chemical Markup Language is an XML based file format {@cdk.cite PMR99}.
 * Output can be redirected to other Writer objects like StringWriter
 * and FileWriter.
 *
 * @cdk.module libio-cml
 * @cdk.builddepends xom-1.0.jar
 * @cdk.depends jumbo50.jar
 * @cdk.require java1.5
 */
public class CMLWriter extends DefaultChemObjectWriter {

As is probably clear, this class needs two jars to be present, of which jumbo50.jar itself is not required for compiling the class source code. It also shows the use of the @cdk.require tag.

The build.xml

Because the CDK still does not require Java 1.5, the CDK is supposed to be buildable with Java 1.4 (the oldest supported Java release). The Ant build.xml script is quite able to conditionally leave out compiling parts of the CDK, if configured correctly using proper JavaDoc tags, as explained earlier.

First, the build.xml checks what libraries are available for compiling certain parts of the CDK. For example, the build.xml code to check for Java 1.5 looks like:

<condition property="isJava15">
    <contains string="${java.version}" substring="1.5"/>
</condition>

Run ant info to see what is being checked for, or look at the build.xml source code for the check target.
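
To make clear what this condition does and does not match, here is the same substring test written in Java, next to a stricter numeric comparison. This is only a sketch: the class and method names are made up, and the numeric variant assumes that the first two components of the version string are plain numbers.

```java
class JavaVersionCheck {
    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println(version + " contains 1.5: " + containsCheck(version));
        System.out.println(version + " is at least 1.5: " + atLeast15(version));
    }

    // Same test as the Ant condition: does the version string contain "1.5"?
    static boolean containsCheck(String version) {
        return version.contains("1.5");
    }

    // Hypothetical stricter variant: compare the major and minor numbers.
    // Assumes the first two components are numeric, as in "1.5.0_06".
    static boolean atLeast15(String version) {
        String[] parts = version.split("[._-]");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major > 1 || (major == 1 && minor >= 5);
    }
}
```

The substring test is true for a version string like "1.5.0_06", but false for "1.6.0", so a build that should also enable the Java 1.5 code on later releases would need something like the second check.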

All compiling is done by the compile-module target, which includes and excludes parts of the CDK depending on the checked conditions:

<javac srcdir="${build.src}" destdir="${build}" optimize="${optimization}"
       debug="${debug}" deprecation="${deprecation}">

    <excludesfile name="${src}/java1.4+.javafiles" if="isJava13"/>
    <excludesfile name="${src}/java1.4.javafiles" unless="isJava14"/>
    <excludesfile name="${src}/java1.5.javafiles" unless="isJava15"/>
    <excludesfile name="${src}/ant1.6.javafiles" unless="hasAnt16"/>
    <excludesfile name="${src}/r-project.javafiles" unless="rispresent"/>

    <includesfile name="${src}/${module}.javafiles"/>
</javac>

Keep in mind that the *.javafiles are created with JavaDoc based on the CDK JavaDoc tags mentioned earlier.

The build.xml 2

While the above mechanism has been present for some time now, having jumbo50.jar in CVS made the situation a bit trickier: the jumbo50.jar uses the 49.0 class format used in Java 1.5, and cannot be processed by Java 1.4 systems. Since the classpath used when compiling CDK source code is defined in configuration files for those modules in src/META-INF, the problem did not occur when compiling the modules. However, it did show an error in the reallyRunDoclet target today, when I was creating the *.javafiles with JavaDoc. The solution was trivial:

<target name="reallyRunDoclet" id="reallyRunDoclet"
        depends="compileDoclet" unless="dotjavafiles.uptodate">
    <javadoc private="true" maxmemory="128m">
        <fileset dir="${lib}">
            <include name="*.jar" />
            <!-- some jars require some Java version -->
            <exclude name="jumbo50.jar" unless="isJava15"/>
        </fileset>
        <fileset dir="${lib}/libio">
            <include name="*.jar" />
        </fileset>
        <fileset dir="${devellib}">
            <include name="*.jar" />
        </fileset>

        <doclet name=""

        <packageset dir="${src}">
            <include name="org/openscience/cdk/**"/>
        </packageset>
    </javadoc>
</target>


There is another area of interest: the FileConvertor, which is, sort of, CDK's variant of OpenBabel's babel. The FileConvertor must be compiled in all cases, so we need to conditionally instantiate the CMLWriter, which is not really a problem. However, compiling the source code is more troublesome: the CMLWriter class must be loaded at runtime, and must not occur hard-coded in the source code.

In the past I have solved this by using .getInstance() constructs, but the ChemObjectWriter interface does not define this functionality, so I decided to use the java.lang.reflect mechanism:

} else if (format.equalsIgnoreCase("CML")) {
    Class cmlWriterClass = this.getClass().getClassLoader().
    if (cmlWriterClass != null) {
        writer = (ChemObjectWriter)cmlWriterClass.newInstance();
        Constructor constructor = writer.getClass().getConstructor(new Class[]{Writer.class});
        writer = (ChemObjectWriter)constructor.newInstance(new Object[]{fileWriter});
    } else {
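
The general technique behind this fragment can be tried out in isolation. The sketch below is not the FileConvertor code: the OptionalWriterLoader class and its method are hypothetical, and java.io.BufferedWriter merely stands in for an optional class that may or may not be on the classpath. It loads a class by name, looks up the constructor that takes a Writer, and returns null when anything is missing:

```java
import java.io.StringWriter;
import java.io.Writer;
import java.lang.reflect.Constructor;

class OptionalWriterLoader {
    public static void main(String[] args) {
        Writer sink = new StringWriter();
        // BufferedWriter is a stand-in for an optional writer class.
        Object present = newWriterOrNull("java.io.BufferedWriter", sink);
        // A made-up class name that cannot be loaded.
        Object missing = newWriterOrNull("example.NoSuchWriter", sink);
        System.out.println("optional class loaded: " + (present != null));
        System.out.println("missing class handled: " + (missing == null));
    }

    /**
     * Tries to load the named class and call its public constructor that
     * takes a single Writer. Returns null when the class or the constructor
     * is unavailable, so the caller can fall back to something else.
     */
    static Object newWriterOrNull(String className, Writer target) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> constructor = clazz.getConstructor(Writer.class);
            return constructor.newInstance(target);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Covers a missing class, a missing constructor, and classes
            // compiled for a newer class format than this JVM supports.
            return null;
        }
    }
}
```

Since the class name only appears as a string, the calling code compiles even when the optional class is absent at build time.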

Now, this has been, by far, the longest blog item I have written so far. I hope it gave you good insight into some techniques CDK uses to deal with situations where functionality might, or might not, be present at build and at run time.

Thursday, December 08, 2005

Jumbo 5.0 and CML support in CDK

Tobias committed Jumbo 5.0 to CDK CVS, so that the CDK is now again up to date with the latest CML library. Note that Jumbo 5.0 requires Java 5.0.

At first all JUnit tests seemed to work, but apparently the CML2Writer tests were skipped because they were only run when Java 1.4 was found. I updated the test for the appropriate Java version, and then it turned out that most tests fail. So, for those running CDK from CVS who depend on CML writing: hang on, it will be fixed very soon.

Tuesday, December 06, 2005

UML diagram of CDK module dependencies

The code clean up after CDK's interfaces transition is in progress, and two CDK modules are now independent of the data module. After doing the core module, the standard was next, and I finished this yesterday. The dependencies in CVS now look like (click it to get a larger view):

This UML diagram was made with , and the source is in XMI in CVS.

I cannot stress enough the advantages of these changes:

  1. the code is cleaner
  2. module dependencies are cleaner
  3. impossible to use methods outside the interface
  4. the algorithms are independent of the data classes

The last advantage is really important: it allows alternative implementations of the data classes. For example, we could make debug data classes, which, unlike the normal classes, do all sorts of checks when using methods of these classes. For example, they can explicitly check that parameters are not null, of the right class, and generally make sense. This makes them, possibly, slower, but also more type safe, and as such great for debugging and development sessions.
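
A small sketch of what this could look like, with made-up names (AtomLike, PlainAtom, CheckedAtom and SymbolCounter are not CDK classes): the algorithm is written against the interface only, so it accepts the plain and the checking implementation alike.

```java
import java.util.Arrays;
import java.util.List;

class InterfaceDemo {
    public static void main(String[] args) {
        AtomLike carbon = new PlainAtom();
        carbon.setSymbol("C");
        AtomLike oxygen = new CheckedAtom();
        oxygen.setSymbol("O");
        List<AtomLike> atoms = Arrays.asList(carbon, oxygen);
        System.out.println("carbon atoms: " + SymbolCounter.count(atoms, "C"));
    }
}

// Hypothetical data interface; the algorithm below depends on nothing else.
interface AtomLike {
    String getSymbol();
    void setSymbol(String symbol);
}

// Normal implementation: fast, no checks.
class PlainAtom implements AtomLike {
    private String symbol;

    public String getSymbol() {
        return symbol;
    }

    public void setSymbol(String symbol) {
        this.symbol = symbol;
    }
}

// Debug implementation: validates parameters before delegating.
class CheckedAtom extends PlainAtom {
    @Override
    public void setSymbol(String symbol) {
        if (symbol == null || symbol.isEmpty()) {
            throw new IllegalArgumentException("symbol must be a non-empty string");
        }
        super.setSymbol(symbol);
    }
}

// An "algorithm" that only knows the interface, not the data classes.
class SymbolCounter {
    static int count(List<? extends AtomLike> atoms, String symbol) {
        int total = 0;
        for (AtomLike atom : atoms) {
            if (symbol.equals(atom.getSymbol())) {
                total++;
            }
        }
        return total;
    }
}
```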

Another important application of making the CDK library independent of the data classes (and only depending on the interfaces), is that we can have data classes shared with other Java libraries, such as JOElib, Octet, CML (Jumbo 5.0 is out!), and even proprietary libraries. This approach is already used in the CDK-Taverna library, and I anticipate much wider use with the arrival of Bioclipse.

Sunday, December 04, 2005

Planet Blue Obelisk website updates

Following requests, I yesterday made the RSS and Atom feeds for the Planet Blue Obelisk more visible. They are linked in the menu on the right, and as alternative links to the document. The latter should show up in most recent web browsers as a feed icon in the lower right corner of the browser window; it is often an orange icon. I also added a 'Leave a comment' link to encourage people to leave comments on items. Please do!

Saturday, December 03, 2005

About JChemPaint's future and today's 2.1.5 release

Stefan has done an excellent debugging week on JChemPaint, while I have been late with a 2.1 release. Anyway, I've just uploaded a Java 1.4 compiled JChemPaint 2.1 series release. I was told the (reported) bug count is down to one, so I expect the next stable branch (the 2.2 series) to be released soon.

But what happens after JChemPaint 2.2 gets released? Will a 2.3 developers branch be opened? Or will the JChemPaint application, as we know it, cease to exist, and make way for the Bioclipse JChemPaint plugin that is being worked on?

It is worth mentioning the pros and cons of JChemPaint. One big pro is the applet version of JChemPaint, though free but closed source alternatives are available (e.g. MarvinSketch). Another advantage is the great semantics of the chemistry being drawn. For example, when drawing reactions, reactants are really marked as reactants, and are not just molecules left of an arrow. Moreover, JChemPaint is a great platform in which ideas can be tested! That is one of the key virtues of open source. Cons include the limited number of templates, the limited print-quality graphics, and others. (Comments on JChemPaint are most welcome.)

So what about this Bioclipse then? It is inherently SWT based, but currently the SWT_AWT bridge is used to embed the current JChemPaint and underlying CDK code as is. Unfortunately, this bridge uses proprietary code from Sun (sun.awt classes), which makes it impossible to use with free virtual machines.

But there is also the option of using the SWT drawing classes. This has the advantage that it can be run with free virtual machines, and that it can even be compiled to native code. It requires serious rewriting of code in the JChemPaint and CDK code base. But CDK's Renderer2D needs a rewrite anyway: it does not even use Swing's Java2D efficiently (try to figure out how it transforms atomic 2D coordinates into screen coordinates!). Some efforts have been ongoing, but a rewrite from scratch, with a better, more modular design, cannot hurt at all.
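
To illustrate what Java2D offers for the coordinate problem, here is a sketch (not Renderer2D code; the class and method names are made up) that builds one java.awt.geom.AffineTransform mapping model coordinates into a screen rectangle. It keeps the aspect ratio, centers the drawing and flips the y axis, and it assumes the model bounds have a non-zero width and height.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

class ModelToScreen {
    public static void main(String[] args) {
        Rectangle2D model = new Rectangle2D.Double(0, 0, 2, 1);      // atomic 2D bounds
        Rectangle2D screen = new Rectangle2D.Double(0, 0, 200, 200); // drawing area
        AffineTransform transform = fit(model, screen);
        Point2D onScreen = transform.transform(new Point2D.Double(1.0, 0.5), null);
        System.out.println(onScreen.getX() + ", " + onScreen.getY());
    }

    /**
     * Builds a transform that maps the model bounds into the screen rectangle.
     * Assumes model width and height are both greater than zero.
     */
    static AffineTransform fit(Rectangle2D model, Rectangle2D screen) {
        double scale = Math.min(screen.getWidth() / model.getWidth(),
                                screen.getHeight() / model.getHeight());
        AffineTransform transform = new AffineTransform();
        // Calls are concatenated, so the last one is applied to a point first.
        // Applied third: move the origin to the center of the screen area.
        transform.translate(screen.getCenterX(), screen.getCenterY());
        // Applied second: scale, with negative y because screen y grows downwards.
        transform.scale(scale, -scale);
        // Applied first: move the center of the model to the origin.
        transform.translate(-model.getCenterX(), -model.getCenterY());
        return transform;
    }
}
```

With such a transform, all atoms can be converted in one go, and the inverse transform maps mouse positions back to model coordinates.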