Saturday, September 22, 2012

OMG! An Open Molecule Generator!

Earlier this week an important cheminformatics paper appeared in the Journal of Cheminformatics. It is about the Open Molecule Generator (see below for the paper). This was one important piece of functionality still missing from Open Source cheminformatics. This work uses the Chemistry Development Kit, and was written by Julio Peironcely.

The Analytical Biosciences group of Prof. Hankemeier (with many others, including Theo Reijmers), funded by the Netherlands Metabolomics Centre, has been using the CDK for metabolomics for a while now, with Miguel Rojas-Chertó as the other principal user (and of course CDK developer!). I congratulate them all on this piece of work, and particularly on their choice of license!

Julio (with the other authors) has picked up a difficult algorithm, based in mathematics, but not straightforward graph theory either. Others have tried to implement structure generation in the CDK, and I looked into this too, when working in Christoph Steinbeck's group back in Cologne. What the OMG team has achieved is significant.
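To make concrete what "structure generation" means, here is a minimal brute-force sketch in Python. It is not the OMG algorithm (which, as far as I understand the paper, relies on canonical labelling via nauty rather than trying every permutation), and it assumes fixed valences instead of the CDK atom types. All names here (`VALENCE`, `is_connected`, `canonical`, `isomers`) are made up for illustration. It enumerates bond orders between the heavy atoms, lets hydrogens fill the remaining valences, and keeps one representative per isomorphism class.

```python
from itertools import combinations, permutations, product

# Assumed fixed valences; real CDK atom types are far richer than this.
VALENCE = {"C": 4, "N": 3, "O": 2}


def is_connected(n, bond):
    """Depth-first search over heavy-atom bonds with order > 0."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j not in seen and bond.get((min(i, j), max(i, j)), 0) > 0:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def canonical(bond, pairs, perms):
    """Smallest bond-order tuple over all element-preserving relabelings."""
    return min(
        tuple(bond[tuple(sorted((p[i], p[j])))] for i, j in pairs)
        for p in perms
    )


def isomers(heavy_atoms, n_hydrogens):
    """Return one canonical bond-order tuple per constitutional isomer."""
    n = len(heavy_atoms)
    pairs = list(combinations(range(n), 2))
    # Only permutations that map each atom onto an atom of the same element.
    perms = [p for p in permutations(range(n))
             if all(heavy_atoms[i] == heavy_atoms[p[i]] for i in range(n))]
    # Each hydrogen uses one valence; the rest must go into heavy-atom bonds.
    free = sum(VALENCE[a] for a in heavy_atoms) - n_hydrogens
    if free < 0 or free % 2:
        return set()
    found = set()
    for orders in product(range(4), repeat=len(pairs)):  # bond orders 0..3
        if 2 * sum(orders) != free:
            continue
        bond = dict(zip(pairs, orders))
        degree = [0] * n
        for (i, j), order in bond.items():
            degree[i] += order
            degree[j] += order
        if any(degree[i] > VALENCE[heavy_atoms[i]] for i in range(n)):
            continue
        if not is_connected(n, bond):
            continue
        found.add(canonical(bond, pairs, perms))
    return found


# C2H6O: ethanol and dimethyl ether
print(len(isomers(list("CCO"), 6)))  # → 2
```

The cost grows factorially with the number of atoms, which is exactly why a real generator needs a smarter way to reject duplicates than this.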

The paper compares their results with MolGen, with outcomes like those in this table (from the CC-SA-BY paper):

It shows that the results are identical, when you consider the atom types it uses. And they use the CDK atom type framework I initiated, which is way cool! Julio found the tables I constructed from earlier CDK code incomplete (as did others) and extended them to match their needs.

One "problem" with their current code base is that it is quite slow compared to MolGen. This is easily compensated by the added functionality of OMG, such as restricting the structure generation with multiple fragments. Now, the CDK data classes are known to be somewhat sluggish compared to the competition, but the community is steadily improving this.

But I also think that OMG's use of Nauty via JNI is not helping performance either, and I hope someone will soon jump in and convert that C code into Java code, which should speed things up too. Another side to this is that removing the dependency on C code will also make it easier to integrate into other tools, like Bioclipse, Taverna, and KNIME.

Julio E Peironcely, Miguel Rojas-Chertó, Davide Fichera, Theo Reijmers, Leon Coulier, Jean-Loup Faulon, & Thomas Hankemeier (2012). OMG: open molecule generator. Journal of Cheminformatics, 4. DOI: 10.1186/1758-2946-4-21

1 comment:

  1. It's great to see the code for this at last (it looks a lot like the algorithm in Julio's poster).

    The paper seems to be missing algorithm 3, however, and could probably be a little clearer. The code could do with an automated way to build a jar - given a path to the CDK.

    As for a C-to-Java port of nauty: heh, that would be nice, yes. Useful for InChI as well.