Tuesday, December 30, 2008

Editing and Validation of CML documents in Bioclipse

One advantage of using XML is that one can rely on good support in libraries for functionality. When parsing XML, one does not have to take care of the syntax, and focus on the data and its semantics. This comes at the expense of verbosity, though, but having the ability to express semantics explicitly is a huge benefit for flexibility.

So, when Peter and Henry put their first documents online about the Chemical Markup Language (CML), I was thrilled, even though is actually was still SGML when I encountered it. The work predates the XML recommendation. As I recently blogged, in '99 I wrote patches for Jmol and JChemPaint to support CML, which were published as preprint in the Chemical Preprint Server in a paper in 2000 in the Internet Journal of Chemistry. Neither of the two has survived.

Anyway, the Chemistry Development Kit makes heavy use of CML, and Bioclipse supports it too. Now, Bioclipse is based on the Eclipse Rich Client Platform architecture, for which there exist quite a few XML tools in the Web Tools Platform (WTP). Among these, a validation, content assisting XML editor. This means, I get red markings when I make my XML document not-well-formed or invalid. Just a quick recap: well-formedness means that the XML document has a proper syntax: one root node, properly closed tags, quotes around attribute values, etc. Validness, however, means that the document is well-formed, but also hierarchically organized according to some specification.

Enter CML. CML is such a specification, first with DTDs, but after the introduction of XML Namespaces with XML Schema (see There can be only one (namespace)). The WTP can use this XML Schema for validation, and this is of great help learning the CML language. Pressing Ctrl-space in Bioclipse will now show you what allowed content can be added at the current character position.

Yes, Bioclipse can do this now (in SVN, at least). This has been on my wishlist for at least two years now, but never really found the right information. Now, three days ago David wrote about End of Year Cramps in which he describes some of his work on the WTP for autocomplete for XPath queries. He see[s] a brighter future for XML at eclipse over the next year. I hope that those in the eclipse and XML community will help to continue to improve the basic support, so that first class commercial quality applications that leverage this support can continue to be built.

That was enough statement for me to ask in the comments on how to make the WTP XML editor aware of the CML XML Schema. It already picked up XML Schema's with xsi:schemaLocation, but I needed something to worked without such statements in the XML document itself. David explained that me that I could use the org.eclipse.wst.xml.catalog extension. This was really easy, and commited to Bioclipse SVN as:
<uri name=""
However, that does not make the WTP XML editor available in the Bioclipse application yet. Not ever in the "Open With"... So, I set up a CML Feature. After a follow up question, it turned out that the CML content type of Bioclipse was already a sub type of the XML type (see ):
name="Chemical Markup Language (CML)"
So, the only remaining problem was to actually get the WTP XML editor as part of the Bioclipse application. The new CML Feature takes care of that (I hope the export and building the update site work too, but that's yet untested), by important the relevant plugins and features. Last night, however, I ended up with one stacktrace which gave me little clue on which plugin I was still missing.

Therefore, I headed to #eclipse and actually met David of the blog that started this again. He asked nitind to think about it too, and they helped me pin down the issue. This relevant bit of the stacktrace turned out to be:
Caused by: java.lang.IllegalStateException
at org.eclipse.core.runtime.Platform.getPluginRegistry(
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry$ResourceFactoryRegistryReader.(
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry.(
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry.(
... 37 more
This refered to this bit of code of Eclipse'
Bundle compatibility = InternalPlatform.getDefault()
if (compatibility == null)
throw new IllegalStateException();

So, the plugin I turned to to have missing was org.eclipse.core.runtime.compatibility. Apparently, some parts of the WTP that the XMLEditor is using, still uses Eclipse2.x technology.

This screenshot shows the WTP XMLEditor in action in Bioclipse on a CML file. It shows the document contents with the 'Design' tab, which also shows allowed content, as derived from the XML Schema for CML. Also, note that the Outline and Properties view automatically come for free, which allows more detail and navigation of the content.

This screenshot shows the 'Source' tab for the same file, where I deliberately changed the value of the @id attribute of the first atom. The value does not validate against the regular expression defined in the CML schema for @id attribute values. It also shows the content assisting in action. At any location in the CML file, I can hit Ctrl-Space, and the editor will show me which content I can add at that location.

This makes Bioclipse a perfect tool to craft CML documents and learn the language.

11 Years of Debian

11 years ago, a day more or less, I bought an the special issue of CHIP which shipped Debian 1.3.1. I think I've tried SuSe and RedHat earlier that year, but this Debian release made me switch away from proprietary products 98% (taxes I still had to do with Windows98). Right now, I am mostly running Ubuntu, which leans heavily on the work of the Debian project.

I celebrated by installing a prerelease of Lenny, Debian's next stable release, but still testing now, in a virtual box with VirtualBox. Works like a charm, and will allow me in 2009 to finally pick up some packaging work for Debian, and maybe, finally, get Jmol available in Debian main.

Friday, December 26, 2008

State of CDK 1.2.0...

The reason why I have not blogged in more than two weeks, was that I was hoping to blog about the CDK 1.2.0 release. This was originally aimed at September, slipped into October, November and then December. There were only three show stoppers (see this wiki page), one of which the IChemObject interfaces were not properly tested.

The problem was that the unit tests for the methods in superinterfaces were not applied to implementations of subinterfaces. For example, the unit test for IElement.getSymbol() was not applied to the class Atom, which implements IAtom which is a subinterfaces of IElement.

In fixing this, I had to take some hurdles. For example: the unit test classes used a set up following the implementations; CDK 1.2.x has three implementations of the interfaces: data, datadebug and nonotify. The last does not send around update notifications, and rough tests indicate it is about 10% faster. The second implementation sends messages to the debugger for every modification of the data classes, which is, clearly, useful for debugging purposes.

However, the JUnit4 test classes were basically doing the same. The unit test DebugAtomTest inherited form AtomTest, and only overwrote customizations. AtomTest, itself, inherited from ElementTest. That's where things got broken. In the single implementation set up, this would have been fine, but to allow testing of all three implementations, getBuilder() had to be used.

And when I implemented that, I did not realize that ElementTest would do a test like:

IElement element = builder.newElement();
// test IElement functionality
However, while the use of builder ensure testing of all three implementations, it does not run these tests on IAtom implementations.

The followed a long series of patches to get this fixed. One major first patch, was to define unit test frameworks like AbstractElementTest which formalized running unit tests on any implementation, as I noticed that quite a few tests were still testing one particular implementation. This allowed DebugElementTest to extend AbstractElementTest, instead of ElementTest, which would now extend AbstractElementTest too.

OK, with that out of the way, it was time to fix running the unit test for IElement.getSymbol() on IAtom.getSymbol(), which required the removal of the use of IChemObjectBuilder implementations. So, I introduced newChemObject() which would return a fresh instance of the actually tested implementation. That is, DebugAtomTest would return a new DebugAtom, and the getSymbol() test would now run on DebugAtom and not DebugElement. Good.

No, not good. The actual implementation I was using, looks like:

public class DebugElementTest extend AbstractElementTest {
@BeforeClass public static void setup() {
setChemObject(new DebugElement());

public abstract class AbstractElementTest extend AbstractChemObjectTest {
@Test public void testGetSymbol() {
IElement element = (IElement)newChemObject();
// do testing

public abstract class AbstractChemObjectTest {
private IChemObject testedObject;
public static setChemObject(IChemObject object) {
this.testedObject = object;
public IChemObject setChemObject(IChemObject object) {
return (IChemObject)testedObject.clone();
} // just imagine it has try/catch here too

// and here the tests for the IChemObject API
@Test public void testGetProperties() {
IChemObject element = (IChemObject)newChemObject();
// do testing
Excellent! No.

Well, yes. The above system works, but made many unit tests fail, because of bugs in clone() methods. The full scope has to be explored, but at least IPolymer.clone() is not doing what I would expect it to do. Either I am wrong, and need to overwrite the clone unit tests of superinterfaces in AbstractPolymerTest, or the implementations needs fixing. I emailed the cdk-devel mailing list and filed a bug report. But having about 1000 unit tests fail, because of clone broken, is something I did not like. For example, as it makes bug fixing more difficult.

So, next step was to find an approach that did not require clone, but give some interesting insights in the Java language. JUnit4 requires the @BeforeClass method to be static. This means I cannot have a non-static DebugElementTest method return an instance. And, you cannot overwrite a static method! That had never occured to me in the past. DebugElementTest.newChemObject() does not overwrite AbstractChemObjectTest.newChemObject which is somewhere upstream.

But, after discussing matters with Carl, I ended up with this approach:

public abstract class AbstractChemObjectTest extends CDKTestCase {
private static ITestObjectBuilder builder;
public static void setTestObjectBuilder(ITestObjectBuilder builder) {
AbstractChemObjectTest.builder = builder;
public static IChemObject newChemObject() {
return AbstractChemObjectTest.builder.newTestObject();

public interface ITestObjectBuilder {
public IChemObject newTestObject();

public class DebugAtomTest extends AbstractAtomTest {
@BeforeClass public static void setUp() {
setTestObjectBuilder(new ITestObjectBuilder() {
public IChemObject newTestObject() {
return new DebugAtom();

Monday, December 08, 2008

Peer reviewed Cheminformatics #2: Code review for the Chemistry Development Kit

Peer review is an important component of open source development, and recently there was the discussion the other way around, if open source is required for peer review. Depends on your definition of peer review: No, if you restrict peer review to what it is in publishing (see Re: Open Source != peer review); Yes, if we really want to speed up cheminformatics evolution and assume unrestricted, open peer review where reviewers can openly publish there review report with all the greasy details (see Peer reviewed Chemoinformatics: Why OpenSource Chemoinformatics should be the default).

The CDK has a strong history of peer review. Patches have been available from SVN from the start, and later we instantiated a mailing list so that people could easily monitor code changes, and I have actually being doing this since the start, scanning the code patches, knowing that a lot of code is backed up by unit tests to detect regressions. Anyone can review CDK code in this manner, just by subscribing to the cdk-commits mailing list. If one has questions or comments on a patch, a reply to cdk-devel is all that is needed to get things going.

About a year ago, CDK development had become so extensive that code review in this manner was no longer the way forward (though still possible, and still used). However, it turned out that it was all too easy to overlook a patch or just click it away in busy times. This was experienced by some developers who previously monitored the cdk-commit messages sketched above. So, we moved to a more formal patching system where any non-trivial patching is done in a SVN branch. Once the primary developer is happy about the branch, (s)he requests a review by other developers. These can leave comments in the source code, reply to the mailing list, or leave comments in the CDK patch tracker. This more formal work habit got into action about half a year ago already.

A recent message from Stefan makes clear that this tracker has some room for improvements. For example, there is no automatic email to cdk-devel when a patch has not been tended to for a longer period of time. And, I do not see a simple way of doing this with the SourceForge bug track system.

But, what I can do, is define a number of groups to represent the state of the patch. So, I defined:
  • Needs Review: this patch has not been reviewed (sufficiently) yet
  • Accepted: but not yet applied to SVN yet. When applied, the patch report is simply closed
  • Needs Revision: the reviewers like to see changes made to the patch
Not perfect, but a step forward in tracking the state of patches.

Friday, December 05, 2008

Cheminformatics Benchmark Project #1

Yesterday's blog about Who says Java is not fast?!? caused quite some feedback (thanx to all commenters!) with several good points. Of course, a table like that in the cinfony paper (see also the comments in the blogs by Noel (the author) and Rich). Many things determine why the CDK might be fastest in that table for SDF iterating. Suggestions have been that OpenBabel and RDKit may be doing much more than simple reading; Java might actually take advantage of the second core for caching file content.

ZZ observed something I overlooked: calculating the molecular mass in CDK is by far slowest of all three toolkit, though people have suggestions on why that may is.

The correct way to compare toolkits, open source, proprietary, free, commercial, is to have a proper benchmark toolkit for cheminformatics. That's what I am suggesting here: a project to define simple and fair benchmarks. It's an open project, and anyone can contribute in order to keep tests balanced in impartial towards any tested toolkit.

Thursday, December 04, 2008

Who says Java is not fast?!?

While performance tests actually show that for even core numerical calculations Java is at par with C in terms of speeds, and sometimes even hits Fortran-like speeds, people keep think that Java is not fast. I only invite you to test that yourself.

Meanwhile, I would like to take the opportunity to advertise Noel's cinfony paper in CCJ (doi:10.1186/1752-153X-2-24) which features these speed measurements (from the paper, CC-BY license): I have to say that these numbers surprised me, as the CDK is hardly optimized for speed at al...

Short variables and lack of comments...

... a source code reviewer nightmare. The must-read ran a nice open letter to a Linux kernel developer. I'd like to cite this bit about code review (see also Re: Open Source != peer review):
    [Andrew Morton] had a number of concrete requests - such as documenting the user-space ABI and the network protocol - which have not been satisfied. He also asked for better code documentation in general:
      So please. Go through all the code and make it tell a story. Ask yourself "how would I explain all this to a kernel developer who is sitting next to me". It's important, and it's an important skill.
This is important indeed! This is also why CDK quality assurance tends to complain about short variables. While an for-next index i is clear enough, ac for an IAtomContainer is quite useless, as it does not explain what the purpose of the container is. BTW, a longer name like atomContainer does not really help here. Maybe I will wrote a unit test for that...