Tuesday, March 30, 2010

Cleaner CDK Code #4: inheriting JavaDoc from super classes and interfaces

When you write a class implementing an interface or extending a super class, it is often the case that the API is identical. It would be nice to inherit the JavaDoc documentation, which is possible.

Inheriting JavaDoc
Java methods that override a superclass method or implement an interface method can inherit JavaDoc by using this JavaDoc comment for that method:
/** {@inheritDoc} */
public String getMIMEType() {
  return null;
}
The JavaDoc documentation notes that missing @param and @return values have been inherited implicitly since JavaDoc 1.3, but I have never noticed this with Java 6. The above explicit markup is confirmed to work.
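To make the pattern concrete, here is a minimal sketch of an interface whose method is documented once and an implementation that reuses that documentation. The names ResourceFormat and PlainTextFormat are made up for this illustration and are not CDK classes.

```java
/** A hypothetical stand-in for a file format interface (not a CDK class). */
interface ResourceFormat {

    /**
     * Returns the MIME type of this format.
     *
     * @return the MIME type, or null if none is registered
     */
    String getMIMEType();
}

/** Hypothetical implementation; the method below reuses the interface JavaDoc. */
class PlainTextFormat implements ResourceFormat {

    /** {@inheritDoc} */
    @Override
    public String getMIMEType() {
        return "text/plain";
    }
}

class InheritDocDemo {
    public static void main(String[] args) {
        ResourceFormat format = new PlainTextFormat();
        System.out.println(format.getMIMEType());
    }
}
```

Running the javadoc tool on such a file should show the description and the @return text from the interface on the implementing method.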

Monday, March 29, 2010

ACS Boston: RDF Symposium update

The deadline for abstract submission for the ACS symposium on Semantic Chemistry with the Resource Description Framework has passed. Eleven independent abstracts made the deadline. Here's what you can expect in Boston:

If you are new to Resource Description Framework and related technologies, I can recommend the book Programming the Semantic Web, which I read with pleasure.

CDK-Taverna paper published

It took a while, but the CDK-Taverna paper (doi:10.1186/1471-2105-11-159), which had been in preparation in the CDK subversion repository for some time, is now published. Christoph already wrote up a brief explanation in his blog:
    The workflow paradigm allows scientists to flexibly create generic workflows using different kinds of data sources, filters and algorithms, which can later be adapted to changing needs. In order to achieve this, library methods are encapsulated in Lego(TM)-like building blocks which can be manipulated with a mouse or any pointing device in a graphical environment, relieving the scientist from the need to learn a programming language. Building blocks, so-called workers, are connected by data pipelines to enable data flow between them, which is why pipelining is often used interchangeably for workflow.
Taverna is not the only open source workflow environment, but it has certainly gotten a lot of visibility in the eScience communities in at least The Netherlands and the UK. There exist other workflow environments with CDK nodes too, including KNIME, which has been licensed under GPL3 since version 2.1.0.

Thomas uploaded some 17 example workflows to, to give you a further idea of what the system can do. Development has slowed down considerably since Thomas finished his thesis, and if you would like to work on the CDK-Taverna project, and be the next Dr Who, please contact me, Achim or Christoph. I started experimenting with CDK nodes for Taverna in 2005 (see CDK-Taverna fully recognized), and would love to see it live on. Andreas and I made an attempt last December to port things to Taverna 2.1, and the code we worked on can be found in this GitHub repository.

Saturday, March 27, 2010

Updated my blog template

I noted today that, the blog service provider I am using, had new templates. I was getting tired of the old one anyway, so I tried the simple template using the usual orange: quite satisfactory! At least it beats buying a book like this Blogger: Beyond the Basics. I don't have time for that.

I tweaked the template a bit. For example, the default labels widget does not allow me to limit the shown labels to those with at least X uses. So, I hacked the HTML of the widget and added an extra if statement:
<b:loop values='data:labels' var='label'>
  <b:if cond='data:label.count > 6'>
    <span expr:class='"label-size label-size-" + data:label.cssSize'>
      <b:if cond='data:blog.url == data:label.url'>
        <span expr:dir='data:blog.languageDirection'><data:label.name/></span>
      <b:else/>
        <a expr:dir='data:blog.languageDirection' expr:href='data:label.url'><data:label.name/></a>
      </b:if>
      <b:if cond='data:showFreqNumbers'>
        <span class='label-count' dir='ltr'>(<data:label.count/>)</span>
      </b:if>
    </span>
  </b:if>
</b:loop>

Pages
I also moved some elements around. Also nice is the new Pages concept. I wish I could hide the sidebar on these pages, but I am currently fairly happy with the ability to embed my homepage:

Internet Explorer 6 EOL
The new template does not work with Internet Explorer 6.0. Honestly, I see no reason why you would want to run that browser anyway, but now you can no longer use it to read my blog. Just upgrade, and complain to your IT department if you cannot do it yourself. There are not so many of you, though. Only 18.95% of visitors use Internet Explorer, of which about 20% still use 6.0:

Actually, of the 73 visits with IE6 in the past 30 days, only 12 were hits from regular visitors. So, could I please ask this one visitor to email me offline if upgrading is not an option?

Friday, March 26, 2010

Cleaner CDK Code #3: run the PMD tests

PMD is a tool to run some tests against your source code. These check for code style, common problems, and places where code could be improved. The CDK has been using it for years now, such as here for CDK 1.3.x.

Running the PMD tests from the command line
When you are writing patches for the CDK, you can run the PMD tests via an Ant file, for example via the command line:
$ ant -f pmd.xml
However, when working on a single file, you will likely appreciate running the tests against a single module. This can be done with (for the data module):
$ ant -f pmd.xml -Dpmd.test=custom -Dmodule=data test-module
The custom.xml defines the tests we normally run.

The pmd.xml does not create HTML pages, like Nightly does. Instead, an XML file is currently created. The xpath utility can be used to filter out the information we are interested in. For example, if we want reports just about DefaultChemObjectBuilder, we issue:
$ xpath -e "//violation[@class='DefaultChemObjectBuilder']" \\
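The same kind of filter can also be sketched in Java with the XPath API from the standard library. The report content below is made up for illustration and only loosely modelled on a PMD XML report; the class name PmdReportFilter and the method countViolations are hypothetical and not part of the CDK build.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PmdReportFilter {

    /** Counts the violation elements whose class attribute equals className. */
    static int countViolations(String reportXml, String className) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Same expression shape as used with the xpath command line utility.
        String query = "//violation[@class='" + className + "']";
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                query,
                new InputSource(new StringReader(reportXml)),
                XPathConstants.NODESET
            );
            return hits.getLength();
        } catch (XPathExpressionException exception) {
            throw new IllegalStateException("Could not evaluate " + query, exception);
        }
    }

    public static void main(String[] args) {
        // Made-up report content, for illustration only.
        String report =
            "<pmd>"
          + "<file name='A.java'>"
          + "<violation class='DefaultChemObjectBuilder' rule='R1'>msg</violation>"
          + "<violation class='Atom' rule='R2'>msg</violation>"
          + "</file>"
          + "</pmd>";
        System.out.println(countViolations(report, "DefaultChemObjectBuilder"));
    }
}
```

A real report may declare an XML namespace, in which case the expression would need adjusting.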

Monday, March 22, 2010

Oxford, August 2010: eCheminfo Predictive ADME & Toxicology 2010 Workshop

The first week of August I will attend the eCheminfo Predictive ADME & Toxicology Workshop (LinkedIn Event) for which I received a Bursary Award. It will be my first time in Oxford, and I am very much looking forward to it!

The meeting is also bound to be fun. I have not done much in the area of toxicology other than the more general QSAR/QSPR model building with chemometrics. But I have recently been talking to Nina and others in the OpenTox community, and started to play a bit with the data and computation API they are developing.

I started a Bioclipse plugin recently (see screenshot), and placed the source code in this bioclipse-opentox Git repository on Gitorious (my GitHub account is already over the formal limit). The functionality is still quite limited, and the manager currently only provides methods to download data sets (myexperiment:1192):
// query a service using the OpenTox API 1.1
// See:

var service = "";

var datasets = opentox.listDataSets(service);
for (set=0; set<datasets.size(); set++) {
  var dataset = datasets.get(set);
  js.say("Downloading set: " + dataset);
       service, dataset, "/OpenTox/ambit" + dataset + ".sdf"
Behind this plugin is again the RDF plugin. As OpenTox uses RDF too, a few simple SPARQL queries were all that needed to be defined. And again, the Bioclipse plugin code base is pretty small.

Wednesday, March 17, 2010

Cleaner CDK Code #2: String.contains() and logger messages

Second in the series (see #1), with two rather small tips.

Use String.contains() instead of String.indexOf("foo") != -1
Java 5 introduced the method public boolean contains(CharSequence s), which can replace the more cryptic use of indexOf() != -1.

Instead of:
System.getProperty("java.version").indexOf("1.6") != -1
you can write:
System.getProperty("java.version").contains("1.6")
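As a small sketch, the two styles can be put side by side to check that they give the same answer. The class and method names here are made up for the illustration.

```java
class ContainsDemo {

    /** Old style: a match is signalled by any index other than -1. */
    static boolean matchesWithIndexOf(String text, String part) {
        return text.indexOf(part) != -1;
    }

    /** Java 5 style: the intent can be read directly from the call. */
    static boolean matchesWithContains(String text, String part) {
        return text.contains(part);
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        // Both styles must agree, whatever the version string is.
        System.out.println(
            matchesWithIndexOf(version, "1.6") == matchesWithContains(version, "1.6")
        );
    }
}
```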
More efficient use of the LoggingTool
Quite a long time ago, Jmol developer Miguel introduced me to a nice performance hint with respect to using logging tools. Each debug(), info(), warn(), etc. method should accept more than one parameter, so that the objects are concatenated only when debugging (or the relevant log level) is turned on. It indeed gave a considerable performance boost. The CDK supports this too, so you should not concatenate Strings and other objects yourself, but let the LoggingTool do that.

Instead of:
logger.debug(
  "\n" + paths.size() + " paths and " +
  ac.getAtomCount() + " atoms left."
);
you can write:
logger.debug(
  "\n", paths.size(), " paths and ",
  ac.getAtomCount(), " atoms left."
);
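Why this helps can be shown with a minimal sketch of a logger that takes its message in parts. LazyLogger is a hypothetical class written for this illustration and is not the CDK LoggingTool; it only assumes that the real tool postpones concatenation in a similar way.

```java
/** Hypothetical logger, written only to illustrate deferred concatenation. */
class LazyLogger {

    private final boolean debugEnabled;
    private final StringBuilder output = new StringBuilder();

    LazyLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    /** Concatenates the message parts only when debugging is switched on. */
    void debug(Object first, Object... rest) {
        if (!debugEnabled) {
            return; // no toString() calls and no string building at all
        }
        StringBuilder message = new StringBuilder(String.valueOf(first));
        for (Object part : rest) {
            message.append(part);
        }
        output.append(message).append('\n');
    }

    /** Returns everything logged so far. */
    String getOutput() {
        return output.toString();
    }
}

class LazyLoggerDemo {
    public static void main(String[] args) {
        LazyLogger logger = new LazyLogger(true);
        logger.debug("\n", 3, " paths and ", 12, " atoms left.");
        System.out.print(logger.getOutput());
    }
}
```

Note that the arguments themselves are still evaluated by the caller; what is saved when debugging is off is the conversion to text and the building of the message String.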

Monday, March 15, 2010

RDF-powered QSAR wizard: SPARQL end points providing wizard content

As you know from my blog, one of the things I am working on is to push RDF functionality in Bioclipse, as I believe it to be the missing link between molecular chemometrics and literature, databases, and other non-numerical information sources.

As part of the submission for the SWAT4LS special issue in the new Journal of Biomedical Semantics, Ola hacked up a cool wizard that sets up a new QSAR Project by downloading data directly from our RDF node for the chEMBL data using SPARQL. The paper is based on the SWAT4LS talk I gave, and the proceedings paper that recently appeared. But with more cool stuff, such as this cool RDF graph browser that allows you to open up molecules from the RDF graph in a JChemPaint editor.

Well, this really nice New QSAR Project wizard was cool enough to trigger an I-want-more reaction, so I just had to hack it up with some additional SPARQL functionality. So, the next version does not only use RDF and SPARQL to aggregate the QSAR data set, it also uses SPARQL to make the wizard interactive. While the user is typing a target ID, the wizard will check the SPARQL end point in the background and download the target's type, title and organism, as well as update the list of activities the user can select depending on what the chEMBL database has for that target:

The actual code base is pretty small, and that's what happens when you mash up the right technologies :)

Sunday, March 07, 2010

Cleaner CDK Code #1: List and the for-each loop

In a desperate attempt to force me to write on my CDK code snippet book, I'm going to write some code tips to create clear code. Hopefully, this is useful for people writing patches and reviewers alike, too.

Use List<IAtom> instead of the untyped List
Quite some time ago, the Java language introduced typed lists. These lists can contain only objects of a particular type, which is a very common use case. Indeed, the CDK has quite a few lists that are strongly typed. Typing the list prevents you from accidentally adding something of the wrong type, and also reduces the amount of casting, so that your code becomes cleaner.

Instead of:
List atoms = atomContainer.atoms();
for (int i=0; i<atoms.size(); i++) {
  IAtom atom = (IAtom)atoms.get(i);
  // .. do something
}
you can do:
List<IAtom> atoms = atomContainer.atoms();
for (int i=0; i<atoms.size(); i++) {
  IAtom atom = atoms.get(i);
  // .. do something
}
If you do not need the index, use a for-each loop
When iterating over atoms in a list, you sometimes need to know the index, for example, to compare the IAtom with that at the same position in another list. However, when this is not needed, you can use the Java for-each loop instead. This will further simplify the above code to:
for (IAtom atom : atomContainer.atoms()) {
  // .. do something
}
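For readers without the CDK at hand, here is a self-contained sketch of both tips using plain Strings as stand-ins for IAtom objects. The class TypedListDemo and its method are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TypedListDemo {

    /** Joins element symbols with a for-each loop: no index and no cast. */
    static String joinSymbols(List<String> symbols) {
        StringBuilder joined = new StringBuilder();
        for (String symbol : symbols) {
            joined.append(symbol);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<String> symbols = new ArrayList<String>();
        symbols.add("C");
        symbols.add("O");
        // symbols.add(Integer.valueOf(8)); // would not compile: wrong element type
        System.out.println(joinSymbols(symbols));
    }
}
```

The commented-out line is the kind of mistake the typed list turns into a compile error, where the untyped list would only fail at run time on the cast.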

Saturday, March 06, 2010

OOChemistry 0.1 released: call for participation

Konstantin released OOChemistry 0.1 and sent this email to the cdk-jchempaint mailing list (I added a few links and an extra newline):
    Hello, colleagues!

    I'd like to announce first alpha version of OOChemistry. It is an extension for which provides cross-platform OLE-like integration of OOo with JChemPaint chemical diagram editor. With OOChemistry you can draw structure, embed into document (text or presentation) and than double click and edit whenever you want on any platform having and Java Runtime (Windows, Linux, Mac OS X, other Unix flavours). Remember that it's only alpha and is not recommended for production use (e.g., compatibility with futher versions is not guaranteed).

    Known bugs:, (please, submit if you found other issues!)
    More about project, contact info, how to contribute:
    Direct link to download:

    OOChemistry needs your help! Experience in Java, in development of projects dealing with JChemPaint/CDK, or in development of extensions will be highly appreciated. Of course, you can help not only in coding, but also in translation of interface and writing docs.


Your feedback as well as coding contributions are very much appreciated! I am excited about seeing chemical editing facilities in, and while the integration is not as good as Chem4Word, it is something I can run on my Linux system.

Friday, March 05, 2010

CDK 1.3.3: the changes

The CDK 1.3.3 release does not contain overly many patches, but contains a few interesting ones:
  • Updated JavaDoc to explicitly state that g2 must be a substructure of g1 2660aca
  • More unit tests for the MCSS problem in bug report 2944080. 3d18a73
  • Simplified the code using the new 'T read(T)' API used in MDLV2000Reader as defined by the ISimpleChemObjectReader af12e8a
  • Updated for the new generics 'T read(T)' API in ISimpleChemObjectReader. d3f2f19
  • Introduced generics allowing the return type to be identical to the passed argument. It does require implementing classes to be updated with the new API too. 9775992
  • Added missing dependency, fixing the unit test reading a file from data/ b4a6dfa
  • added working implementations for PartialFilledStructureMerger and CrossoverMachine c086a97
  • added working implementations for PartialFilledStructureMerger and CrossoverMachine 089103e
  • tests for crossover machine and PartialFilledStructureMerger 6441b75
  • tests for crossover machine and PartialFilledStructureMerger 5481be7
  • added dependency 3b0b56d
  • Fixed use of global isRef variable, to make it threading-safe e576e0b
  • Added control 'isref' creating a CML with reaction and listmolecules d9d20c2
  • Removed unused import e1c03fb
  • Removed last bits of implementation details from the API: now uses List<> instead of ArrayList<> 7727b72
  • Removed output to STDOUT 14e1d12
  • Fixed some spelling errors and added JavaDoc links 677b3f6
  • Synchronized behavior with the MDLV2000Reader (addressing bug #2942196) 2ceef95
  • Added missing @cdk.bug tag and used interfaces where possible 04001b7
  • Added a test case for GeometryTools.has2DCoordinatesNew where a mol file has a single atom with 0,0,0 as coordinate. This is not considered a 2d coordinate right now, but in a way it is one. f5e123e
  • added a method to make cyclopentane to MoleculeFactory 0064458
  • Added missing unit test for getClosestAtom(double, double, IAtomContainer, IAtom) 95f811a
  • Improve performance: to find the closest atom, we do can simply use the squared distances. The smaller than relation is equivalent in normal and squared distance space. 46b5f83
  • Added a unit test to see of the calculated bond length average includes bonds in all IAtomContainer's 563fe28
  • Added second test for getClosestAtom(), now with more than two atoms 5724205
  • Added E and Z as allowed configurations 7a1b919
  • Added UP_OR_DOWN_INVERTED, which is the equivalent of UP_OR_DOWN but with a different stereocenter f175650
  • Extended JavaDoc, explaining how these IBond.Stereo types define the stereocenter, and indicating for each type explicitly which atom is the stereocenter 35af889
  • Added convenience method to find the closest atom to a given point. 8d5ee5c
  • unified the layout at cleanup and loading of molecules f071090
  • Reimplemented shiftReactionVertical(IReaction, Rectangle2D, Rectange2D, double) originally implemented as jchempaint-primary patch 1cab8c3c9350ada9b1d054712189720c865e502a by Stefan Kuhn: now reuses other methods (fixing the movement of the reaction agents), and added missing unit tests 230b7e1
  • Added a getBondLengthAverage(IReaction reaction) method, a rewritten version of 1cab8c3c9350ada9b1d054712189720c865e502a by Stefan Kuhn, and the matching unit test d37498a
  • Add a GeometryTools method to get atoms near another atom (Gilleain). Added unit test for the new method (Egon). 420ab11
  • Moved IAtomColorer and ICDKChangeListener from the standard module to the interfaces module fdbadae
  • Updated to include the float and binary information found in PubChem 8eacbfc
  • Ant has a release 1.8 that should be accepted in build.xml 4398cc4
In a brief summary, this release mostly focuses on applying a number of small bug fixes and patches. But there are some things of interest: Stefan is working on structure generation and rewrote PartialFilledStructureMerger and CrossoverMachine. I introduced some generics magic in the reader API which I learned from Arvid in the CDK-JChemPaint patch. This patch removes the need to cast when reading an IChemObject from a file in the readers which have been updated (MDLV2000Reader only at this time). Instead, you can now just type:
IMolecule mol = reader.read(new Molecule());
The list of patches furthermore contains an update of the PubChem reader to support reading of additional fields, and the support of the CML @ref attribute in CMLReact (doi:10.1021/ci0502698).
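How a generic 'T read(T)' method removes the cast can be sketched with a few stand-in types. DemoChemObject, DemoMolecule, DemoReader and TitleReader are hypothetical names for this illustration; the actual CDK interfaces may be declared differently.

```java
/** Hypothetical stand-in for a chemistry object interface. */
interface DemoChemObject { }

/** Hypothetical stand-in for a molecule class. */
class DemoMolecule implements DemoChemObject {
    String title = "";
}

/** Hypothetical reader interface with a generic 'T read(T)' style method. */
interface DemoReader {
    <T extends DemoChemObject> T read(T object);
}

/** Toy reader that only stores its content as the molecule title. */
class TitleReader implements DemoReader {

    private final String content;

    TitleReader(String content) {
        this.content = content;
    }

    public <T extends DemoChemObject> T read(T object) {
        if (object instanceof DemoMolecule) {
            ((DemoMolecule) object).title = content.trim();
        }
        return object; // the return type equals the type of the argument
    }
}

class GenericReadDemo {
    public static void main(String[] args) {
        DemoReader reader = new TitleReader(" ethanol ");
        DemoMolecule mol = reader.read(new DemoMolecule()); // no cast needed
        System.out.println(mol.title);
    }
}
```

With a non-generic signature returning the base type, the assignment in main would need an explicit cast to DemoMolecule.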

But the most interesting bit of this release, to me, is that the last few patches are now reviewed and applied to make CDK-JChemPaint compile against an off-the-shelf CDK release (1.3.3 or higher :).

26  Egon Willighagen
 9  Stefan Kuhn
 1  Brian Gilman
 1  Mark Rynbeek
 1  Miguel Rojas Cherto
 1  Gilleain Torrance
This is a new category too, and was created using the command git log cdk-1.3.2.. | grep Signed-off | cut -d':' -f2 | cut -d'<' -f1 | sort | uniq -c. Not every reviewer signs off commits, and no one other than the current commit right owners actually does this. Everyone is more than invited to check the patch tracker and review patches: give comments if you feel the patch can be improved, or sign it off otherwise (git commit --amend --signoff), which gives the other reviewers some idea of the state of the patch. Rajarshi did most of the reviewing work of this release; his contributions are very much appreciated.
23  Rajarshi Guha
 4  Egon Willighagen

Thursday, March 04, 2010

RDF, Jena, Bioclipse, Eclipse, Zest #2: icons and an extension point

Jonathan worked this week on new features for the Bioclipse RDF editor (see these two earlier items). This version still does not edit, but only displays using Zest. Jonathan created for me an extension point so that anyone can make the editor aware of domain objects, by simply registering the extension implementation along with the rdf:Class URI of the rdf:type of an object. This fixes the problem of having to hardcode dependencies of the RDF editor on all the domain code, as was the case earlier.

For example, the cheminformatics IMolecule object is now linked to the rdf:type <>:
<extension point="net.bioclipse.rdf.rdf2bioobjectfactory">
    uri="" >
The API for this factory looks like:
public IBioObject rdfToBioObject( Model model, Resource res );
public ImageDescriptor getImageDescriptor();
This is very much tied into the Jena data model, so not entirely clean, but it will have to do for now. The first method converts RDF content into a Bioclipse IBioObject, such as an IMolecule (see this list of currently supported objects). The second method returns an icon, which makes the editor more visually pleasing, and provides a nice way to see when you can double click the RDF node to have it open in a domain-specific editor:
For example, double clicking the ron:mol2 node would open up a JChemPaint editor.