Wednesday, December 28, 2005

The good, the bad and the ugly molecules

Derek Lowe is the author of the blog In the Pipeline, which is really fun to read. Derek works in the pharmaceutical industry and gives great insight into how things work in that field of molecular sciences. Yesterday he blogged about What Makes an Ugly Molecule?, touching on the Rule-of-Five, the hydrochloric acid bath (aka stomach), and other reasons that make molecules ugly.

But there are many other interesting posts, and, something that my blog still lacks, comments by many users, discussing the ideas he posts, making his blog even nicer.

Tuesday, December 27, 2005

Knoppix saves the day...

After the three obligatory days of Christmas holidays (fun, especially with two children, but very exhausting), it is time to get back to business again. I'm still at my father-in-law's place with only XP installed, so I booted the Knoppix 4.0.2 DVD I burned last Friday. Eclipse is not working, but being able to use Kmail to read my email again is just what you need as an internet junkie. A computer is just not complete without a nice KDE session hanging around.

Anyway, I started Eclipse on my computer at work, and tunneled the window over SSH. Not overly fast, but it seems to run fine. (If only I knew how to set up NX on that Kubuntu breezy system!) Let's see if I can get the CDK bug count somewhat lower.

Friday, December 23, 2005

Subset selection: mind the complexity

In a recent JCIM article, Schuffenhauer compares a few subset selection methods, and notes that some of them reduce the average complexity of the molecules. They put this in relation to other research that states that lead compounds with high complexity have higher activities. Recommended reading material for the holidays.

Sunday, December 18, 2005

StatCVS on CDK

One of the Classpath developers pointed me to their CVS statistics when I asked them how actively their project is currently developed, i.e. the number of active developers.

The pages are generated with StatCVS, so I ran it on the CDK too:

I knew I did a lot of work on the CDK, but never realized that 62.7% of the commits were mine! Keep in mind, though, that a lot of these commits are for code maintenance! Next in line are steinbeck and rajarshi. In total 28 people committed patches to CVS, though other people contributed patches too, which were committed by a developer with write access. There is a jump in the commit messages somewhere this summer, which I think is the move of the data directory from cdk/data to cdk/src/data.

The full analysis results can be found here. It was generated with the StatCVS version in sid, and I will rerun it soon with a more recent StatCVS version.

Friday, December 16, 2005

CDK Debug classes and fixing the ModelBuilder3D bug

For some weeks now I have been thinking about bug 1309731 : "ModelBuilder3D overwrites Atom IDs". The ModelBuilder3D is a complex piece of source code, reusing many other parts of the CDK, including atom type perception.

Somewhere in October, however, I found that Taverna could not create 3D models and convert these into reasonable CML because the Atom ID's were messed up. So the question is, where did the ModelBuilder3D do this? Did it do this itself, or is it done by one of the other pieces of CDK that it uses? But due to the complex nature of this algorithm, it quickly became clear that looking at the code was not going to solve it; there was too much code to look at.

The solution was clear to me: use the new data interfaces. To identify where the IDs were messed up, I only needed to write a DebugAtom class with a method that looked like:

public void setID(String identifier) {
    logger.debug("Setting ID: ", identifier);
    super.setID(identifier);
}

And I would immediately see at what stage the ID was overwritten.
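As a minimal, self-contained sketch of this idea (not the actual CDK code): the classes `SimpleAtom` and `TracingAtom` below are hypothetical stand-ins, and the in-memory trace list replaces the CDK logger. The debug subclass records every ID change, plus the calling method when the stack trace provides one, and then delegates to the normal implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a data class; this is not the CDK's Atom. */
class SimpleAtom {
    private String id;

    public void setID(String identifier) {
        this.id = identifier;
    }

    public String getID() {
        return id;
    }
}

/** Debug variant: records every ID change, then delegates to the parent. */
class TracingAtom extends SimpleAtom {
    private final List<String> trace = new ArrayList<>();

    @Override
    public void setID(String identifier) {
        // Frame 0 is this method; frame 1, if present, is the caller.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        String caller = stack.length > 1
                ? stack[1].getClassName() + "." + stack[1].getMethodName()
                : "unknown";
        trace.add("Setting ID: " + identifier + " (called from " + caller + ")");
        super.setID(identifier);
    }

    public List<String> getTrace() {
        return trace;
    }
}

class TracingAtomDemo {
    public static void main(String[] args) {
        TracingAtom atom = new TracingAtom();
        atom.setID("carbon1");
        atom.setID("C"); // the unwanted overwrite shows up in the trace
        for (String line : atom.getTrace()) {
            System.out.println(line);
        }
    }
}
```

Because the debug class only adds bookkeeping around the `super` call, code that works with the normal class should behave the same with the tracing one.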

So I started this week to implement the DebugAtom and related classes. By extending Atom, I could just add debugging stuff and reuse the code in that class. However, the DebugAtom can not extend DebugAtomType too then. And this is a pity, because all methods inherited by the Atom interface from AtomType, Isotope, Element and ChemObject interfaces could not be inherited from the DebugAtomType class. Instead, they now have to duplicate those bits of code.

This is not a clean solution, as duplicate code is a known cause of bugs. So, the next step was to write JUnit tests for the new debug classes. And for this I wanted to reuse, i.e. extend, the tests for the default data classes. This required, however, changes to those test classes.

The first thing that needed to be changed was that instantiation of data classes in the tests would now have to depend on the data classes being tested. A simple

Atom atom = new Atom("C");

only makes sense when a specific Atom class is important. Fortunately, the new interfaces provide a solution for this: the ChemObjectBuilder implementations. These allow the following syntax to replace the hard coded instantiation:

Atom atom = builder.newAtom("C");

Therefore, I added a protected field to the AtomTest, which was instantiated in the setUp():

protected ChemObjectBuilder builder;

public void setUp() {
    builder = DefaultChemObjectBuilder.getInstance();
}

and use this builder to instantiate all test objects, as shown for the atom above.

And then I can simply reuse this JUnit test by defining the DebugAtomTest like:

public class DebugAtomTest extends AtomTest {

    public DebugAtomTest(String name) {
        super(name);
    }

    public void setUp() {
        super.builder = DebugChemObjectBuilder.getInstance();
    }

    public static Test suite() {
        return new TestSuite(DebugAtomTest.class);
    }
}

The sources for these debug data classes tests are found in the new cdk.test.debug package.
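The builder swap can be sketched without JUnit or the CDK. Everything below is hypothetical and only illustrates the pattern: `Particle`, `CheckedParticle`, `ParticleBuilder` and the two check classes are invented names, not CDK API. The base check class creates its objects through a builder field, and the subclass only replaces the builder in `setUp()`.

```java
/** Hypothetical minimal data class; not part of the CDK. */
class Particle {
    private final String symbol;

    Particle(String symbol) {
        this.symbol = symbol;
    }

    public String getSymbol() {
        return symbol;
    }
}

/** Hypothetical debug variant that validates its constructor argument. */
class CheckedParticle extends Particle {
    CheckedParticle(String symbol) {
        super(requireSymbol(symbol));
    }

    private static String requireSymbol(String symbol) {
        if (symbol == null || symbol.isEmpty()) {
            throw new IllegalArgumentException("symbol required");
        }
        return symbol;
    }
}

/** The factory that the checks depend on instead of a concrete class. */
interface ParticleBuilder {
    Particle newParticle(String symbol);
}

/** Checks written once, against the builder only. */
class ParticleChecks {
    protected ParticleBuilder builder;

    public void setUp() {
        builder = Particle::new;
    }

    /** Returns true when the builder-created object keeps its symbol. */
    public boolean symbolIsKept() {
        Particle particle = builder.newParticle("C");
        return "C".equals(particle.getSymbol());
    }
}

/** Reuses every check; only the builder differs. */
class CheckedParticleChecks extends ParticleChecks {
    @Override
    public void setUp() {
        builder = CheckedParticle::new;
    }
}
```

The design choice is that no check mentions a concrete data class, so every check added to the base class is inherited by the debug variant for free.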

The number of JUnit tests for the CDK jumped from around 1250 to over 1500 tests right now. And if you think these new tests only test old code, because of all the super.bla() calls in the debug classes, you're way off. I found bugs in the new debug classes, but also many class cast bugs and several other problems in the real data classes!

Anyway. Does this help fix the ModelBuilder3D bug? Yes, it does:

$ grep "Setting ID" reports/result.modeling.builder3d.ModelBuilder3dTest.txt
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: carbon1
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: oxygen1
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: C
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HC
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: O
org.openscience.cdk.debug.DebugAtom DEBUG: Setting ID: HO

This shows me where the Atom ID is overwritten to be something other than "carbon1"! I can now look at the rest of the result.modeling.builder3d.ModelBuilder3dTest.txt file to see what the ModelBuilder3D was doing at the time, and which CDK class made the setID() call.

I only needed to change this line in the JUnit test for the bug to generate the above debug lines:

Molecule methanol = new Molecule();

into:

Molecule methanol = new DebugMolecule();

Tuesday, December 13, 2005

Math libraries for Java?

I drop in on the #classpath channel of the IRC network where the #cdk channel runs too. The #classpath channel is for the Classpath project, which is developing the free Java libraries used by most open source virtual machines.

An item was mentioned: "Java Is So 90s". It led to a funny discussion about what that would make C/C++ and Fortran. A more serious question was brought up: where are the efficient and super fast Java linear algebra and complex number libraries?

There is Weka, but it is more aimed at data analysis. I believe it has support for principal component analysis, so it must have singular value decomposition. There is a book called Java Number Cruncher: The Java Programmer's Guide to Numerical Computing by Ronald Mak, 2003, Prentice Hall.

After some further asking about it on the channel, they mentioned the Apache commons math project, which seems promising. The website mentions complex numbers, linear algebra, statistics and numerical analysis, but I have not looked at the full API, so I am not sure how well populated these areas are.

Anyone with experience in the area of numerical computing and Java?

Saturday, December 10, 2005

Jumbo 5.0 and the CDK

I reported earlier that the CDK has been updated in CVS to use CML from the new Jumbo 5.0. The transition actually involved a lot of changes in the CDK, some of which I would like to address in the following comments. One thing is that CML write support (not reading!) uses the new Jumbo library, which requires Java 1.5. Thus, if Java 1.5 is not available, then CML writing should not be compiled. This is how it is done.

The JavaDoc

The CDK makes extensive use of JavaDoc taglets. CDK uses tags of type @cdk.SOMETAG. An important tag in this case is the @cdk.require tag, because it allows us to make the CDK build system aware that the class requires Java 5.0 to be compiled. Thus, we have for example this code in CVS, of which bits are:

/**
 * Serializes a SetOfMolecules or a Molecule object to CML 2 code.
 * Chemical Markup Language is an XML based file format {@cdk.cite PMR99}.
 * Output can be redirected to other Writer objects like StringWriter
 * and FileWriter.
 *
 * @cdk.module libio-cml
 * @cdk.builddepends xom-1.0.jar
 * @cdk.depends jumbo50.jar
 * @cdk.require java1.5
 */
public class CMLWriter extends DefaultChemObjectWriter {
    // ... class body omitted ...
}

As is probably clear, this class requires two jars to be present, of which jumbo50.jar itself is not required for compiling the class source code. It also shows the use of the @cdk.require tag.

The build.xml

Because the CDK still does not require Java 1.5, the CDK is supposed to be buildable with Java 1.4 (the oldest supported Java release). The Ant build.xml script is quite able to conditionally leave out compiling parts of the CDK, if configured correctly using proper JavaDoc tags, as explained earlier.

First, the build.xml checks what libraries are available for compiling certain parts of the CDK. For example, the build.xml code to check for Java 1.5 looks like:

<condition property="isJava15">
  <contains string="${java.version}" substring="1.5"/>
</condition>

Run ant info to see what is being checked for, or look at the build.xml source code for the check target.
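For readers more at home in Java than in Ant, the same test can be sketched as a plain substring check on the `java.version` system property. The class and method names below are made up for illustration; the `prefixCheck` variant is my own slightly stricter alternative and is not what the build.xml does.

```java
class JavaVersionCheck {

    /** Mirrors the Ant condition: true when the version string contains "1.5". */
    static boolean containsCheck(String javaVersion) {
        return javaVersion.contains("1.5");
    }

    /** Stricter variant (an assumption, not from build.xml): the string must start with "1.5". */
    static boolean prefixCheck(String javaVersion) {
        return javaVersion.startsWith("1.5");
    }

    public static void main(String[] args) {
        // java.version is a standard system property on every JVM.
        String current = System.getProperty("java.version");
        System.out.println(current + " -> isJava15=" + containsCheck(current));
    }
}
```

The two variants agree on version strings like 1.4.2_08 and 1.5.0_06; they only differ when "1.5" appears somewhere other than at the start.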

All compiling is done by the compile-module target, and there it in- and excludes bits of the CDK depending on the checked conditions:

<javac srcdir="${build.src}" destdir="${build}" optimize="${optimization}"
       debug="${debug}" deprecation="${deprecation}">

  <excludesfile name="${src}/java1.4+.javafiles" if="isJava13"/>
  <excludesfile name="${src}/java1.4.javafiles" unless="isJava14"/>
  <excludesfile name="${src}/java1.5.javafiles" unless="isJava15"/>
  <excludesfile name="${src}/ant1.6.javafiles" unless="hasAnt16"/>
  <excludesfile name="${src}/r-project.javafiles" unless="rispresent"/>

  <includesfile name="${src}/${module}.javafiles"/>
</javac>

Keep in mind that the *.javafiles are created with JavaDoc based on the CDK JavaDoc tags mentioned earlier.

The build.xml 2

While the above mechanism has been present for some time now, having jumbo50.jar in CVS made the situation a bit trickier: jumbo50.jar uses the 49.0 class format used in Java 1.5, and cannot be processed by Java 1.4 systems. Since the classpath used when compiling CDK source code is defined in configuration files for those modules in src/META-INF, the problem did not occur when compiling the modules. However, it did show an error in the reallyRunDoclet target today, when I was creating the *.javafiles with JavaDoc. The solution was trivial:

<target name="reallyRunDoclet" id="reallyRunDoclet"
        depends="compileDoclet" unless="dotjavafiles.uptodate">
  <javadoc private="true" maxmemory="128m">
    <fileset dir="${lib}">
      <include name="*.jar" />
      <!-- some jars require some Java version -->
      <exclude name="jumbo50.jar" unless="isJava15"/>
    </fileset>
    <fileset dir="${lib}/libio">
      <include name="*.jar" />
    </fileset>
    <fileset dir="${devellib}">
      <include name="*.jar" />
    </fileset>

    <doclet name=""

    <packageset dir="${src}">
      <include name="org/openscience/cdk/**"/>
    </packageset>
  </javadoc>
</target>


There is another area of interest: the FileConvertor, which is, sort of, CDK's variant of OpenBabel's babel. The FileConvertor must be compiled in all cases, so we need to conditionally instantiate the CMLWriter, which is not really a problem. However, compiling the source code is more troublesome: the CMLWriter class must be loaded at run time, and must not occur hardcoded in the source code.

In the past I have solved this by using .getInstance() constructs, but the ChemObjectWriter interface does not define this functionality, so I decided to use the java.lang.reflect mechanism:

} else if (format.equalsIgnoreCase("CML")) {
Class cmlWriterClass = this.getClass().getClassLoader().
if (cmlWriterClass != null) {
writer = (ChemObjectWriter)cmlWriterClass.newInstance();
Constructor constructor = writer.getClass().getConstructor(new Class[]{Writer.class});
writer = (ChemObjectWriter)constructor.newInstance(new Object[]{fileWriter});
} else {
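
The general pattern can be sketched in isolation as follows. This is not the FileConvertor code: the class `OptionalWriterLoader` is hypothetical, and `java.io.BufferedWriter` merely stands in for a writer class that may or may not be on the classpath. The class is looked up by name, its constructor taking a `Writer` is fetched through reflection, and `null` is returned when anything is missing.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.lang.reflect.Constructor;

/** Hypothetical helper: loads a Writer implementation by name at run time. */
class OptionalWriterLoader {

    /** Returns an instance of className wrapping target, or null if it cannot be loaded. */
    static Writer load(String className, Writer target) {
        try {
            // No compile-time reference to the class: only its name as a string.
            Class<?> writerClass = Class.forName(className);
            Constructor<?> constructor = writerClass.getConstructor(Writer.class);
            return (Writer) constructor.newInstance(target);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Class or constructor missing, or compiled for a newer class format.
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        Writer writer = load("java.io.BufferedWriter", sink);
        if (writer == null) {
            System.out.println("Writer class not available");
            return;
        }
        writer.write("written through a reflectively loaded class");
        writer.flush();
        System.out.println(sink);
    }
}
```

Catching `LinkageError` as well is a deliberate choice in this sketch: a class file in a newer format than the running virtual machine understands is reported that way rather than as a missing class.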

Now, this has been, by far, the longest blog item I have written so far. I hope it gave you good insight in some techniques CDK uses to deal with situations where functionality might, or might not, be present at build and at run time.

Thursday, December 08, 2005

Jumbo 5.0 and CML support in CDK

Tobias committed Jumbo 5.0 to CDK CVS, so that the CDK is now again up to date with the latest CML library. Note that Jumbo 5.0 requires Java 5.0.

At first all JUnit tests seemed to work, but apparently the CML2Writer tests were skipped because they were only run when Java 1.4 was found. I updated the test for the appropriate Java version, and then it turned out that most tests fail. So those running CDK from CVS who depend on CML writing: hang on, it will be fixed very soon.

Tuesday, December 06, 2005

UML diagram of CDK module dependencies

The code clean up after CDK's interfaces transition is in progress, and two CDK modules are now independent of the data module. After doing the core module, the standard was next, and I finished this yesterday. The dependencies in CVS now look like (click it to get a larger view):

This UML diagram was made with , and the source is in XMI in CVS.

I cannot stress enough the advantages of these changes:

  1. the code is cleaner
  2. module dependencies are cleaner
  3. impossible to use methods outside the interface
  4. the algorithms are independent of the data classes

The last advantage is really important: it allows alternative implementations of the data classes. For example, we could make debug data classes, which, unlike the normal classes, do all sorts of checks when using methods of these classes. For example, they can explicitly check that parameters are not null, of the right class, and generally make sense. This makes them, possibly, slower, but also more type safe, and as such great for debugging and development sessions.

Another important application of making the CDK library independent of the data classes (and only depending on the interfaces), is that we can have data classes shared with other Java libraries, such as JOElib, Octet, CML (Jumbo 5.0 is out!), and even proprietary libraries. This approach is already used in the CDK-Taverna library, and I anticipate much wider use with the arrival of Bioclipse.

Sunday, December 04, 2005

Planet Blue Obelisk website updates

After requests, I made the RSS and Atom feeds for the Planet Blue Obelisk more visible yesterday. They are linked in the menu on the right, and as alternative links to the document. These should show up in most recent web browsers as a feed icon in the lower right corner of the browser window. It is often an orange icon. I also added a 'Leave a comment' link to encourage people to leave comments on items. Please do!

Saturday, December 03, 2005

About JChemPaint's future and today's 2.1.5 release

Stefan has done an excellent debugging week on JChemPaint, while I have been late with a 2.1 release. Anyway, I've just uploaded a Java 1.4 compiled JChemPaint 2.1 series release. I was told the (reported) bug count is down to one, so I expect to see the next stable branch to be released soon (2.2 series).

But what after JChemPaint 2.2 gets released? Will a 2.3 developers branch be opened? Or will the JChemPaint application, as we know it, cease to exist, and make place for the Bioclipse JChemPaint plugin, that is being worked on?

It is worth mentioning the pros and cons of JChemPaint. One big pro is the applet version of JChemPaint, though free but closed source alternatives are available (e.g. MarvinSketch). Another advantage is the great semantics of the chemistry being drawn. For example, when drawing reactions, reactants are really marked as reactants, and are not just molecules left of an arrow. Moreover, JChemPaint is a great platform in which ideas can be tested! One of the key virtues of opensourceness. Cons include the limited number of templates, the lack of print quality graphics, and others. (Comments on JChemPaint most welcome.)

So what about this Bioclipse then? It is inherently SWT based, but currently the SWT_AWT bridge is used to embed the current JChemPaint and underlying CDK code as is. Unfortunately, this bridge is using proprietary code from Sun (sun.awt classes), which makes it impossible to use with free virtual machines.

But there is also the option of using the SWT drawing classes. This has the advantage that it can be run with free virtual machines, and that it can even be compiled to native code. It requires serious rewriting of code in the JChemPaint and CDK code base. But CDK's Renderer2D needs a rewrite anyway: it does not even use Swing's Java2D efficiently (try to figure out how it transforms atomic 2D coordinates into screen coordinates!). Some efforts have been ongoing, but a rewrite from scratch, with a better, more modular design, cannot hurt at all.

Wednesday, November 30, 2005

Getting Started with Eclipse and the SWT

Getting Started with Eclipse and the SWT is a very nice set of introductory tutorials on working with SWT and Eclipse in general. The tutorials cover the basic and advanced SWT widgets, SWT layout, and several other interesting topics.

Now that Bioclipse is gaining speed, it is a must-read.

KDE 3.5 is out

KDE 3.5 was released with lots of changes. SuperKaramba is now a standard KDE application and is neatly integrated. It allows embedding themelets on your desktop background:

It shows several themelets: the weather, a calendar, a toolbar with applications, a FoldingAtHome monitor, the contents of the clipboard, the music that is playing (Cake) and a simple todo list. All customizable up to the pixel.

And before I forget: a nice new Kalzium release!

Monday, November 28, 2005

A Blue Obelisk blog Planet

Today I set up a blog planet for Blue Obelisk members. First I tried Chumpologica but it did not read Atom feeds.

Next in line was Planet, which turned out to be used by many big planet sites, like Planet Debian. It also works with Atom feeds in general, but not well with Atom 1.0 feeds, like that of Carsten. After some googling I found a patched version which did the job.

The result is at, but I hope that someone can arrange a

Sunday, November 27, 2005

Open Source Swing: Jmol renderer runs!

Where I was able to mention earlier that JChemPaint now runs with free (as in open source) Java virtual machines, I just tried to run the core Jmol renderer, using the which comes as an example:

The screenshot was made with jamvm 1.3.3 and classpath 0.19.

It is very slow, however. I have not tried it with other free virtual machines, which are supposedly faster. It is a good start nevertheless: it means that a Jmol based Bioclipse plugin will work with free virtual machines too.

Update: added a nicer screenshot.

Wednesday, November 23, 2005

Machine crash; SVN went along

It doesn't happen often, but my machine crashed two hours ago. Not a big deal, because I have my important files in SVN. Oh wait, SVN had a commit in progress during the crash. So, svn recover. Mmmm... doesn't work either. OK, SVN FAQ: try db_recover. That worked. No, it did not: svn commit still not working for the files I was trying to commit. Fortunately, I make regular SVN db backups so I created a brand new SVN repository from scratch and recovered the backup. That worked. Really.

Monday, November 21, 2005

Bioclipse: the chemo-/bioinformatics workbench

Some weeks back there was the CDK5AW, the CDK 5th anniversary workshop. A small group of international open source chemo- and bioinformatics software developers met, among them two from Sweden. It was then decided to generalize their work, resulting in Bioclipse:

It's heavily using the Eclipse Rich Client Platform, making additional plugins trivial. OK, if this does not convince you: check the screenshots on the Bioclipse website.

It's a killer, really! Ola, Martin: great work!

PS. I am going to try to run it with free Java virtual machines this weekend, but if you have a working solution earlier than that, please leave a comment and screenshot in the comments.

Sunday, November 20, 2005

Open Source Swing: JChemPaint runs!

Thanx to Mark's encouragements, I tried to run Jmol and JChemPaint with jamvm.

Jmol fails with a NullPointerException, but JChemPaint runs! And note that this was not even running with the latest of the latest; just recent packages from Kubuntu! Yes, there are some glitches, but I'm happy nevertheless!

Friday, November 18, 2005

The goal: a live chemblaics CD

This evening I have been looking at the KNOPPIX customization howto, and ran many of the interesting commands. I've set up an environment with Kalzium, OpenBabel, CDK, jython, PyMOL, and for development I included gcj and Eclipse. At some later point I will include kfile_chemical too, but I want to make a deb package first.

Moreover, I also wanted it to include JChemPaint, Jmol and Taverna (with the CDK extension). However, these depend on Swing, which is not sufficiently provided by open source Java virtual machines. I attempted gij 4.0, kaffe and sablevm, all without success.

A live CD with all the open source chemo- and bioinformatics tools would be a real killer. We could take a burned live CD with us to conferences and have others run our software on their laptops! But we need to stop using Swing. Fortunately, there seems to be a serious project going on to port JChemPaint and Jmol to a free Java GUI environment, so maybe we can have the live CD up and going before the 2006 conferences start.

Thursday, November 17, 2005

Back from the 1st GCC

OK, just back from the first German Chemoinformatics Conference, which I enjoyed very much. A rather interesting program, and lots of interesting posters too. You can read the programme online, and I will not spend too many words on that (at least not now). But what I will do is point out some interesting posters here.

One poster was on the Molecular Query Language (MQL) by Ewgenij Proschak from Frankfurt. You can read more on this in the latest CDK News as it is implemented for the CDK too. The opensource implementation is expected next year.

Another interesting poster was on the use of ontologies to connect chemistry and biology. This poster was by Juergen Harter from BioWisdom, a Cambridge, UK based company.

Marc Zimmermann had a poster on the chemical OCR variant, called chemical structure recognition (CSR). This process converts images, for example scanned from literature, into a connectivity table. Difficult task, indeed. This page contains some information about this project.

There were other interesting posters too, so I will probably report on those later. But do feel free to leave comments on this blog post, discussing other interesting posters.

Friday, November 11, 2005

Going to the German Chemoinformatics Conference

This Sunday the first German Chemoinformatics Conference starts in Goslar. It's an interesting programme, with presentations on the InChI, PubChem, 25 years of chemoinformatics, the chemical semantic web, and much more.

Among these presentations is mine, on comparing crystal structures (PDF) and deducing cell parameters. I also have a poster on QSAR.

I'll arrive on Saturday afternoon in Goslar, so leave a message at the conference hotel if you want to meet up, and talk about my work, or yours, or the CDK, KDE, JChemPaint, Jmol, kfile_chemical, Kat/Chemistry, BlueObelisk, Eclipse, R, or whatever else... I plan to have a modest German meal and one or two beers in the evening.

BTW, after Belém (Lissabon), Sintra, Boppard, Kinderdijk, Hoorn and Cologne, it's the 7th UNESCO world heritage site I'm visiting in just 14 months! Can't we just have conferences in Hawaii and the like, as they do in other fields?? Oh, wait, we do: EuroQSAR is on a cruise boat.

Thursday, November 10, 2005

Scons and bksys for kfile_chemical

Not so long ago, it was decided that KDE 4.0 will use SCons as a configuration and building tool, instead of the autotools and make: the common ./configure && make && make install which has served the open source community very well for so long.

SCons is different in several ways. One of these is that the tar.gz packages it produces are some 500kB smaller, which makes a huge difference for kfile_chemical which is now 121kB instead of 635kB.

Now, the KDE community, or Thomas Nagy to be precise, developed a helper for KDE software, called bksys. Version 1.5.1, however, did not contain an example directory for kfile plugins, but I managed to work something out starting from the configuring scripts from kdissert, and ended up with these SConstruct and config.bks.

Now, I haven't figured out how to include the translations, but will figure that out sooner or later... for now, I'm quite happy with the new build system.

Tuesday, November 08, 2005

An R GUI: rkward

The great thing about open source is that... it's open.

When I was browsing the internet just now, I dropped in on KDE Dot News. In the right-side column, there is a feed of new KDE software. A new version of my favorite music player, amarok, lured me to the KDE-apps website, where I saw that rkward was the latest announcement. The funny name, and the categorization as scientific, triggered some interest on my side, and it turned out to be a graphical frontend to my favorite statistics program, R.

Ok, they had a Debian package, and the debian/ build dir in the tar.gz so I downloaded it and started making a Kubuntu 5.10 package. While doing this I saw some notice about the R syntax highlighting used, which conflicts with the older version in the Kate packages.

Then I realized that a long time ago, I wrote such syntax highlighting for Kate, so my attention was lured again. And, indeed, they use my syntax highlighting, though extended later (somewhere down the page).

And this makes me happy. The syntax highlighting was useful to me in the past, but apparently to a lot of other people too. And because I released it as GPL, back then, it now appears in rkward! Yes, I really like open source :)

When to stop including QSAR model variables...

Yesterday I reviewed an article which published a QSPR model which looked something like:

y = 151 + 50p1 - 12p2 - 0.006p3

with quite OK prediction results (R=0.9880). But I was not quite comfortable with the coefficient for the p3 variable. The article did not calculate significances for the coefficients, so it was not obvious from the article whether it was useful to include them. I then looked at the range for p3, which was 110-150; so, the maximal influence this variable can have is 150*0.006 = 0.9. Now, the experimental values given in the article were rounded to integers, indicating that the maximal effect of the p3 variable is smaller than the experimental error! It's even worse when you consider the difference between the min and max value (40): then the influence would be even smaller, 40*0.006 = 0.24 (assuming that most model methods would put the mean temperature effect in the offset, 151 in this case).
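This back-of-the-envelope check is easy to script. The sketch below is only an illustration with made-up class and method names: it multiplies the absolute coefficient by the largest absolute value in the range (the maximal contribution) and by the width of the range (the spread that remains once the mean effect sits in the offset), using the p3 numbers quoted from the article.

```java
class VariableInfluence {

    /** Largest absolute contribution of the term coefficient*p for p in [min, max]. */
    static double maxContribution(double coefficient, double min, double max) {
        return Math.abs(coefficient) * Math.max(Math.abs(min), Math.abs(max));
    }

    /** Spread of the contribution over the range; the mean part is absorbed by the offset. */
    static double contributionSpread(double coefficient, double min, double max) {
        return Math.abs(coefficient) * (max - min);
    }

    public static void main(String[] args) {
        // p3 from the first model: coefficient -0.006, observed range 110-150.
        System.out.println("max contribution: " + maxContribution(-0.006, 110, 150));
        System.out.println("spread:           " + contributionSpread(-0.006, 110, 150));
    }
}
```

Both numbers stay below 1, the rounding step of the integer-valued experimental data, which is the whole argument in two lines of arithmetic.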

Today, I reread an article with a similar issue. The model was something like:

y = -0.81 + 0.03*p1 + 0.009*p2

Here, max(p2)-min(p2) is smaller than 100, so the maximal effect of the variable would be in the order of 0.9, which is of the same order as the root mean square error of prediction (RMSEP) for this model. Indeed, the article already states that the coefficient is only significant at the 95% level, and not at the 99% level. But, without having calculated the RMSEP for a model without the p2 variable, I would guess that leaving it out would give equally good prediction results.

Concluding, I would say that the p2 variable does not include relevant information.

Do you think it is reasonable to include the p2 variable in the second model?

Monday, November 07, 2005

Ubuntu Dapper will include chemistry features

I just read that the Kubuntu team wants to include Kat in the dapper release (scheduled for April 2006). Kat is (to be) the KDE equivalent of Google's desktop search bar.

This is great news for us chem-bla-icians, as Kat has support for full text searching of chemistry files! Let's see if I can get the Kubuntu team to package up kfile_chemical too, which will extend Kat (and KDE in general), with extraction of meta data from chemical documents.

Update: Dapper will be released next year, not in 2007.

Wednesday, November 02, 2005

Open Source data mining in chemoinformatics

At the 7th International Conference on Chemical Structures Jeroen Kazius has a poster on finding discriminative substructures, that is, molecular fragments which can discriminate between two activity classes. The software is released as Gaston, is written in C++ and has the GPL license.

Later I encountered MoSS which has the same goal, but uses a different algorithm. MoSS is written in Java and uses the LGPL license. MoSS reads STN and SMILES as input, which might not be optimal for all users, so a CDK port comes to mind.

R/CDK install fails on GCC 4.0 systems

Some time ago Rajarshi Guha introduced R bindings for the CDK (see his CDK News articles), and today I tried to install his rcdk package that makes it happen.

However, it requires SJava which compiled fine on other machines, but not on my AMD64 machine. The problem seems to be related to the GNU GCC 4.0 compiler I have installed. Compiling with 3.4 works fine, but 4.0 complains with:

CtoJava.cweb:215: error: static declaration of 'std_env' follows non-static declaration
CtoJava.cweb:195: error: previous declaration of 'std_env' was here

Googling taught me that I am not the only one with this problem, but I did not find any solution. If you know how to fix this problem, please leave a message in the comments.

Tuesday, November 01, 2005

The annual Lunteren meeting

Most Dutch chemists have their annual Lunteren meeting, and so do I. Lunteren is a small village on the Veluwe where nothing much can be done, except for listening to the presentations. I participate in the Lunteren meeting for analytical chemists, i.e. HPLC, MS, GC and all their combinations up to and including HPLC/MS/MS, and since a few years the Lab-on-a-Chip stuff. And, as such, in many cases a lot of details on how to use and develop these methods.

For a computational chemist, this often is too much practical detail on too little -ics. Fortunately, the proteomics, genomics, etc is a strong upcoming funding subject, so data analysis is getting in their picture too. Which is good for someone with a chemometrics/chemoinformatics background as funding in that area is getting smaller every year.

My presentation went reasonably well, as far as I can tell myself. I was very nervous with both my professor and some 150 other people in the audience, but managed to not wander off the main topic. However, I was told I was a bit too monotone, but that's an unfortunate effect of being so nervous.

Sunday, October 30, 2005

CDK News

Just finished applying the latest spelling error fixes to CDK News 2.3. It took me some three hours to finish up the 12 pages, mostly because of the need to recompile the PDF after each change to make sure that nothing in the layout got broken.

The content contains four communications:

  • An Open Framework for Online QSAR Modeling
  • Atom types in the CDK
  • MQL - Development of a novel substructure query language
  • Stereochemistry detection in the CDK

And, of course, the recurrent Editorial, FAQ and ChangeLog.

Saturday, October 29, 2005

kfile_chemical gets XYZ, Mol2, SMILES, VMD and GenBank support

Jerome Pansanel contributed new patches for kfile_chemical; on Monday actually, but I have been busy with other things, among which a presentation I have to give next Monday for some 100+ analytical chemists. The patch adds support to KDE for five new chemical MIME types: XYZ, Mol2, SMILES, VMD and GenBank. Therefore, I just released a new version (0.10), and added an announcement to

As a reminder, version 1.0 will have all chemical mime types supported, after which I will initiate a process to formalize the meta data we want the kfile plugins to give, which will lead to the 2.0 release. So far, I had in mind that the next step was to make the plugins ready for KDE 4.0, but I became aware of the mime magic as implemented in KMimeMagic.

So, concluding, I might squeeze in another beta release 3.0, where this magic gets addressed; knowing that it will definitely not work for all files, but hopefully it will for files with stupid file extensions like .log.

Thursday, October 27, 2005

My birthday (31) and the Adsense

Today is my 31st birthday, nearing the halfway point now (statistically speaking). Also, by now I should have had my scientific moment of glory, otherwise I can forget that Nobel prize. Oh well, forget it.

Have you seen those small advertisements on this page (RSS users, please visit the website :)? Funny links they give. The system is very nice btw: it waits for Google to index the blog and then decides which ads are relevant. Hence, the links to small chemoinformatics companies. Nice to browse.

Disclaimer: when you click any of the ads, I'll get a bit of money. But don't start clicking away, otherwise Adsense will get upset, and then I get nothing.

Tuesday, October 25, 2005

More cdk.interfaces updates

Yesterday I had some spare time before going to a meeting about the Woordenboek Organische Chemie, so I was boldly going where no one has gone before: getting the CDK module core independent of the data module. Why, you might wonder...

Well, if as many CDK modules as possible become independent of the classes implementing the data interfaces, i.e. those classes that implement the org.openscience.cdk.interfaces interfaces, then it becomes possible to make alternative implementations. For example, an implementation that also implements the Octet interfaces, or an implementation that extends the JOELib classes. In that way, combining these libraries becomes as easy as writing a blog :)

Anyway, today I finished the AtomTypeFactory, and only the IsotopeFactory remains to be updated. Since many classes in the CDK library use these two classes, patches had to be applied throughout the library. And code outside the CDK library might be broken now, so be aware...

Monday, October 24, 2005

JChemPaint applet download size: 538kB

A good functional molecular editor is of great importance to the chemical web. There are a few editors with a small download size around. JChemPaint has been available as an applet for some time now, but the download size has been large. The situation has improved considerably over the past months, and the download size after which the applet shows up in your web browser is now down to 538kB. A live demo is available from

The applet, however, does have the same functionality as the full application. When a feature is used that is not available from the jars downloaded first (which make up the 538kB), additional jars are downloaded.

The applet is not bugless yet. For example, drawing reactions does not seem to work :( But, it's really getting somewhere. Congrats to the applet development team!

Sunday, October 23, 2005

Wrapping up...

Less than three months before the end of the contract for my PhD project. And not nearly done yet. Weekends are now spent on wrapping up bits of experimental research into something like a coherent article. And there are still lots of calculations to do to answer the open questions. FreeMind is helping me organize my thoughts.

Open source chemoinformatics is a welcome diversion now and then. Yesterday I worked on some easy-to-fix CDK bugs, such as the QueryAtomContainer not yet being correctly updated for the recent cdk.interfaces changes. Fixed now. I also touched a lot of code when updating the FSF address in the LGPL license notice, and when I modified the construction of CDKExceptions to set the causing Throwable. Also helped out Carsten a bit with adding his data from Kalzium to the Blue Obelisk data repository.

Another nice diversion is The Battle for Wesnoth. Just got killed, though.

Friday, October 21, 2005

Viagra saves the environment

This week there was an interesting article in the Dutch Intermediair about Viagra. They cite an article in Environmental Conservation and state that it saves the environment, as it has greatly reduced the market for animal parts used in traditional Chinese medicine to address the same problem as Viagra does.

Viagra: good for the environment, good for you! ;)

You don't see this often, though. Public opinion, at least in my social environment, is that chemicals (in general) are bad for the environment, whatsoever... Natural products are much better. Wait, those are chemicals too... but that is too complicated for most :(

BTW, viagra is InChI=1/C22H30N6O4S/c1-5-7-17-19-20(27(4)25-17)22(29)24-21(23-19)16-14-15(8-9-18(16)32-6-2)33(30,31)28-12-10-26(3)11-13-28/h8-9,14H,5-7,10-13H2,1-4H3,(H,23,24,29)/f/h29H.

Thursday, October 20, 2005

CDK News 2.3 and InChI's

CDK News 2.3 is scheduled for this month, and was originally planned to be distributed at the CDK5AW event. So, it's a bit late. But the editorial process is converging... I realized that I forgot to mention the requirement for InChI's whenever molecules are given. So, I'm now in the process of going through the issue and adding the missing identifiers...

Wednesday, October 19, 2005

Jmol's FAH team in Top 800

The Jmol FAH team has just entered the Top 800 of most active Folding@Home teams. And that's the point where they start monitoring contributions on a user level. Thus, I can now see how active I am within the team. And so can you! Join the team, and let's get into the Top 500!

InChI meta data with kfile_chemical

I've just uploaded kfile_chemical 0.9. It has new translations for ES and DA, and plugins for InChI files. It will extract the InChI string as meta data (and will thus be used by the KDE desktop search Kat), and the InChI version number.

Thinking about this, it might be useful to extract all layers as meta data, so that one can search on chemical formula and even connectivity, and find all matching structures. Not really close to substructure search, but we'll tackle that later :)
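
To make the layer idea a bit more concrete, here is a minimal Python sketch of how an InChI string could be cut into its layers. This is not the kfile_chemical code (that plugin is not written in Python); the function name split_inchi_layers is made up for this example, and it only relies on the fact that layers are separated by slashes, that the formula layer has no prefix letter, and that the other layers start with a single lowercase letter.

```python
def split_inchi_layers(inchi):
    """Split an InChI string into an ordered list of (layer, value) pairs.

    Hypothetical helper for illustration only. A list is used instead of
    a dict because a layer letter can occur more than once, for example
    a second /h after the fixed-hydrogen /f layer.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string: %r" % inchi)

    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    rest = parts[1:]

    # The formula layer, if present, comes first and has no prefix letter.
    if rest and rest[0] and not rest[0][0].islower():
        layers.append(("formula", rest[0]))
        rest = rest[1:]

    # All remaining layers are identified by their first (lowercase) letter.
    for part in rest:
        if part:
            layers.append((part[0], part[1:]))
    return layers


if __name__ == "__main__":
    # Ethanol, written in the same InChI=1/ style used on this blog.
    for layer, value in split_inchi_layers("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"):
        print(layer, value)
```

With the layers available separately, the formula and the connectivity (c) layer could each be stored as their own meta data field, which is what a desktop search would need to match on them.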

Tuesday, October 18, 2005

CDK-Taverna fully recognized

After asking about it, Tom explained to me how Taverna can pick up the apiconsumer.xml file from jars: just copy it into the root directory of the jar package. Easy as that.

So, users now only need to copy the cdk-taverna.jar into the taverna-workbench-1.3/lib/ directory to have a nice chemoinformatics workbench environment. I'll upload the jar to CDK's project page right now.

Monday, October 17, 2005

CIA statistics for Blue Obelisk

I have just enabled CIA statistics for the Blue Obelisk SVN: /stats/project/cdk/blueobelisk.

It's done by using the client script, which is called from the $REPOS/hooks/post-commit hook on the SVN server. The client script is slightly hacked to hard-code the module name, which otherwise did not show up on the chat channel.

Saturday, October 15, 2005

Single PDFs for CDK News articles

This week was the CDK5AW event, a workshop for users and developers of the Chemistry Development Kit (CDK). After talking with other developers we agreed on creating PDF and HTML versions of single articles that appeared in the CDK News newsletter. Well, I haven't figured out how to create nice HTML (latex2html does not give nice results; any ideas, anyone?), but for the PDF version I now have a pipeline.

For each article, a split.config file determines which pages from the CDK News issue PDF should be extracted. To do this, I used the PDF ToolKit, or pdftk for short (comes with Debian/Ubuntu by default). A Perl script reads these config files, and the pipeline then creates a PDF file for each article. Currently, I'll only have it do the feature articles; that is, not the ChangeLog, Editorial, Literature and FAQ. For those you'll need to download the full issue. If you don't like that, let me know :)
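
The pipeline itself is a Perl script, but the idea fits in a few lines. Below is a rough sketch in Python, not the actual script. The config format (one "name: first-last" line per article) and the function names are assumptions made up for this example; the real split.config may look different. The only thing taken from pdftk itself is its cat operation, as in pdftk in.pdf cat 3-5 output out.pdf.

```python
import subprocess


def parse_split_config(text):
    """Parse an assumed config format: one 'name: first-last' per line.

    Blank lines and anything after '#' are ignored. This format is
    hypothetical and only serves the example.
    """
    articles = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, pages = line.split(":", 1)
        articles.append((name.strip(), pages.strip()))
    return articles


def pdftk_commands(issue_pdf, articles):
    """Build one pdftk 'cat' command per article, without running it."""
    return [
        ["pdftk", issue_pdf, "cat", pages, "output", name + ".pdf"]
        for name, pages in articles
    ]


def split_issue(issue_pdf, config_text):
    """Run pdftk once per article (requires pdftk to be installed)."""
    for command in pdftk_commands(issue_pdf, parse_split_config(config_text)):
        subprocess.check_call(command)


if __name__ == "__main__":
    # File names and page ranges below are placeholders, not the real ones.
    example = "# article: pages\nqsar: 3-5\natomtypes: 6-8\n"
    for command in pdftk_commands("issue.pdf", parse_split_config(example)):
        print(" ".join(command))
```

Keeping the command building separate from the actual pdftk call makes it easy to check the page ranges before any file is written.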

Ok, you will probably have noticed that the almost server is down (Googling for 'CDK News' allows you to read the cache!), and the PDFs will be uploaded there asap. For those not familiar with CDK News, the articles are FDL, so feel free to copy and distribute them. If you reuse the text and update it, which is allowed too, please let us know.


This new blog will deal with chemblaics in the broader sense, and will not be restricted to research in this field in which I am involved personally.

Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The common denominator seems to be molecules, but I might be wrong there.

The big difference between chemblaics and areas such as cheminformatics, chemoinformatics, chemometrics, proteochemometrics, etc., is that chemblaics only uses open source software, making experimental results reproducible and validatable. And this is a big difference with how research in these areas is now often done.