Tuesday, November 27, 2007

Be in my Advisory Board #1: being a good Open Science citizen

I recently saw that blogs gained a poll feature. From now on, I will try to be a bit more Open Science, in addition to Open Source. From now on, you can be in my Advisory Board. To do so, vote on my next chemblaics (aka Open Source Chemoinformatics) project. The poll can be found on the left side of this blog. Associated which each poll, which I may run more or less frequently depending on the time of year, will be one blog post where I introduce the options. Options not mentioned, or completely different things, you would like to suggest me to do, can be left as comments to these items.

Finishing the new JChemPaint code
Goal of this option is to use the code written by Niels in his ProgrammeerZomer project to implement a new JChemPaint based on Java2D and independent of the widget set used (Swing/AWT/SWT/...).

CML-roundtripping of the CDK data model
The goal of this project is to ensure that all information the CDK data model can hold can be roundtripped in CML.

Integrating InChI-NestedVM in Bioclipse
Rich is, besides an excellent blogger, also someone who is not afraid to try new things. Recently, he experimented with compiling the InChI library into a Java executable. Bioclipse already is able to generate InChIs, using the code written by Sam Adams for the CDK, but a InChI/NestedVM plugin for Bioclipse could make a nice show case.

Writing CDK News articles
On the other hand, you might find that I should focus on getting a new CDK News issue out, for which we are stilling lacking (finished) contributions.

It's up to you. Deadline in about two weeks; still got some other things to finish :)

Monday, November 26, 2007

Metabolomics workflows in Taverna

My current jobs description is to speed up metabolomics data analysis, and finally got around to making a first relevant workflow for Taverna, using the webservices just posted over at ChemSpider:

I uploaded the source to MyExperiment, so anyway can play with it. There is much to improve, such as using CDK-Taverna for further analysis of the results.

I am not sure if opening the workflow in your Taverna installation will automatically set up the WDSL scavenger for the ChemSpider services, which are available in a HTTP version too, btw. If not, right click on the Available Processors folder, and pick Add new WDSL scavenger... and point it to the URL The result should look like:

Oh, and please note this comment:
    These services are offered free of charge to our users during this period of testing, validation and feedback. Some of these services will be made available commercially in the future and we are proactively informing you of our intention to do this. It is likely that these services will remain available to academia at no charge. Please contact us at feedbackATchemspiderDOTcom with feedback and questions.

So, I do not know when my workflow will stop working.

Thursday, November 22, 2007

MetWare: metabolomics database project started on SourceForge

The Applied Bioinformatics at PRI group where I now work in Wageningen and the group of Steffen Neumann in Halle have started the MetWare project on Sourceforge to develop opensource databases for metabolomics data.

The databases design will be based on and ideally compatible with proposed standards like ArMet (DOI:10.1038/nbt1041) and those recently written up by the Metabolomics Standards Initiative (see the issue around DOI:10.1007/s11306-007-0070-6).

One important design goal is that the project will use BioMart, which will allow easy integration of the database content in data analysis programs like Taverna and R using the biomaRt package (see DOI:10.1093/bioinformatics/bti525).

Though the software will be opensource, it is yet unsure how much data will be open.

Tuesday, November 20, 2007

When standards fail...

Jim shows that some people do not think webservices standards are complex enough in itself:

Monday, November 19, 2007

An R-based genetic algorithm

During my PhD I wrote a simple but effective genetic algorithm package for R. Because there was a bug recently found, and there is interest in extending the functionality, I have set up a SourceForge project called genalg.

The package provides GA support for binary and real-value chromosomes (and integer chromosomes is something that will be added soon), and allows to use custom evaluation functions. Here is some example code:
# optimize two values to match pi and sqrt(50)
evaluate <- function(string=c()) {
returnVal = NA;
if (length(string) == 2) {
returnVal = abs(string[1]-pi) + abs(string[2]-sqrt(50));
} else {
stop("Expecting a chromosome of length 2!");
monitor <- function(obj) {
# plot the population
xlim = c(obj$stringMin[1], obj$stringMax[1]);
ylim = c(obj$stringMin[2], obj$stringMax[2]);
plot(obj$population, xlim=xlim, ylim=ylim, xlab="pi", ylab="sqrt(50)");
rbga.results = rbga(c(1, 1), c(5, 10), monitorFunc=monitor,
evalFunc=evaluate, verbose=TRUE, mutationChance=0.01)
plot(rbga.results, type="hist")
plot(rbga.results, type="vars")

Friday, November 16, 2007

Molecules in Wikipedia without InChIs #3

Third in the series of blogs about molecules in Wikipedia without an InChI (see also #1 and #2). There a certainly false positives, but here's the updated list:'s_catalyst

Wednesday, November 14, 2007

Last Call for Open Laboratory 2007

Pedro reminded me of the last call for Open Laboratory 2007, which prints the best blog items of 2007 in book form. The list of chemistry contributions is not so large yet, so go ahead and nominate some of cool chemical blog items of the last year.

I will post my shortlist later this week.

Monday, November 12, 2007

Scintilla and on Linux 2.6.17+

That's why blogging works! I reported last Friday on using my Wii for reading Scintilla and Alf replied:
    It is the Linux kernel, yes: TCP window scaling was switched on by default in kernels since about a year ago (and in Vista too, I think), and one of our routers or firewalls doesn't like it. We're trying to get them upgraded, but it takes a while...

Ah, the trick word: TCP windows scaling. A quick google turned up a workaround in John's Tidbits blog:
    There are 2 quick fixes. First you can simply turn off windows scaling all together by doing

    echo 0 > /proc/sys/net/ipv4/tcp_window_scaling

    but that limits your window to 64k. Or you can limit the size of your TCP buffers back to pre 2.6.17 kernel values which means a wscale value of about 2 is used which is acceptable to most broken routers.

    echo "4096 16384 131072" > /proc/sys/net/ipv4/tcp_wmem
    echo "4096 87380 174760" > /proc/sys/net/ipv4/tcp_rmem

    The original values would have had 4MB in the last column above which is what was allowing these massive windows.

    In a thread somewhere which I can’t find anymore Dave Miller had a great quote along the lines of

    “I refuse to workaround it, window scaling has been part of the protocol since 1999, deal with it.”

That worked for me. I think Dave Miller is right, but can't resist reading Scintilla and on my desktop too ;)

Friday, November 09, 2007

Using the Nintendo Wii for serious science...

On my desktop, the Scintilla and websites do not work. It is not a browser problem, but has something to do with TCP/IP packages not reaching its destination: the browser. Euan told me they are aware of the problem, but apparently have not found a solution yet.

However, my Wii does not have the problem, which makes me wonder if it is a disagreement between the Nature server and my Linux kernel... Anyway, this is what the two website look like (first Scintilla, then

(BTW, that was one very nice piece of work by Rich! Make sure to also read the follow up.)

The only real disadvantage is that it does not integrate well with the things I do daily. If I see some interesting post, and would like to tag it on my account, I have to google for it on my desktop :(

(You thought I was going to talk about F@H or so, didn't you? :)

Thursday, November 08, 2007

Cytoscape in Amsterdam

Right at this moment I am listening to Andrew Hopkins from Dundee on chemical opportunities in system biology, at the Cytoscape conference in Amsterdam. Anyone who wants to meet up over lunch or coffee break?

Wednesday, November 07, 2007

Comparing JUnit test results between CDK trunk/ and a branch

I have started using branches for non-trivial patches, like removing the HückelAromaticityDetector, in favor of the new CDKHückelAromaticityDetector. I am doing this in my personal remove-non-cdkatomtype-code branch, where I can quietly work on the patch until I am happy about it. I make sure to keep it synchronized with trunk with regular svn merge commands.

Now, the goal is that my branch only fixed failing JUnit tests, not that it creates new regressions. To compare the results between two versions of the CDK, I use these commands:

$ cd cdk/trunk/cdk
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ../../branches/egonw/remove-non-cdkatomtype-code/
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ../../..
$ grep Testcase branches/egonw/remove-non-cdkatomtype-code/reports/*.txt | cut -d':' -f2,3 > branch.results
$ grep Testcase trunk/cdk/reports/*.txt | cut -d':' -f2,3 > trunk.results
$ diff -u trunk.results branch.results

The last diff commands gives me a quick overview of what has changed. See get the statistics, I can do:

$ diff -u trunk.results branch.results | grep "^-Testcase" | wc -l
$ diff -u trunk.results branch.results | grep "^+Testcase" | wc -l

The first gives me the number of JUnit tests which are now no longer failing, while the second
gives me the number of tests which are new fails. Ideally, the second is zero. Unfortunately, not yet the case :)

Tuesday, November 06, 2007

Evidence of Aromaticity

I have been working on a new atom type perception engine for the CDK, after having decided that the existing atom type lists where not sufficient for the algorithms we have in the CDK. The new list is growing in size, and basically contains four properties (besides element and formal charge):
  1. number of bounded neighbors
  2. number of pi bonds (or double bond equivalents)
  3. number of lone pairs
  4. hybridization state
This seems to be a minimal and accurate set to cover a rather good deal of chemoinformatics. I have yet to make the mappings of the new atom type list with existing lists for force fields, and radicals are missing too. However, the following algorithms in the CDK seem to translate rather well:
  • hydrogen adding
  • aromaticity detection (Hückel rules)
I still have to rework the double bond perception.

Now, aromaticity is a fuzzy concept, and there is no general agreement on what it is. Some say it is smelly compounds, others say ring systems which apply to the Hückel rule. Based on the new atom type list, I have rewritten the Hückel aromaticity detector and it applies these rules:
  • only single rings and two fused non-spiro rings
  • 4n+2 electrons
  • no ring atoms with double points not in the ring too
This approach differs in two ways from the old code: it no longer tries to test all ring systems, which required to use the CDK AllRingsFinder algorithm which combinatorial generates all possible ring systems. The new code only considers ring systems with up to two single rings. Aromaticity beyond that is even less well defined than aromaticity in general.

The other difference is that the ring system must not have ring atoms which have a double bond which is not part of the ring too. The classical example is benzoquinone (InChI=1/C6H4O2/c7-5-1-2-6(8)4-3-5/h1-4H) which is not aromatic, even though it conforms the 4n+2 rule (image from PubChem):

Evidence of Aromaticity
The final rule, of course, is what nature tells us what is aromatic and what is not. There are many other details to aromaticity than I just covered. For example, take azulene (InChI=1/C10H8/c1-2-5-9-7-4-8-10(9)6-3-1/h1-8H). All atoms are aromatic, but not all bonds (also PubChem):

These things are complex, but the rise of Open Data helps us out, as well as increasing computing power. Peter has been running two rather projects which may help us out: CrystalEye (Nick: no blog?) and OpenNMR.

NMR shifts will give us experimental backup on our notion of aromaticity, and so do bond lengths. I asked Peter about this, and whether OpenNMR predicted shifts could indeed confirm aromaticity of compounds, and he replied and showed that the predicted spectra could be used to distinguish between C-C and C=C bonds.

I commented the following (which was in moderation at the time of writing), and that gets us to experimental evidence for aromaticity:
    Thanx for the elaborate answer. What I had in mind was the question whether NMR shift predictions can be used to tell me if a certain ring system is aromatic or not, and in case of fused rings, which atoms and which bonds are aromatic and which not. I’m sure the prediction error for 1H NMR shifts is well below 2ppm, and more in the order of 0.2ppm.

    But maybe I should be asking, can I use CrystalEye to decide if ring systems are “aromatic”, and in case of two rings fused together (non-spiro), which atoms and bonds are aromatic and which not. Aromaticity is a fuzzy concept, with various definitions. I would be interesting in linking what the expert considers ‘aromatic’ (or SMILES, or the CDK, or …) with what the QM chemistry (via bond lengths or NMR shift predictions) and crystal structures (via bond lengths) has to teach us. The null hypothesis being that the bonds are not delocalized (bond length) and that no ring current is found (NMR shifts, 1H in particular).

    Regarding those bond lengths, ‘aromatic’ bonds show a bond length in between that of single and double bonds (e.g. see this random pick). The CrystalEye data does not reflect that really, and only a trimodal histograms shows up. Indeed, the C#C peak is *very* low, around 1.2A :) Apparently, the triple C#C bond order is underrepresented in nowadays crystallography.

    Maybe aromatic C:C bonds are underrepresented too, or can the absence of a peak around 1.40A be explained otherwise? I would at least have expected a shoulder or deviation in peak shape of the peak at 1.37A.

This is what the histogram looks like (for archival reasons):

Monday, November 05, 2007

Glueing BioMoby services together with JavaScript in Bioclipse

Ola has been doing a good job of integrating BioMoby support into Bioclipse. Earlier he completed a GUI for running BioMOBY services, and added more recently a JavaScript wrapper too, using the Rhino plugin developed by Johannes.

For example:
       console =; 
    moby =;
    biojava =;



Today he explained how to create convenience JavaScript shortcut, to reduce the typing.

Screenshots and status of the Bioclipse-BioMoby work is available from the wiki.