Saturday, December 30, 2006

Modern chemistry in the CDK: beyond the two-atom bond

Rich recently blogged about the limitations of the two-atom bond representation often used in chemoinformatics, triggered by the four ferrocene entries in PubChem. In reply to himself, Rich described FlexMol, an XML language that can describe bond systems that involve more than two atoms.

Obviously, the problem originates from chemists' lack of mathematical knowledge: current chemoinformatics depends heavily on graph theory, where each atom is a vertex and each bond an edge. This has the advantage that we can borrow all algorithms that work with graph representations, such as Dijkstra's algorithm to find the shortest path between two vertices. Or, in chemical language, an algorithm to calculate how many bonds two atoms are apart in a molecule.
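To make the graph view concrete, here is a minimal sketch in plain Java (not CDK code; the class and method names are made up for illustration). Because every bond counts as one step, a breadth-first search gives the same answer Dijkstra's algorithm would give with all edge weights set to one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class BondDistance {

    /** Builds an adjacency list for atomCount atoms from two-atom bonds given as index pairs. */
    static List<List<Integer>> adjacency(int atomCount, int[][] bonds) {
        List<List<Integer>> neighbours = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            neighbours.add(new ArrayList<>());
        }
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }
        return neighbours;
    }

    /** Returns the number of bonds on the shortest path, or -1 if the atoms are not connected. */
    static int bondsApart(List<List<Integer>> neighbours, int start, int end) {
        int[] distance = new int[neighbours.size()];
        Arrays.fill(distance, -1);
        distance[start] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int atom = queue.remove();
            if (atom == end) {
                return distance[atom];
            }
            for (int next : neighbours.get(atom)) {
                if (distance[next] == -1) { // not visited yet
                    distance[next] = distance[atom] + 1;
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The carbon skeleton of a six-membered ring: 0-1-2-3-4-5-0
        int[][] ring = { {0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 0} };
        List<List<Integer>> graph = adjacency(6, ring);
        System.out.println(bondsApart(graph, 0, 3)); // prints 3
    }
}
```

Note that this only works because each bond connects exactly two atoms; a bond spanning six atoms has no obvious place in an adjacency list like this one.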

When discussing FlexMol, Rich mentions the work by Dietz (DOI:10.1021/ci00027a001), but I would like to add to this the PhD thesis of S. Bauerschmidt (see DOI: 10.1021/ci9704423), done in Gasteiger's group.
Dropping this 'two-atom bond' representation in favor of something that better describes compounds like ferrocene, like the Dietz and Bauerschmidt approaches, has the unfortunate disadvantage of losing compatibility with graph theory algorithms. Nevertheless, in order to take chemoinformatics to the next level, we have to address these issues. But hope is not lost, and people are working on rewriting our toolkit of chemoinformatics algorithms to match such new representations.


I will postpone analyzing the CDK for compatibility with such more modern representations (look out for a CDK News article), and for now just describe how the CDK can be used for FlexMol/Dietz/Bauerschmidt representations. Consider the four examples Rich gives in his blog. Here are the CDK ways of doing the same.

For example, 1,3,5-cyclohexatriene:

public IMolecule makeCycloHexaTriene() {
IMolecule cyclohexatriene = builder.newMolecule();

IAtom atomC0 = builder.newAtom(Elements.CARBON);
atomC0.setID("C0"); atomC0.setHydrogenCount(1);
IAtom atomC1 = builder.newAtom(Elements.CARBON);
atomC1.setID("C1"); atomC1.setHydrogenCount(1);
IAtom atomC2 = builder.newAtom(Elements.CARBON);
atomC2.setID("C2"); atomC2.setHydrogenCount(1);
IAtom atomC3 = builder.newAtom(Elements.CARBON);
atomC3.setID("C3"); atomC3.setHydrogenCount(1);
IAtom atomC4 = builder.newAtom(Elements.CARBON);
atomC4.setID("C4"); atomC4.setHydrogenCount(1);
IAtom atomC5 = builder.newAtom(Elements.CARBON);
atomC5.setID("C5"); atomC5.setHydrogenCount(1);

IBond bondB0 = builder.newBond(atomC0, atomC1, 1.0);
IBond bondB1 = builder.newBond(atomC1, atomC2, 2.0);
IBond bondB2 = builder.newBond(atomC2, atomC3, 1.0);
IBond bondB3 = builder.newBond(atomC3, atomC4, 2.0);
IBond bondB4 = builder.newBond(atomC4, atomC5, 1.0);
IBond bondB5 = builder.newBond(atomC0, atomC5, 2.0);

cyclohexatriene.addAtom(atomC0); cyclohexatriene.addAtom(atomC1);
cyclohexatriene.addAtom(atomC2); cyclohexatriene.addAtom(atomC3);
cyclohexatriene.addAtom(atomC4); cyclohexatriene.addAtom(atomC5);

cyclohexatriene.addBond(bondB0); cyclohexatriene.addBond(bondB1);
cyclohexatriene.addBond(bondB2); cyclohexatriene.addBond(bondB3);
cyclohexatriene.addBond(bondB4); cyclohexatriene.addBond(bondB5);

return cyclohexatriene;
}

The key thing is to use the IBond.setElectronCount() method. The call is sort of redundant here, as the CDK defaults to two electrons if not explicitly given. This compound is, of course, benzene, which we can also represent like this:

public IMolecule makeBenzene() {
IMolecule benzene = builder.newMolecule();

IAtom atomC0 = builder.newAtom(Elements.CARBON);
atomC0.setID("C0"); atomC0.setHydrogenCount(1);
IAtom atomC1 = builder.newAtom(Elements.CARBON);
atomC1.setID("C1"); atomC1.setHydrogenCount(1);
IAtom atomC2 = builder.newAtom(Elements.CARBON);
atomC2.setID("C2"); atomC2.setHydrogenCount(1);
IAtom atomC3 = builder.newAtom(Elements.CARBON);
atomC3.setID("C3"); atomC3.setHydrogenCount(1);
IAtom atomC4 = builder.newAtom(Elements.CARBON);
atomC4.setID("C4"); atomC4.setHydrogenCount(1);
IAtom atomC5 = builder.newAtom(Elements.CARBON);
atomC5.setID("C5"); atomC5.setHydrogenCount(1);

IBond bondB0 = builder.newBond(atomC0, atomC1);
IBond bondB1 = builder.newBond(atomC1, atomC2);
IBond bondB2 = builder.newBond(atomC2, atomC3);
IBond bondB3 = builder.newBond(atomC3, atomC4);
IBond bondB4 = builder.newBond(atomC4, atomC5);
IBond bondB5 = builder.newBond(atomC0, atomC5);

IBond bondingSystem = builder.newBond();
bondingSystem.setElectronCount(6); // the six delocalized pi electrons
bondingSystem.setAtoms(new IAtom[] { atomC0, atomC1, atomC2,
atomC3, atomC4, atomC5 });

benzene.addAtom(atomC0); benzene.addAtom(atomC1);
benzene.addAtom(atomC2); benzene.addAtom(atomC3);
benzene.addAtom(atomC4); benzene.addAtom(atomC5);

benzene.addBond(bondB0); benzene.addBond(bondB1);
benzene.addBond(bondB2); benzene.addBond(bondB3);
benzene.addBond(bondB4); benzene.addBond(bondB5);
benzene.addBond(bondingSystem);

return benzene;
}

This version represents the delocalized aromatic pi-system as one IBond: one with 6 electrons, and 6 associated atoms.
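As a sanity check on the bookkeeping, the sketch below uses a tiny hypothetical Bond class (plain Java, deliberately not the CDK IBond interface) to compare the two descriptions. Assuming two electrons per single bond and four per double bond, both should account for the same number of bonding electrons between the ring carbons.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelocalizedBondSketch {

    /** Hypothetical stand-in for a bond: any number of atoms sharing a number of electrons. */
    static final class Bond {
        final List<String> atomIds;
        final int electronCount;

        Bond(int electronCount, String... atomIds) {
            this.electronCount = electronCount;
            this.atomIds = Arrays.asList(atomIds);
        }

        /** True when the bond no longer maps onto a single graph edge. */
        boolean isMultiCenter() {
            return atomIds.size() > 2;
        }
    }

    /** Ring with alternating single (2 electron) and double (4 electron) bonds. */
    static List<Bond> kekule() {
        List<Bond> bonds = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            int electrons = (i % 2 == 0) ? 2 : 4;
            bonds.add(new Bond(electrons, "C" + i, "C" + ((i + 1) % 6)));
        }
        return bonds;
    }

    /** Six two-electron ring bonds plus one six-atom, six-electron bonding system. */
    static List<Bond> delocalized() {
        List<Bond> bonds = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            bonds.add(new Bond(2, "C" + i, "C" + ((i + 1) % 6)));
        }
        bonds.add(new Bond(6, "C0", "C1", "C2", "C3", "C4", "C5"));
        return bonds;
    }

    static int totalBondingElectrons(List<Bond> bonds) {
        int total = 0;
        for (Bond bond : bonds) {
            total += bond.electronCount;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalBondingElectrons(kekule()));      // prints 18
        System.out.println(totalBondingElectrons(delocalized())); // prints 18
    }
}
```

The electron count is the same either way; what changes is that the last bond in the delocalized list touches six atoms and so is no longer an edge in the graph sense.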

The cyclopentadienyl anion is represented similarly:

public IMolecule makeCycloPentadienylAnion() {
IMolecule cp = builder.newMolecule();

IAtom atomC0 = builder.newAtom(Elements.CARBON);
atomC0.setID("C0"); atomC0.setHydrogenCount(1);
IAtom atomC1 = builder.newAtom(Elements.CARBON);
atomC1.setID("C1"); atomC1.setHydrogenCount(1);
IAtom atomC2 = builder.newAtom(Elements.CARBON);
atomC2.setID("C2"); atomC2.setHydrogenCount(1);
IAtom atomC3 = builder.newAtom(Elements.CARBON);
atomC3.setID("C3"); atomC3.setHydrogenCount(1);
IAtom atomC4 = builder.newAtom(Elements.CARBON);
atomC4.setID("C4"); atomC4.setHydrogenCount(1);

IBond bondB0 = builder.newBond(atomC0, atomC1);
IBond bondB1 = builder.newBond(atomC1, atomC2);
IBond bondB2 = builder.newBond(atomC2, atomC3);
IBond bondB3 = builder.newBond(atomC3, atomC4);
IBond bondB4 = builder.newBond(atomC4, atomC0);

IBond bondingSystem = builder.newBond();
bondingSystem.setElectronCount(6); // the anion also has six pi electrons
bondingSystem.setAtoms(new IAtom[]{ atomC0, atomC1, atomC2, atomC3, atomC4});

cp.addAtom(atomC0); cp.addAtom(atomC1);
cp.addAtom(atomC2); cp.addAtom(atomC3);
cp.addAtom(atomC4);

cp.addBond(bondB0); cp.addBond(bondB1);
cp.addBond(bondB2); cp.addBond(bondB3);
cp.addBond(bondB4); cp.addBond(bondingSystem);

return cp;
}

And the final step in this series is ferrocene:

public IMolecule makeFerrocene() {
IMolecule ferrocene = builder.newMolecule();

IAtom atomC0 = builder.newAtom(Elements.CARBON);
atomC0.setID("C0"); atomC0.setHydrogenCount(1);
IAtom atomC1 = builder.newAtom(Elements.CARBON);
atomC1.setID("C1"); atomC1.setHydrogenCount(1);
IAtom atomC2 = builder.newAtom(Elements.CARBON);
atomC2.setID("C2"); atomC2.setHydrogenCount(1);
IAtom atomC3 = builder.newAtom(Elements.CARBON);
atomC3.setID("C3"); atomC3.setHydrogenCount(1);
IAtom atomC4 = builder.newAtom(Elements.CARBON);
atomC4.setID("C4"); atomC4.setHydrogenCount(1);
IAtom atomC5 = builder.newAtom(Elements.CARBON);
atomC5.setID("C5"); atomC5.setHydrogenCount(1);
IAtom atomC6 = builder.newAtom(Elements.CARBON);
atomC6.setID("C6"); atomC6.setHydrogenCount(1);
IAtom atomC7 = builder.newAtom(Elements.CARBON);
atomC7.setID("C7"); atomC7.setHydrogenCount(1);
IAtom atomC8 = builder.newAtom(Elements.CARBON);
atomC8.setID("C8"); atomC8.setHydrogenCount(1);
IAtom atomC9 = builder.newAtom(Elements.CARBON);
atomC9.setID("C9"); atomC9.setHydrogenCount(1);
IAtom iron = builder.newAtom(Elements.IRON);
iron.setID("Fe10"); iron.setHydrogenCount(0);

IBond bondB0 = builder.newBond(atomC0, atomC1);
IBond bondB1 = builder.newBond(atomC1, atomC2);
IBond bondB2 = builder.newBond(atomC2, atomC3);
IBond bondB3 = builder.newBond(atomC3, atomC4);
IBond bondB4 = builder.newBond(atomC4, atomC0);
IBond bondB5 = builder.newBond(atomC5, atomC6);
IBond bondB6 = builder.newBond(atomC6, atomC7);
IBond bondB7 = builder.newBond(atomC7, atomC8);
IBond bondB8 = builder.newBond(atomC8, atomC9);
IBond bondB9 = builder.newBond(atomC9, atomC5);

// multi-atom bonding systems: each ring with the iron, and both rings together
IBond bondingSystem1 = builder.newBond();
bondingSystem1.setAtoms(new IAtom[] {
atomC0, atomC1, atomC2, atomC3, atomC4, iron
});
IBond bondingSystem2 = builder.newBond();
bondingSystem2.setAtoms(new IAtom[] {
atomC5, atomC6, atomC7, atomC8, atomC9, iron
});
IBond bondingSystem3 = builder.newBond();
bondingSystem3.setAtoms(new IAtom[]{
atomC0, atomC1, atomC2, atomC3, atomC4,
atomC5, atomC6, atomC7, atomC8, atomC9
});

ferrocene.addAtom(atomC0); ferrocene.addAtom(atomC1);
ferrocene.addAtom(atomC2); ferrocene.addAtom(atomC3);
ferrocene.addAtom(atomC4); ferrocene.addAtom(atomC5);
ferrocene.addAtom(atomC6); ferrocene.addAtom(atomC7);
ferrocene.addAtom(atomC8); ferrocene.addAtom(atomC9);
ferrocene.addAtom(iron);

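ferrocene.addBond(bondB0); ferrocene.addBond(bondB1);
ferrocene.addBond(bondB2); ferrocene.addBond(bondB3);
ferrocene.addBond(bondB4); ferrocene.addBond(bondB5);
ferrocene.addBond(bondB6); ferrocene.addBond(bondB7);
ferrocene.addBond(bondB8); ferrocene.addBond(bondB9);
ferrocene.addBond(bondingSystem1); ferrocene.addBond(bondingSystem2);
ferrocene.addBond(bondingSystem3);

return ferrocene;
}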

Now, you will note that this approach does not exactly follow Rich's FlexMol examples: I skipped the atom pair concepts in the FlexMol version of ferrocene. His example more closely follows what we are likely to draw, while the CDK code above more closely follows the molecular orbital concept. (I have to check to see how Dietz and Bauerschmidt did this.)

As said, the real trick is to have a chemoinformatics toolkit that can work with this representation, but I will save that for later. At least our algorithms to calculate the molecular mass should work ;)

Thursday, December 21, 2006

Updated Chemical Blogspace Layout and Software

Last night I upgraded the software behind Chemical blogspace to the version online on Google Code, though I needed help from Euan to get paper titles correctly picked up for ACS journals. The number of working blogs is a bit down and now at 68, with an average number of 30 active blogs posting more than 100 blog items each day (see Zeitgeist). The new design looks quite nice compared to the old one:

Tuesday, December 19, 2006

Chemistry in HTML: Greasemonkey again

Here's a quick update on my blog about SMILES, CAS and InChI in blogs: Greasemonkey last Sunday. The original download was messed up :( You can download a new version at

This new version also supports "chem:compound", for any chemical. For example:
  • isopropyl alcohol

Remember that it only works for properly marked up content, as described in Including SMILES, CML and InChI in blogs. The HTML source code of the above example looks like (in RDFa):

<span xmlns:chem=""
class="chem:compound">isopropyl alcohol</span>

The current script only adds search links to PubChem and Google, but the possibilities are endless, and potentially very powerful. Here are some future ideas.

A link to predict NMR spectra using

Making a link to the website to predict 13C or 1H NMR from a SMILES, and likely an InChI too, is easy, if the website provides a URL to do this. (I will discuss this with Stefan.)

A popup window with the 3D structure in Jmol:

This would involve some more work, but it is most certainly possible too, given that we actually have a website around which allows downloading 3D coordinates given a SMILES or InChI. While a simple approach would be to make a popup with Jmol that takes the URL to that 3D coordinate website, it could be extended using Ajax to query the 3D structure first and, depending on success, show Jmol or a message "Could not find 3D coordinates".

Summarize molecular details hidden in CML:

This is likely the most exciting possibility. I blogged about CMLRSS many times now (check the AVI, the article, etc), and combining these two technologies will take the semantic, chemistry internet to the next level. CMLRSS describes how CML can be embedded in blog items (e.g. Blogging chemistry on), but it really works for any XHTML.

Consider this mockup: add CML content to your blog item, containing molecular properties, such as its NMR peaks, elemental analysis, etc. This will not show up in your blog item, so that the user is not bothered with implementation details. A userscript, however, will know about the CML content, as it has access to the whole content of the page. The visible text will mention the molecule for which the CML contains experimental or other details. Using the <span class="chem:compound"/> technology shown above, it is possible to link that compound to this CML bit (details to follow in this blog in January 2007). The userscript will then create, on the fly, a popup for the compound name in the visible text to show those experimental details.

How about that? Comments and other ideas are more than welcome!

Server side scripts:

Greasemonkey allows users to decide which scripts to run on a website, and which not. If you, as blogger or XHTML editor, want to force a script like the above to be run, that should be possible too: Greasemonkey scripts are written in JavaScript, so they can in principle be included on the server side as well. I might explore this option soon.

Sunday, December 17, 2006

SMILES, CAS and InChI in blogs: Greasemonkey

As follow up on my Including SMILES, CML and InChI in blogs blog last week, I had a go at Greasemonkey. Some time ago already, Flags and Lollipops and Nodalpoint showed with two cool mashups (one Connotea/Postgenomic and one Pubmed/Postgenomic) that user scripts are rather useful in science too. I can very much recommend the PubMed/Postgenomic mashup, as PubMed has several organic chemistry journals indexed too!

So, how does this relate to my blog of last week? Well, would it not be nice that if your blog uses the markup as suggested in that blog, that you automatically get links to PubChem and Google? That is now possible with a small GPL-ed Greasemonkey script called blogchemistry.user.js.

The Greasemonkey plugin requires Firefox to be installed. Once that is ready, install the script by clicking the link given earlier, and Greasemonkey will ask you if you want to install the script. Afterwards, check the output for this RDFa markup content:
  • a CAS registry number: 50-00-0
  • and an InChI: InChI=1/CH4/h1H4

It should look like the output for this blog item:

Note the superscript PubChem and Google links.

Update: there was something wrong with the download, which I just fixed (19th, at 8:45 CET). Please download once more to get it working properly.

Counting constitutional isomers from the molecular formula

We all know the combinatorial explosion when calculating the number of possible constitutional isomers (see wp:structural isomorphism) of a certain molecular formula. For example, C2H6 has only one constitutional isomer (ethane, InChI=1/C2H6/c1-2/h1-2H3), and C4H10 has only two. Especially, breaking symmetry by replacing one carbon by another element, or replacing a single by a double bond, increases the number sharply. For example, C7H16 has only nine constitutional isomers, while replacing two single bonds by two double bonds, creating C7H10, increases this number to 499! Then, replacing in the last formula, one carbon by an oxygen adds another few, totaling 747 isomers.

Now, C8H8NBr has at least 649 thousand constitutional isomers, and I am quite interested in being able to know the number of isomers beforehand, without having to generate the structures themselves (for example, using CDK's GENMDeterministicGenerator). InChI=1/C8H8BrN/c9-7-1-2-8-6(5-7)3-4-10-8/h1-2,5,10H,3-4H2 is one of the isomers.
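This does not count isomers, but one quantity that does follow from the molecular formula alone, and that every isomer shares, is the number of rings plus double bonds. A minimal sketch in plain Java, assuming the standard formula (2C + 2 + N - H - X) / 2 for compounds of C, H, N and halogens (oxygen does not change the value); the class and method names are made up for illustration:

```java
public class Unsaturation {

    /**
     * Rings plus double bonds (a triple bond counts as two) for a formula
     * with the given numbers of carbons, hydrogens, nitrogens and halogens.
     */
    static int ringsAndDoubleBonds(int carbons, int hydrogens, int nitrogens, int halogens) {
        return (2 * carbons + 2 + nitrogens - hydrogens - halogens) / 2;
    }

    public static void main(String[] args) {
        System.out.println(ringsAndDoubleBonds(7, 16, 0, 0)); // C7H16: prints 0
        System.out.println(ringsAndDoubleBonds(8, 8, 1, 1));  // C8H8NBr: prints 5
    }
}
```

For C8H8NBr the value is five, which matches the isomer given by the InChI above: a benzene ring (three double bonds and one ring) fused to a second ring.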

So, my question: is anyone aware of free code (in order of preference: 1. LGPL, 2. BSD/MIT, 3. opensource, 4. free) to calculate or estimate the number of constitutional isomers for a certain molecular formula? An estimate would already be nice. Ideally, I would implement this bit of code into the CDK, but otherwise, just knowing the number of isomers for C8H8NBr would be nice :)

Additionally, any relevant, recent literature recommendations are most welcome. I am aware of the use of polynomials, but the literature I have seen so far just focuses on molecules of a certain architecture, and is not able to come up with a guess based on the molecular formula alone.

Tuesday, December 12, 2006

Molecular Chemometrics

I just found out that a review article that I wrote earlier this year got printed: Molecular Chemometrics (DOI: 10.1080/10408340600969601), with my personal view on the interplay between chemoinformatics and chemometrics. The review discusses interesting developments in the last five years, and was fun writing (reading too, I think :). It has four major topics:
  • molecular representation (with 'molecular descriptors' and 'beyond the molecule')
  • chemical space, similarity and diversity
  • activity and property modeling (with 'dimension reduction' and 'model validation')
  • library searching, which mostly focuses on semantic web developments

Comments most welcome; just leave them below this blog item, or blog about the article yourself :)

Sunday, December 10, 2006

Including SMILES, CML and InChI in blogs

The blogs ChemBark and KinasePro have been discussing the use of SMILES, CML and InChI in Chemical Blogspace (with 70 chemistry blogs now!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.

Now, users of know how to add markup to their blogs to get PostGenomic to index discussed literature, websites and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into (which was put on lower priority because of finishing my PhD). basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of <span class="chemicalcompound">aspirin</span>.

And this is the way SMILES, CAS and InChI's can be tagged on blogs. The <span> element is HTML code to indicate a bit of similar content in HTML, and can, among many other things, be formatted differently than other text. However, this can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
  • for SMILES: <span class="smiles">CCO</span>
  • for CAS registry numbers: <span class="casnumber">50-00-0</span>
  • for InChI: <span class="inchi">InChI=1/CH4/h1H4</span>
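To show how cheaply such markup can be picked up by software, here is a minimal sketch in plain Java that pulls the three suggested microformats out of a piece of HTML. It is only an illustration: the class name is made up, and a regular expression is used for brevity where a real tool would rather use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChemSpanExtractor {

    // Matches <span class="smiles|casnumber|inchi">text</span> without nested markup.
    private static final Pattern SPAN = Pattern.compile(
        "<span\\s+class=\"(smiles|casnumber|inchi)\"\\s*>([^<]*)</span>");

    /** Returns {type, value} pairs in the order they occur in the HTML. */
    static List<String[]> extract(String html) {
        List<String[]> found = new ArrayList<>();
        Matcher matcher = SPAN.matcher(html);
        while (matcher.find()) {
            found.add(new String[] { matcher.group(1), matcher.group(2).trim() });
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Ethanol is <span class=\"smiles\">CCO</span>, and "
            + "formaldehyde has CAS number <span class=\"casnumber\">50-00-0</span>.</p>";
        for (String[] hit : extract(html)) {
            System.out.println(hit[0] + ": " + hit[1]);
        }
    }
}
```

From here it is a small step to turn each extracted value into a search link, or to collect them into a molecule-article database.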

The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
  • for SMILES: <span class="chem:smiles">CCO</span>
  • for CAS registry numbers: <span class="chem:casnumber">50-00-0</span>
  • for InChI: <span class="chem:inchi">InChI=1/CH4/h1H4</span>

which requires you to register the namespace xmlns:chem="" somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.

Saturday, December 09, 2006

H-index in chemoinformatics

Peter blogged about the h-index, which is a measure of one's scientific impact. He used Google Scholar, but I do not feel that that database is clean enough. I believe a better source would be the ISI Web-of-Science.
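For those unfamiliar with it: an author has h-index h if h of his or her papers have each been cited at least h times. A minimal sketch of that calculation in plain Java, with made-up citation counts and a made-up class name:

```java
import java.util.Arrays;

public class HIndex {

    /** Returns the largest h such that h papers have at least h citations each. */
    static int hIndex(int[] citations) {
        int[] sorted = citations.clone();
        Arrays.sort(sorted); // ascending order
        int n = sorted.length;
        for (int i = 0; i < n; i++) {
            int papersFromHere = n - i; // papers with at least sorted[i] citations
            if (sorted[i] >= papersFromHere) {
                return papersFromHere;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Invented citation counts for five papers
        System.out.println(hIndex(new int[] { 10, 8, 5, 4, 3 })); // prints 4
    }
}
```

This also shows why query ambiguity matters: a few wrongly attributed, well-cited papers can push the value up.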

Therefore, I composed a list of h-indices of my own, ordered by value. The choice of authors is biased to the Blue Obelisk and the CDK, has some personal touches (Buydens and Wehrens are my PhD supervisors) and some names that put the rest into perspective:


Of course, there are many comments on this. Like any measurement, take into account the error. Sources of error include, but are not limited to, ambiguity in the query. The most notable example of this, I think, is Andreas Bender; I don't think he has been that successful :) Also, Rajarshi Guha's h-index was reported as 6, but the list included two articles from the 70s and 80s, which I do not think are actually really his.

Feel free to suggest other names, query corrections, tips, and I will add or work on those too.

Wednesday, December 06, 2006

Chemo::Blogs #2

Because no one picked up my Chemo::Blogs suggestion, I will now officially claim the blog series title. However, unlike the original Bio::Blogs series, I will not summarize interesting blogs, but just spam you with websites I recently marked as toblog on

Semantics and Text Mining

Evan Prodromou wrote about RDFa vs microformats. The latter are commonly used in enhancing blog semantics, and for example used by While RDFa is more explicit, e.g. by using namespaced markup, we have to wait until XHTML2 to see it working. I do not think chemists are using tags a lot yet, but let me propose the following microformats: <span class="inchi">1/CH4/h1H4</span> and <span class="chemicalcompound">methane</span>. Standard JavaScripts and CSS scripts will then do the rest. (Think: addressing newlines, auto googling-for-inchi, etc).

The reason why using microformats is interesting is text mining, of various kinds. Whether it is setting up a molecule-article link database, or finding hot molecules in blogspace, adding semantics will help tools like OSCAR3 to mine chemistry. Some time ago OTMI was proposed by Nature, and they have now set up a dedicated web site to explain their view on text mining. Zack Rosen has a good idea why RDF Semantic web research isn't working.


There are a few new chemistry blogs I want to mention (and already added to Chemical blogspace): ChemBark, lirico which has an interesting chemoinformatics section, and The Curious Wavefunction. Worth reading indeed.

Pierre's YOKOFAKUN deserves a paragraph of his own. He recently blogged about bio2rdf which provides an RDF interface to biochemical knowledge via Life Science Identifiers (LSID), OBOEdit which is a Java based ontology editor, and Amadea which is a Taverna and KNIME like tool for setting up UNIX pipes.

Online EMBL Symposium

A few EMBL PhD students are having the First Online EMBL PhD Symposium (catchy name, or ... ;) Anyway, discussions are held on IRC, and it has a rather interesting Web2.0 session. All media is available on the website but requires registration right now. After the conference it will become open access to all. Jean-Claude contributed The UsefulChem Project: Open Source Chemistry Research using Blogs and Wikis to the Participants' Contributions section, and I did have a poster on Distributing molecular information over the Internet, discussing CMLRSS, blog aggregators, CML and other things. The IRC session was logged and is available here.


Finally, I want to mention three recent articles. First one is a recent write up by Bourne and Friedberg about Ten Simple Rules for Selecting a Postdoctoral Position (DOI: 10.1371/journal.pcbi.0020121). With the end of my current postdoc position nearing, rather useful reading. Some time ago I blogged about a New open access journal Source Code for Biology and Medicine, and the journal is now up and running. Details can be read in the first editorial (DOI: 10.1186/1751-0473-1-1). The third article I would like to mention is Scientific Software Development Is Not an Oxymoron by Baxter (DOI: 10.1371/journal.pcbi.0020087), though I do not think it has new insights.

OK, this was a rather lengthy write up, but really needed to clean up my toblog section :)

The power of big numbers

Contributions to open data do not have to be large, as long as many people are doing it. The Wikipedia is a good example, and PubChem accepts contributions of small databases too (I think). The result can still be large and rather useful, even scientifically.

The latter was recently written down in the paper Internet-based monitoring of influenza-like illness (ILI) in the general population of the Netherlands during the 2003–2004 influenza season by Marquet et al. (DOI: 1471-2458/6/242). The data was provided by Internet users via The Great Influenza Survey website. The article states that the sum of all those small contributions (anonymous website users are asked to fill out a weekly form), yields reliable data. The user is rewarded by colorful pictures, such as:

If all chemists and biochemists would add information about or properties of one molecule or metabolite to the Wikipedia each month, one or more commercial database companies will have to change their business model soon. Oh, you already can start doing this here.

Tuesday, November 28, 2006

Code coverage: making sure your code is tested

Recently I discussed JUnit testing from within Eclipse, and blogged at several occasions about it in other situations. I cannot stress enough how useful unit testing is: it adds this extra set of eyeballs to make bugs shallow. And it does that, indeed.

Ensuring that you actually test all the code you write, however, is not easy. A couple of years back I read an article about Hansel, which does code coverage checking, but never got it nicely working for the CDK project. Never looked at that lately, so no idea how the current release would work out. Hansel is an extension of JUnit, and requires hard coding class names, which conflicts with CDK's module setup.

Thomas Kuhn pointed me last week to Emma, which seems a nice tool. It does not require hacking our source, and generates cool HTML:

And even highlights the source code:

BTW, I seem to be in good company: Classpath is using it too.

Below is the command I issued to generate the HTML output. Rajarshi, maybe this can be integrated into Nightly? Note that it only runs the tests for the data module:

ant dist-large dist-test-large
java -cp ~/tmp/emma-2.0.5312/lib/emma.jar emmarun -cp develjar/junit.jar:dist/jar/cdk-svn-20061128.jar:dist/jar/cdk-test-svn-20061128.jar -r html -sp src junit.textui.TestRunner org.openscience.cdk.test.MdataTest

Tuesday, November 14, 2006

German Conference on Chemoinformatics 2006: Day 3

Just some short, quick notes about the third day (see day 1 and 2). Today's program of the German Conference on Chemoinformatics started with a presentation by Rzepa about his work on a semantic wiki (DOI:10.1021/ci060139e), which might be online here. (He recorded a podcast, but I have not seen it online yet.) I wish I could see the sources of those wiki pages, to see how that system integrates RDF, but at least Jmol is running fine. The presentation by Couch showed the status of the Materials Grid project, and how a guy called AgentX does all the hard work. Ihlenfeldt updated us about the status of PubChem, and mostly on what they had to do to keep the system from dying from its own success, for example using something called minimol. Googling does not seem to help, as that points to a number of things, but not any PubChem webpage. I am still waiting for a European organization to set up a mirror.

After the coffee break, Kuhn showed a coarse grained force field, approximating molecules by chopping them up into fragments of 3-10 heavy atoms. I guess, a bit like some small molecule force fields do for methyls. Fragments within a molecule are tied together by springs, and intra- and intermolecular force field parameters are obtained by running MD runs on fragment pairs. Varnek argued that QSPR for melting point prediction has reached a fundamental limit, with an RMSE of around 30 to 40 degrees Celsius, which makes it quite unreasonable to decide whether a compound with a predicted melting point of 40 degrees is solid or liquid at room temperature.

You have to forgive me for not reporting on the afternoon session; I was tied up talking with people at our booth, talking about the CDK, Taverna, Bioclipse, Jmol, other opensource chemoinformatics tools, and chemoinformatics in general. Very nice, but exhausting. I might advise the organization to set up a blog aggregator next year, though I am not sure whether there are others blogging about this conference.

Monday, November 13, 2006

German Conference on Chemoinformatics 2006: Day 1 and 2

The 2nd German Conference on Chemoinformatics started yesterday, with two chemoinformatics tutorials: one on industrial chemoinformatics (I saw this presentation before... not sure when), with a good overview on integrating different information sources; the second one was about opensource chemoinformatics by Christoph Steinbeck (being involved in opensource chemoinformatics for almost 10 years now!), which included a Bioclipse demo (by me) and a demo by Thomas Kuhn on the CDK based chemoinformatics plugin to Taverna. Other opensource projects of the Blue Obelisk movement were mentioned and a few outside it too.

The conference is in honor of the life work of Prof. Gasteiger, who gave an overview of chemoinformatics in his group, Germany and Europe. He stressed the need for education in chemoinformatics, like in Obernai. He also highlighted that we, today, are still solving the same problem as 30 years ago. Which is true, which is why this channel is called Chem-bla-ics, trying to solve that problem. When asked if opensource chemoinformatics from the start would have addressed this, he replied that he requires people to cooperatively do research with his group; opensource clearly cannot enforce that.

Day 2

Today's program had a number of interesting presentations (I, unfortunately, missed the first presentation, so I have to visit that group soon now, to make up for that). Prof. Aires-de-Sousa showed his work on MOLMAP for mapping metabolic networks (KEGG really, see my earlier blog), and showed, just as proof of principle, classification of organisms based on this.

J. Weisser talked about docking, still an obligatory topic. This work really showed two new approaches: the first was the use of QM partial charges (the example showed an improvement in RMSD of a factor 10, not very statistical, but promising indeed); the second was the fact that water does not like to be in tight spots, because of reduced possibilities for hydrogen bonding. A concept common in understanding supramolecular phenomena, but I have not seen this applied to docking before. But I am no expert in that field. M. Wagner showed work on using KEGG data to estimate likely metabolites, and the use in reducing effects of metabolic degradation. T. Schroeter introduced me to Gaussian processes, a new data modeling method. Quite embarrassing to be introduced to these, being specialized in modeling methods for chemical problems.

The poster session was, as usual, really exhausting, talking to a lot of people. Having a booth at the exhibition on opensource chemoinformatics added a nice twist to this. I therefore skipped the FIZ-award winner lectures, so I hope someone else will blog about those.

One last note: Sun started releasing their Java platform under the GPL license. Jim, seems that they proved me wrong. The class library is still not GPL, but is expected to become licensed such somewhere in the first half of next year.

Sunday, November 12, 2006

Organic chemists can now tune properties without changing the molecular structure??

Paul Bracher and Joshua Finkelstein pointed my attention to a nice discussion in Nature on the future of chemistry, in What Chemists Want to Know, by Philip Ball. Paul and Joshua already reviewed it thoroughly, but I could not resist commenting on it too. Having chosen chemistry as specialization when I went to university, and with a minor in supramolecular chemistry, this is something I do relate to.

A main theme is whether chemistry is unexplored enough to justify further academic research and education. Ball's answer is yes, and he came up with six questions, of which I found this one most intriguing: what is the chemical basis of thought and memory? Interestingly, the article also discusses whether chemistry has become a mere tool for more interesting fields of research. The Nobel prize winners Ball interviewed do not think so.

One quote took me by surprise: Where is synthetic astronomy - changing the gravitational constant to see what effect that has on the properties of the Universe, and thus perhaps improving it? Well, I might have been out of synthetic organic chemistry for too long now, but this is not a quote I would like to be in Nature with. Is synthetic chemistry now able to modify the nature and strengths of bonds?? Can it actually change molecular properties without changing the connectivity?? Moreover, astronomers have changed the properties of objects in our universe: for years they have been reducing the mass of the Earth by sending off probes to other objects (satellites etc). Likewise, chemistry is not changing nature; it is just exploring all the compounds we have never had purified in our glassware yet. Synthesis is nothing like changing nature.

There is one other comment I would like to post here. I strongly agree that chemistry in itself is important to have as a separate educational and research topic at universities, simply because too many databases are, from a chemical point of view, messed up. For example, KEGG and the PDB are known to have many chemical errors, though these databases are rather important indeed. We need people around to educate others and point out those errors, if the life sciences are to have a future.

Tuesday, November 07, 2006

When is open source chemoinformatics successful?

Open source chemoinformatics has become a common phenomenon, though many projects are small in nature: source code is developed by only a few developers, or even in a closed manner and released when considered done. Within open source software there is room for distinguishing a subset of open development chemoinformatics, that is, Bazaar-like instead of Cathedral-like (see ESR's famous writing).

Measuring the importance of an open source project can be done by many measures, such as the number of people on the user and developers mailing lists, number of downloads, number of source lines of code [wp:SLOC], number of independent development locations, and rankings on, for example, SourceForge or Google. Just to name a few.

Scientific importance of an open source project can sometimes be measured by a citation index; that is, only when there is a landmark article for the project. Rasmol is such a project: a first article was published in 1995 (DOI:10.1016/S0968-0004(00)89080-5), and a follow up in 2000 (DOI:10.1016/S0968-0004(00)01606-6). The first was cited 1190 times, and the second 65 times (as stated on Web-of-Science). Quite successful indeed.

OK, it is not even 100+, but I am quite happy with the scientific impact of the CDK so far: the 2003 CDK article (DOI:10.1021/ci025584y) has been cited 24 times now, and the just published 2006 article (DOI:10.2174/138161206777585274) once.

Friday, November 03, 2006

Chemical Blogspace updates

Chemical Blogspace has been up and running fine for some time now. Since the start, the number of aggregated blogs has increased from 19 to 64, a number of which are hosted at ChemBlogs, a site where you can run a blog. Meanwhile, the number of cited papers went up to 186! The JACS is most popular so far, followed by the Angewandte Chemie Int. Ed.

As mentioned before, the software was taken, which has upgraded considerably and released new software since the author moved to Nature, but I have not found time to follow that upgrade yet :( The promised InChI support is still pending too.

Bioclipse Workshop: short but productive

The Bioclipse Workshop has ended and, for just three days, turned out to be quite productive. We have the first bits of scripting support for JavaScript using Rhino. At this moment the scripting plugin needs to explicitly depend on plugins to be able to access their classpath, but we plan to solve that. An example script:
// to have short identifiers
Array =;
String =;
msgBox =;
DbfetchServiceServiceLocator =;

// get data
service = new DbfetchServiceServiceLocator();
strarray = service.getUrnDbfetch().fetchData("refseq:NM_210721", "refseq", "raw");

// make readable
str = new String();
for (i = 0; i < Array.getLength(strarray); i++) {
  if (i != 0)
    str = str + ("\n");
  str = str + strarray[i];
}

// show

It's just a short example that uses webservice technology in Bioclipse to fetch a sequence.
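
The "make readable" step of the script can also be sketched in plain Java. This is only an illustration of the joining logic: the class and method names are hypothetical, and the input array is a made-up stand-in for whatever the web service returns.

```java
// Hypothetical helper, not part of Bioclipse: joins the lines returned by a
// fetch call into one newline-separated string, like the loop in the script.
final class SequenceText {

    static String joinLines(String[] lines) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i != 0) {
                text.append('\n'); // separator goes between lines, not after the last one
            }
            text.append(lines[i]);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up stand-in for the web service response
        String[] fetched = {"first line", "second line", "third line"};
        System.out.println(joinLines(fetched));
    }
}
```

A StringBuilder avoids creating a new string on every pass through the loop, which matters little for a single sequence record but is the more idiomatic form in Java.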

QSAR support

QSAR support is getting along too, with a new DescriptorProvider extension point in trunk/ and work is progressing on a wizard that allows selecting descriptors and a CDK backend. The output of the wizard is a matrix resource, for which we already have a rich editor. A JOELib plugin has been suggested, as it has a good deal of QSAR descriptors too; Jörg, interested in doing a tiny bit of Bioclipse hacking?

Full proceedings are available online.

Wednesday, November 01, 2006

The Bioclipse Workshop is in progress

The Bioclipse Workshop is in progress, and Ola is now leading a discussion about future releases and functionality. Proceedings are updated live, and presentation sheets will be available shortly.

Saturday, October 28, 2006

Opensource Chemistry and Opensource Chemoinformatics

The Blue Obelisk mailing list has seen an interesting discussion on ambiguity in the term 'open source', triggered by a study by Beth Ritter Guth. For example, Jean-Claude Bradley, who performs 'open source' science (see his Useful Chemistry blog), is not opposed to using closed source software, while the Blue Obelisk is about 'open source' software. This seemed contradictory, and Peter Murray-Rust [wp:en] wrote up a lengthy overview of the use of the term 'open'.

Now, I have been giving the 'open source' ambiguity some thought (well, for about a month or so...), and came to the following conclusions:

  1. open source has the exact same meaning in both Bradley-like open source chemistry, and BO-like open source chemoinformatics
  2. both have the same goal
  3. it's just the research topic that is different

Ad 1: same meaning of 'open source'

I think 'open source' just means that everyone has the right to reproduce (and distribute in the same or a modified shape) products created from the source.

In 'open source chemistry' (Bradley-like, sorry for the term :) the source consists of the details about the chemical reactions to perform, the product being the ability to run the whole reaction pathway.

In 'open source chemoinformatics' (Blue Obelisk-like) the source is the procedure that describes how to get from one set of bits to another, really quite like getting from one molecule to another. Chemoinformatics, being IT science, just makes it a lot easier to distribute the algorithm to do that. (Sure, CMLReact is getting along quite nicely.)

The analogy goes even further: neither science depends only on open source. Just as Bradley-like open source science allows embedding proprietary stuff (glassware, closed-source software, chemicals bought from Acros (now Fisher), ...), so does BO-like open source science, which uses tons of proprietary stuff too (computers, Sun's JVM, MS-Windows).

Ad 2: same goal

I can be short on this one. For both 'open source' initiatives the goal is to share knowledge and make science reproducible.

Ad 3: different topic

So, the confusion just came from the extent to which 'open source' tools are being used. Can you do open source science without using open source chemoinformatics? Sure. In a utopian situation, all tools and small bits are 'open source' (though some are agnostic to this). But the fact is that many Blue Obelisk members use 'closed source' tools all the time, even if they do not have to. At least everyone is doing 'open source' in their own specialisms, both in open source chemistry and in open source chemoinformatics.

I guess we should just stop shortening 'open source software', to remove any ambiguity in the term 'open source'. As a spin-off, this would make Bradley's work fit in nicely with ODOSOS: open data, open source, open standards.

Thursday, October 26, 2006

Running single JUnit tests in Eclipse

Unit testing is important when developing source code. JUnit provides a library to facilitate this in Java, and Eclipse has the functionality to run JUnit tests. Even better, it allows you to run single JUnit tests, even in debug mode:

Just open the Java class in your Package Explorer, right click on the JUnit method you want to run, then pick 'Run As' or 'Debug As', and then 'JUnit test'.

Wednesday, October 25, 2006

Being a good opensource user

There are many ways to contribute to opensource software (OSS), programming only being one of them. I develop OSS, but use OSS too. For example, I am a big user of the Linux kernel, the KDE desktop, Kubuntu, Debian (I have unstable in a chroot), Firefox, Eclipse, Classpath, and many, many others. What these have in common, is that I generally have no time to look into the source code of these projects. A small patch excluded, I am really a regular user of these projects.

However, I try not to leech (see also Peter's related comment on that): I care about these projects and, therefore, I file bug reports. Sometimes, I even join the developers and talk to them via the commonly used IRC channels and mailing lists. Every now and then I even get this itch, and then I do look up source code and contribute a patch. But filing bug reports is the least one can do, the least everyone should do.


Classpath is the GNU project to provide a free Java library, i.e. the set of java.* classes that come with the Sun JVM. It is not a virtual machine, though; for that, several opensource implementations are available, many of which use Classpath as library provider. They have a very nice chat channel at, called #classpath. Their wiki provides a platform for giving feedback on how well software runs. A bug tracking system (BTS) is available too. An overview of the bugs that I filed can be found at my account: bugreports+Classpath.

Needless to say, Classpath is important in making our Java based chemoinformatics truly opensource.


Things are different for Debian and Kubuntu: these are distributions and, except for some patching, are generally not involved in software development as done by upstream. However, they generally do appreciate knowing about bugs too, so there is some duplication of bug reports here.

That said, they do provide nice tools for bug reporting which work for all packages that they distribute. Debian has reportbug and Kubuntu has Launchpad. An overview of bugs I reported with Debian can be found at bugreports+debian. I do not have bug reports in Launchpad yet, but two can be found in mailing list archives, see bugreports+ubuntu.


I also tracked down two bugs I reported with KDE, see bugreports+KDE.


Surely, I filed many more bugs with many other projects. A long list of bug reports can be found on SourceForge. However, it does not seem possible to make an easy list of those :(

Wednesday, October 11, 2006

Are chemogenomics and proteochemometrics the same?

Joerg Wegner recently blogged about Chemogenomics: structuring the drug discovery process to gene families by C.J. Harris and A. P. Stevens in Drug Discov Today (DOI: 10.1016/j.drudis.2006.08.013). This review article provides a nice overview of a trend in mathematical modelling of the interaction of small organic molecules with proteins, often referred to as QSAR. What the article does not discuss, is the work by the group of Jarl Wikberg who coined the term proteochemometrics (see PubMed: 11342268).

Friday, October 06, 2006

Google's new search engine: /* Code Search */

Google has set up a new search engine specifically for source code: /* Code Search */. An important difference with their normal search engine is that it allows restricting your search by programming language, license, filename and package. I have not been able to figure out how to use 'package' yet, but the others are pretty clear. For example: AtomContainer license:LGPL lang:java should do it. The search results show filenames, licenses and programming languages.

Alternatively, you can use Koders, which is a source code search engine too. It has been around for quite some time now, and shows the copyright notice too. Additionally, Koders offers a plugin for Eclipse which adds a search 'view' which will show the HTML from the website in an editor window inside Eclipse.

Wednesday, October 04, 2006

Bioinformatics: Open Source or Open Access??

I have heard that bioinformatics is ahead of chemoinformatics. However, while preparing for a homology modeling course I gave this week at the CUBIC, I discovered that this is not necessarily the case. Open Access is really no issue there, with open access journals and many open access databases. But it is different when it comes down to open source software.

Below is a list of bioinformatics programs which are free for academic use, but not open: And this does not even include the many websites which do not offer the software behind them. And these programs cover several steps in the whole homology modeling process. Open source homology modeling is not possible at this moment :(

But, on the bright side, there are already some open source programs involved too: And protein structure viewers are hardly a problem at all; several open source viewers are available, among which Rasmol, PyMOL and Jmol.

In other words: we might not want to look at bioinformatics too much.

Thursday, September 28, 2006

CompLife'06 - Day 1

CompLife'06 started today in Cambridge, UK. About 80 people are attending the meeting, and topics range from systems biology to QSAR. This evening there was a free software session mostly focussing on opensource software. Twelve projects were presented, among which the CDK (by me) and Bioclipse (by Ola), in five minute presentations, and a two hour demo period during a reception (free speech and free beer :). We had our brand new fliers with us, as well as a large poster for some additional branding.

One research presentation compared a number of fingerprint implementations in a QSAR study, and CDK came out very well, beating a few commercial programs. The free software session was full of CDK, however, with AMBIT, iBabel, Bioclipse and KNIME mentioning the CDK.

The latter is really interesting: it's a workflow program just like Taverna or PipeLine Pilot, which is using the Eclipse RCP as starting point, just like Bioclipse. And like the other two, KNIME has CDK integration, at least for displaying structures.

Sunday, September 24, 2006

CDK Bug Squash Party - Day 5

Day 5 was formally the last day (see also the summaries of day 1, day 2 and day 3/4) of the Chemistry Development Kit Bug Squash Party (BSP). Miguel uploaded the last bits of his CDK PDBPolymer to CML to CDK PDBPolymer roundtripping functionality (closing a bug and a feature request in one go). I have not tested this first hand yet, but am looking forward to playing with this bit of code. Kai continued to work on the more difficult bits of the code refactoring, resulting in fewer though more comprehensive commits. Stefan fixed another bug in JChemPaint: the rendering of implicit hydrogens.

About the last, the Renderer2D needs a serious overhaul. That is, a complete rewrite in proper Java2D, which can use affine transformations for zooming, scaling and fixing the coordinate system. The current code is ancient and predates Java2D. Rich's code might be a good starting point. I would love to do this rewrite, but lack the resources... anyone in need of some open source fame?
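
To give an idea of what such a rewrite could lean on, here is a minimal sketch using java.awt.geom.AffineTransform. It is not the Renderer2D code: the class name, the zoom factor and the screen origin are assumptions made up for this illustration. It maps model coordinates (y pointing up) to screen coordinates (y pointing down) in one transform.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical helper, not part of the CDK: builds one transform that zooms,
// flips the y axis and moves the model origin to a chosen screen position.
final class ModelToScreen {

    static AffineTransform create(double zoom, double originX, double originY) {
        AffineTransform transform = new AffineTransform();
        // Calls are concatenated, so the last one added is applied to a point first:
        transform.translate(originX, originY); // applied second: move to the screen origin
        transform.scale(zoom, -zoom);          // applied first: zoom and flip the y axis
        return transform;
    }

    public static void main(String[] args) {
        AffineTransform transform = create(10.0, 100.0, 200.0);
        Point2D atom = new Point2D.Double(1.0, 2.0); // made-up model coordinate
        System.out.println(transform.transform(atom, null));
    }
}
```

The same transform can be handed to Graphics2D.transform() before drawing, so the drawing code itself keeps working in model coordinates.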

I worked on atom typing, which is as yet largely untested, and often integrated with other bits of code. Yesterday I uploaded some first patches which I wrote on the train ride back to the Netherlands.

Now, what can be concluded from this BSP? The participant count was below what I had hoped for, but those who did participate worked hard (and with pleasure I hope :) The total number of JUnit tests has increased:
And so has the number of failing tests:

These plots were made with R from data created with two custom scripts both found in cdk/tools: and extractBugCountPlotData.bsh. Note that 96.86% of the tests do not fail!

The bump in failing tests seems to be due to commits 7010-7011, which have to do with SMILES parsing. Yes, the bond order resolving is still not solved. I don't seem to get Todd's patch for this working, but I am not giving up either. The bump is so large because quite a few JUnit tests use the SmilesParser as a quick tool to get a configured connection table. However, these tests should be replaced by explicit CDK models, which is easily done with the CDKSourceCodeWriter. I'll blog about how to use that soon.

Friday, September 22, 2006

CDK Bug Squash Party - Day 3 and 4

Because I was struggling hard with default values for cdk.interfaces fields, I did not have time to write up the Bug Squash Party report for day 3 (see also day 1 and day 2). But here it is.

Day 3

Kai worked hard on getting the cdk.interfaces API cleaned up, as agreed upon earlier. Christian added a test for the RMSD calculator (see getAllAtomRMSD()), and cleaned up his code a bit. Stefan continued his bug-squashing on JChemPaint and fixed another one or two bugs.
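
For readers who have not met it: an all-atom RMSD is the square root of the mean squared distance between corresponding atoms of two structures. The sketch below only illustrates that formula; it is not the code behind getAllAtomRMSD(), the class and method names are hypothetical, and it assumes both coordinate sets list the same atoms in the same order and are already superimposed.

```java
// Hypothetical illustration, not the CDK implementation.
final class RmsdSketch {

    /** Each row holds the x, y, z coordinates of one atom. */
    static double allAtomRmsd(double[][] first, double[][] second) {
        if (first.length == 0 || first.length != second.length) {
            throw new IllegalArgumentException("need two non-empty sets with equal atom counts");
        }
        double sumOfSquares = 0.0;
        for (int atom = 0; atom < first.length; atom++) {
            for (int axis = 0; axis < 3; axis++) {
                double difference = first[atom][axis] - second[atom][axis];
                sumOfSquares += difference * difference;
            }
        }
        return Math.sqrt(sumOfSquares / first.length);
    }

    public static void main(String[] args) {
        // Made-up coordinates: the second structure is the first shifted along z
        double[][] a = {{0, 0, 0}, {1, 0, 0}};
        double[][] b = {{0, 0, 1}, {1, 0, 1}};
        System.out.println(allAtomRmsd(a, b));
    }
}
```

A unit test for such a calculator can start from cases like these, where the answer is known by construction: identical structures give zero, and a rigid shift gives the length of the shift.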

Rajarshi uploaded a patch to set undefined atomic properties, like partial and formal charges and the implicit hydrogen count, to UNSET by default. However, this broke the CDK in many places, as apparently many class methods assume the default to be zero. After discussing the issue at the CUBIC, it turned out that this was sort of the intended, though undocumented, behavior: use the default Java values.
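
Why such a change breaks code that silently relies on zero can be shown with a small sketch. Everything in it is hypothetical: the class names are made up, and the sentinel value chosen for UNSET is an assumption for this illustration, not necessarily what the patch used.

```java
// Hypothetical illustration of zero defaults versus an explicit UNSET marker.
final class ChargeDefaults {

    // Assumed sentinel for "no value assigned"
    static final int UNSET = Integer.MIN_VALUE;

    /** Relies on the Java default: an int field starts at 0. */
    static final class ZeroDefaultAtom {
        int formalCharge;
    }

    /** Starts out explicitly undefined. */
    static final class UnsetDefaultAtom {
        int formalCharge = UNSET;
    }

    /** Sums charges the way code written against the zero default would. */
    static long naiveTotalCharge(int... charges) {
        long total = 0;
        for (int charge : charges) {
            total += charge;
        }
        return total;
    }

    /** Sums charges, skipping atoms whose charge was never set. */
    static long safeTotalCharge(int... charges) {
        long total = 0;
        for (int charge : charges) {
            if (charge != UNSET) {
                total += charge;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int zero = new ZeroDefaultAtom().formalCharge;
        int unset = new UnsetDefaultAtom().formalCharge;
        System.out.println(naiveTotalCharge(zero, 1));  // fine with the zero default
        System.out.println(naiveTotalCharge(unset, 1)); // nonsense once the default is UNSET
        System.out.println(safeTotalCharge(unset, 1));  // needs an explicit check
    }
}
```

Every method that adds up or compares such fields needs a check like the one in safeTotalCharge() before the default can change, which is why the patch touched so many places at once.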

And I added missing clone() methods, closing one bug on SourceForge, added files for Eclipse to know how to build the CDK with Ant (thanx to Nico for similar files for Jmol), and got CDK compiled again against Classpath.

Day 4

Miguel uploaded his first patches to support saving PDBPolymer data structures to CML and restoring them again, addressing an almost two-year-old bug. He created new cdk.interfaces for them, to address module dependencies, but a large set of JUnit tests is still missing.

Kai continued his cdk.interfaces refactoring, working on the more involved changes. Stefan, Tobias, and I worked on a poster and three three-fold flyers for our CDK booth at CompLife2006, so we have not been very productive in bug squashing. But we are happy with the result. Below is a screenshot of one side of the main CDK folder:

With 77 failing JUnit tests, and still too large a number of open bugs on SourceForge, there are plenty of things to do today.

Wednesday, September 20, 2006

CDK Bug Squash Party - Day 2

Like yesterday I will give a short overview of things done at the Chemistry Development Kit Bug Squash Party (BSP). I think Stefan was the only one to fix and close a bug report yesterday. Rajarshi added the MDE descriptor (yes, during a BSP new code might be committed too ;)

More interesting was the discussion on the developers mailing list about the patch by Todd Martin of the EPA to address deducing bond orders in SMILES parsing (the major source of currently open bugs!). A problem seems to be when his tool should be called in the SmilesParser class.

More details on the proceedings can be found on the BSP wiki page.

Monday, September 18, 2006

CDK Bug Squash Party - Day 1

I plan to do a daily coverage of the Chemistry Development Kit Bug Squash Party (BSP). While Stefan was busy getting the wiki machine back online after a hard-disc crash, Rajarshi, Miguel and I have been working hard. Miguel started to work on missing JUnit tests for bugs reported on SourceForge and Rajarshi fixed PMD, JavaDoc and other problems. I wrote 19 new JUnit tests and fixed two bugs, but with 44 bugs still open at SourceForge, there is quite some work to do. Luckily, several others will join in later this week.

As can be read on the BSP wiki page, there is work for everyone, on every level, and even for non-programmers. Or just stop by on CDK's IRC channel (link works with Konqueror, maybe other browsers too) to see what a BSP looks like from the inside.

Friday, September 15, 2006

Chemo::Blogs #1

There are a number of links I wanted to blog about, but never really had time for yet. Here's a short review of them. Bio::Blogs is a series of summary/review articles of bio related blogs, and definitely worth putting in your aggregator. Maybe someone is interested in setting up a Chemo::Blogs for chemistry blogs?

My (social bookmarking) network informed me about HTML Slidy, an XHTML based PowerPoint replacement. Being true XHTML, it allows embedding Jmol, JChemPaint and any other applet. Embed your pieces of CML, MathML and SVG (or any other namespace) and you no longer have data loss.

Nucleic Acids Research recently had a special issue on webservers (DOI:10.1093/nar/gkl385), in which Taverna was featured (DOI:10.1093/nar/gkl320). Just want to mention once more that Taverna has a chemoinformatics module: CDK-Taverna.

Day and Motherwell published the paper An Experiment in Crystal Structure Prediction by Popular Vote (DOI:10.1021/cg060313r). It links to an openaccess website where you can participate yourself. This is one way in which one can have tighter integration of the internet with old-fashioned publishing.

And some minor notes: a video tutorial was put online in this blog that shows how Jmol is inserted on a Moodle page. And, as Pierre reminded me, The Life Sciences Semantic Web is Full of Creeps! (DOI:10.1093/bib/bbl025), which puts me in an identity crisis: hacker, chemist or creep. Mmmm...