Thursday, September 28, 2006

CompLife'06 - Day 1

CompLife'06 started today in Cambridge, UK. About 80 people are attending the meeting, and topics range from systems biology to QSAR. This evening there was a free software session, mostly focusing on open source software. Twelve projects were presented in five-minute talks, among them the CDK (by me) and Bioclipse (by Ola), followed by a two-hour demo period during a reception (free speech and free beer :). We had our brand new fliers with us, as well as a large poster for some additional branding.

One research presentation compared a number of fingerprint implementations in a QSAR study, and CDK came out very well, beating a few commercial programs. The free software session was full of CDK, however, with AMBIT, iBabel, Bioclipse and KNIME mentioning the CDK.

The latter is really interesting: it's a workflow program just like Taverna or Pipeline Pilot, and it uses the Eclipse RCP as its starting point, just like Bioclipse. And like the other two, KNIME has CDK integration, at least for displaying structures.

Sunday, September 24, 2006

CDK Bug Squash Party - Day 5

Day 5 was formally the last day (see also the summaries of day 1, day 2 and day 3/4) of the Chemistry Development Kit Bug Squash Party (BSP). Miguel uploaded the last bits of his roundtripping functionality from CDK PDBPolymer to CML and back to CDK PDBPolymer (closing a bug and a feature request in one go). I have not tested this first hand yet, but am looking forward to playing with this bit of code. Kai continued to work on the more difficult bits of the code refactoring, resulting in fewer though more comprehensive commits. Stefan fixed another bug in JChemPaint: the rendering of implicit hydrogens.

About the last: the Renderer2D needs a serious overhaul. That is, a complete rewrite in proper Java2D, which can use affine transformations for zooming, scaling and fixing the coordinate system. The current code is ancient and predates Java2D. Rich's code might be a good starting point. I would love to do this rewrite, but lack the resources... anyone in need of some open source fame?

I worked on atom typing, which is still largely untested and often integrated with other bits of code. Yesterday I uploaded some first patches, which I wrote on the train ride back to the Netherlands.

Now, what can be concluded from this BSP? The participant count was below what I had hoped for, but those who did join worked hard (and with pleasure I hope :) The total number of JUnit tests has increased:
And so has the number of failing tests:

These plots were made with R from data created with two custom scripts, both found in cdk/tools: and extractBugCountPlotData.bsh. Note that 96.86% of the tests do not fail!

The bump in failing tests seems to be due to commits 7010-7011, which have to do with SMILES parsing. Yes, the bond order resolving is still not solved. I don't seem to get Todd's patch for this working, but I am not giving up either. The bump is so large because quite a few JUnit tests use the SmilesParser as a quick tool to get a configured connection table. However, these tests should be replaced by explicit CDK models, which is easily done with the CDKSourceCodeWriter. I'll blog about how to use that soon.

Friday, September 22, 2006

CDK Bug Squash Party - Day 3 and 4

Because I was struggling hard with default values for cdk.interfaces fields, I did not have time to write up the Bug Squash Party report for day 3 (see also day 1 and day 2). But here it is.

Day 3

Kai worked hard on getting the cdk.interfaces API cleaned up, as agreed upon earlier. Christian added a test for the RMSD calculator (see getAllAtomRMSD()), and cleaned up his code a bit. Stefan continued his bug-squashing on JChemPaint and fixed another one or two bugs.

Rajarshi uploaded a patch to set undefined atomic properties, like partial and formal charges and the implicit hydrogen count, to UNSET by default. However, this broke the CDK in many places, as apparently many class methods assume the default to be zero. After discussing the issue at the CUBIC, it turned out that this was sort of the intended, though undocumented, behavior: use the default Java values.
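To illustrate why such a change can break existing code, here is a minimal sketch. The class names and the UNSET sentinel value below are made up for this example and are not taken from the CDK sources; the only thing it relies on is that Java initializes an int field to 0 when nothing else is specified.

```java
// Illustrative sketch only: names and the sentinel value are hypothetical.
final class UnsetDefaultSketch {

    // Hypothetical sentinel meaning "no value has been assigned yet".
    static final int UNSET = Integer.MIN_VALUE;

    // Relies on the Java default: an int field starts out as 0.
    static final class AtomWithJavaDefault {
        int formalCharge;
    }

    // Explicitly marks the field as undefined until someone sets it.
    static final class AtomWithUnsetDefault {
        int formalCharge = UNSET;
    }

    // Code written under the "default is zero" assumption: sums blindly.
    static int totalChargeNaive(int[] charges) {
        int total = 0;
        for (int charge : charges) {
            total += charge;
        }
        return total;
    }

    // Code that is aware of the sentinel and skips undefined values.
    static int totalChargeChecked(int[] charges) {
        int total = 0;
        for (int charge : charges) {
            if (charge != UNSET) {
                total += charge;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] charges = {
            1,
            new AtomWithUnsetDefault().formalCharge,
            -1
        };
        System.out.println("naive:   " + totalChargeNaive(charges));
        System.out.println("checked: " + totalChargeChecked(charges));
    }
}
```

Every method that silently assumed zero would need a check like the one in totalChargeChecked(), which is why flipping the default touches so much code at once.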

And I added missing clone() methods, closing one bug on SourceForge, added files for Eclipse to know how to build the CDK with Ant (thanx to Nico for similar files for Jmol), and got CDK compiled again against Classpath.

Day 4

Miguel uploaded his first patches to support saving PDBPolymer data structures into, and restoring them again from, CML, addressing an almost two-year-old bug. He created new cdk.interfaces for them, to address module dependencies, but a large set of JUnit tests is still missing.

Kai continued his cdk.interfaces refactoring, working on the more involved changes. Stefan, Tobias, and I worked on a poster and three three-fold flyers for our CDK booth at CompLife2006, so we have not been very productive in bug squashing. But we are happy with the result. Below is a screenshot of one side of the main CDK folder:

With 77 failing JUnit tests, and still too many open bugs on SourceForge, there are plenty of things to do today.

Wednesday, September 20, 2006

CDK Bug Squash Party - Day 2

Like yesterday, I will give a short overview of things done at the Chemistry Development Kit Bug Squash Party (BSP). I think Stefan was the only one to fix and close a bug report yesterday. Rajarshi added the MDE descriptor (yes, during a BSP new code might be committed too ;)

More interesting was the discussion on the developers mailing list about the patch by Todd Martin of the EPA to address deducing bond orders in SMILES parsing (the major source of current open bugs!). A problem seems to be when his tool should be called in the SmilesParser class.

More details on the proceedings can be found on the BSP wiki page.

Monday, September 18, 2006

CDK Bug Squash Party - Day 1

I plan to do a daily coverage of the Chemistry Development Kit Bug Squash Party (BSP). While Stefan was working hard to get the wiki machine back online after a hard-disc crash, Rajarshi, Miguel and I have been working hard on the code. Miguel started to work on missing JUnit tests for bugs reported on SourceForge, and Rajarshi fixed PMD, JavaDoc and other problems. I wrote 19 new JUnit tests and fixed two bugs, but with 44 bugs still open at SourceForge, there is quite a bit of work to do. Luckily, several others will join in later this week.

As can be read on the BSP wiki page, there is work for everyone, on every level, and even for non-programmers. Or just stop by on CDK's IRC channel (link works with Konqueror, maybe other browsers too) to see what a BSP looks like from the inside.

Friday, September 15, 2006

Chemo::Blogs #1

There are a number of links I wanted to blog about, but never really had time for yet. Here's a short review of them. Bio::Blogs is a series of summary/review articles of bio-related blogs, and definitely worth putting in your aggregator. Maybe someone is interested in setting up a Chemo::Blogs for chemistry blogs?

My (social bookmarking) network informed me about HTML Slidy, an XHTML based PowerPoint replacement. Being true XHTML, it allows embedding Jmol, JChemPaint and any other applet. Embed your pieces of CML, MathML and SVG (or any other namespace) and you no longer have data loss.

Nucleic Acids Research recently had a special issue on webservers (DOI:10.1093/nar/gkl385), in which Taverna was featured (DOI:10.1093/nar/gkl320). Just want to mention once more that Taverna has a chemoinformatics module: CDK-Taverna.

Day and Motherwell published the paper An Experiment in Crystal Structure Prediction by Popular Vote (DOI:10.1021/cg060313r). It links to an open access website where you can participate yourself. This is one way in which one can have tighter integration of the internet with old-fashioned publishing.

And some minor notes: a video tutorial was put online in this blog that shows how Jmol is inserted on a Moodle page. And, as Pierre reminded me, The Life Sciences Semantic Web is Full of Creeps! (DOI:10.1093/bib/bbl025), which puts me in an identity crisis: hacker, chemist or creep. Mmmm...

Thursday, September 14, 2006

Complex PDB documents using the Bioclipse ChildResourceCreator

Some time ago I blogged about the ChildResourceCreator extension point in Bioclipse and hinted at using that for PDB files, which contain 3D molecular models, sequences and bibliographic information. Using the new extension point, Bioclipse now treats PDB files as complex documents, creating child resources for the 3D molecular model (using the CDK plugin), and a sequence resource (using the BioJava plugin).

Wednesday, September 13, 2006

"Jmol and the CDK add powerful chemical capabilities", says Munos in Nature Reviews Drug Discovery

Bernard Munos at Eli Lilly & Co. wrote up a lengthy analysis on open source in drug discovery in Nature Reviews Drug Discovery: Can open-source R&D reinvigorate drug research? (DOI:10.1038/nrd2131). When scanning the article I saw this quote:

Other tools such as eMolecules, Jmol or the Chemistry Development Kit are adding powerful chemical search and visualization capabilities to the open-source scientist's toolbox.

Unfortunately, the paper does not point to the correct CDK website, but to the CUBIC backend at Moreover, I don't think the quote does full justice to what the CDK has achieved in the past six years; I'm sure we have achieved more than a fingerprinter and some 2D and 3D rendering!

Friday, September 08, 2006

BioJava 1.5 beta released

Martin Szugat reported that a beta for BioJava 1.5 has been released. New features include: a new biojavax package with extensions to the basic functionality, such as the RichSequence.IOTools and the RichSequence object; a genetic algorithm library; features that allow manipulation of 3D structure files and objects; and non-HMM implementations of the NW and SW alignment algorithms. The announcement also mentions a new package for handling external processes (org.biojava.utils.process); I am wondering what that is about. I will upload this beta to Bioclipse trunk/bc_biojava/ shortly, so that we can play with it.

Thursday, September 07, 2006

Chemical Archeology: OSCAR3 to

Chemical Archeology (see Christoph's comment) is the process of extracting chemical information from old journal articles. Some time ago, Peter Corbett from the group of Peter Murray-Rust visited the CUBIC to talk to us about Oscar3 which can do just that. That day, we already hooked OPSIN into Bioclipse.

Oscar3, however, is capable of more than the name2structure of OPSIN (see also 10.1039/b411033a): it can take a plain text file with an experimental section with details on the synthesis of small organic compounds, and analyze the chemistry in that. This functionality has been available as an RSC authoring tool for some time now (see also 10.1039/b411699m). Unfortunately, what publishers put online (PDF and HTML) is much more difficult to process with Oscar3: those formats are often optimized for display, not for machine processing. The HTML can be cleaned up, but there is no general approach.

Christoph Steinbeck is going to present at the upcoming ACS meeting the use of Oscar3 for extraction of NMR spectra from old journal articles, in preparation for submission to the (see the abstract of CINF 101).

Since the full Oscar3 was not hooked into Bioclipse yet, I had some work to do. It took me some time to figure out how to properly configure Oscar3, and what additional things I had to do to clean up the HTML used by publishers to get Oscar3 to extract NMR spectra (thanx to PeterC for hints!). I also had to tweak the Oscar3 code itself here and there, but that's what opensource is about :) (Peter, if you are reading this: I have a number of patches for the Oscar3 code in bc_oscar; let me know if you're interested in them.)

This is the end result:

Note especially the hierarchy in the resource navigator on the left. The misc folder contains all the chemistry found in the article. More importantly, for six molecules it fully detected the experimental section! For 3-(2-Oxocyclooctanyl)-3-phenylpropan-1-al (InChI=1/C17H22O2/c18-13-12-15(14-8-4-3-5-9-14)16-10-6-1-2-7-11-17(16)19/h3-5,8-9,13,15-16H,1-2,6-7,10-12H2) it derived the molecular structure (with OPSIN), and a few spectra: H-NMR, high-resolution MS and IR.

So, if you attend the ACS meeting: make sure to visit Christoph's CINF 101 presentation!

Update: added missing tags and link to Christoph's comment on the origin of the 'chemical archeology' term.

Saturday, September 02, 2006

Calculating geometrical properties with the CDK

ケムインフォマティクスに虚空投げ runs a story on how to calculate geometrical properties of a 3D structure using CDK's ForceFieldTools. This class contains a few methods to calculate distances between atoms and angles between bonds.

This tools class is special as it uses vecmath GVector objects, which just contain atomic coordinates, likely suitable for extensive computation, as expected in CDK's force field implementation. However, for just calculating the distance and angles, there are simpler alternatives.

The distance between two atoms can be calculated with:

IAtom atom1 = molecule.getAtom(0);
IAtom atom2 = molecule.getAtom(1);
double dist = atom1.getPoint3d().distance(atom2.getPoint3d());

or, by constructing a vector for the bond first:

Vector3d bond1to2 = new Vector3d(atom2.getPoint3d());
bond1to2.sub(atom1.getPoint3d()); // now points from atom1 to atom2
double dist = bond1to2.length();

Using vectors to represent bonds (each with two atoms!) also makes it easy to calculate angles (assuming the bonds share atom1, and bond1to3 is constructed in the same way as bond1to2); the result is in radians:

double angle = bond1to2.angle(bond1to3);

Vecmath does not seem to contain a convenience method for calculating torsion angles :(
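A torsion angle is not hard to compute by hand, though. Below is a minimal sketch that uses plain double[3] arrays instead of vecmath, so that it runs without any dependency; the class and method names are made up for this example. It takes the three bond vectors along the chain a-b-c-d, builds the normals of the two planes (a,b,c) and (b,c,d) with cross products, and derives a signed angle with atan2. With vecmath the same steps should map onto Vector3d's sub(), cross(), dot() and length().

```java
// Illustrative sketch: names are hypothetical, not part of vecmath or the CDK.
final class TorsionSketch {

    // Component-wise difference u - v.
    static double[] sub(double[] u, double[] v) {
        return new double[] { u[0] - v[0], u[1] - v[1], u[2] - v[2] };
    }

    // Dot product of two 3D vectors.
    static double dot(double[] u, double[] v) {
        return u[0] * v[0] + u[1] * v[1] + u[2] * v[2];
    }

    // Cross product u x v.
    static double[] cross(double[] u, double[] v) {
        return new double[] {
            u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]
        };
    }

    /** Signed torsion angle for the chain a-b-c-d, in radians within [-pi, pi]. */
    static double torsion(double[] a, double[] b, double[] c, double[] d) {
        double[] b1 = sub(b, a);      // bond a -> b
        double[] b2 = sub(c, b);      // central bond b -> c
        double[] b3 = sub(d, c);      // bond c -> d
        double[] n1 = cross(b1, b2);  // normal of plane (a, b, c)
        double[] n2 = cross(b2, b3);  // normal of plane (b, c, d)
        double y = Math.sqrt(dot(b2, b2)) * dot(b1, n2);
        double x = dot(n1, n2);
        return Math.atan2(y, x);
    }

    public static void main(String[] args) {
        double[] a = { 1, 0, 0 };
        double[] b = { 0, 0, 0 };
        double[] c = { 0, 1, 0 };
        double[] d = { 1, 1, 0 };     // same side as a: a cis arrangement
        System.out.println(Math.toDegrees(torsion(a, b, c, d)));
    }
}
```

Note that the result is undefined when three consecutive atoms are collinear, because one of the plane normals then has zero length.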