Thursday, October 20, 2011

CDK & File Formats #1: MDL molfiles and bond order 4

I just had a conference call on one of the translational cheminformatics projects I am involved in: Bioclipse-OpenTox. A paper about this project has been submitted, and we are writing up a more practice oriented book chapter (almost done). In writing up a use case, we ran into a recurrent problem: proper cheminformatics handling of input files. Ola suggested to start writing more extensive documentation on what users of the CDK are supposed to do when reading a file. So, here I start a new series.

But before I start writing up how to work with MDL molfiles in the CDK, I like to stress three key design principles in the CDK:
  1. there is no single solution to everything (aka there are multiple solutions to the same problem)
  2. algorithms must be modular (the LEGO building block approach)
  3. the user is responsible for using the right blocks at the right time
These principles have a profound effect on the usability of the CDK. And here I must stress the point I made recently on usability: what happens if you neglect less abundant personas. Well, in this case, the abundancy is actually somewhat different. But, CDK libraries target a few personas: one of these is the scientist that uses cheminformatics as a mere tool, who doesn't know graph theory, let alone file formats; this personas works in the field of translational cheminformatics. Let's call him Tony. The other personas has a more extensive education in cheminformatics (e.g. former Gasteiger lab, Sheffield, former CAOS/CAMM (like me), etc) or are actively involved in cheminformatics research (like that of Christoph, and many, many more). This personas we will call Carry.

The CDK, with its limited resources, must target both Tony as well as Carry; both have widely different needs, and somewhere, in our free hours, we must work on solutions for both personas. Carry and Tony are, of course, exaggerations, and all readers (and me too) are linear combinations of these personas. In fact, you can even be a Carry in some parts of the CDK, while being a Tony in others. The CDK has so much functionality nowadays, that even I have unexplored corners.

And, practically, the building block approach is important to Carry (she might want to plug in her own aromaticity model), but a killer for Tony.

Missing information
A prominent, recurrent problem for the first personas, is to deal with missing information. And input files have missing information. Some formats are more explicit than others. This series will focus on these aspects, and discuss how the CDK can be used to add that missing information.

Another important aspect is that the CDK data model cannot hold all information. For example, I have no clue how the CDK should read MDL files with a muonium (browse the cdk-devel mailing list archives of this month). Here too, this is not always clear to Tony, who does not have the time to read file format specifications, nor CDK interfaces. He just wants the CDK to do its thing which it is supposed to be good at.

MDL formats
So, here we are. We have a MDL .mol file. They are pretty much the community standard, and even pretty Open too. You can now find the specs in the ctfile.pdf, readily available on the web. Actually, they are no longer called MDL formats, but Symyx formats, umm, Accelrys formats. These formats define a number of file formats, including the aforementioned Accelrys molfile, the Symyx SD file, but also query formats, used to store queries against their database software.

Like any file format, they support a number of features. For example, MDL files cannot represent a bond order 4, a quadruple bond. Organic chemistry doesn't need them. Moreover, hydrogens are often implicit, as they can easily be added later, and both memory and disk space is expensive (think 80-ies). Stereochemistry is wedge-bond-based and others.

Well, more recent MDL formats have become more powerful. The V3000 format can do much more then the V2000 format, or even the pre-V2000 format.

Now, the trigger for the start of this series is the bond order 4 in MDL files. Strictly speaking this is not part of the molfile format, nor of the SD file format; instead, it's part of the query format. However, the community ignored that part of the specification, and the molfiles and SD files are commonly using this query type to represent aromatic bonds. Well, or bonds that can be both single or double.

Now, the CDK does not have a structure to represent a single OR a double bond order. The CDK only has SINGLE, DOUBLE, TRIPLE, and QUADRUPLE (see IBond.Order). Moreover, it has a separate mechanism to indicate if a bond is aromatic. A flag is used for that. This allows the CDK to store both kekule-ized bond order localization and aromaticity perception information separately.

So, if a MDL molfile (or SD file) is read with the CDK in RELAXED mode with the MDLV2000Reader, order 4 bonds are read as SINGLE bonds with a flag indicating it is aromatic. If only the CDK had IBond.Order.UNKNOWN. This is scheduled for master. Mind you, this is a complicated patch, because a lot of algorithm introspect the bond order information, which will have to be updated to handle UNKNOWN bond orders.

This is a nasty co-incidence (interaction effect) killing Tony's use case: the MDL molfiles in the wild have information that the CDK cannot represent, sadly. Now, if users would have been paying for CDK releases, we could have assigned a developer on it. We have the mechanisms in place to buy a copy of the CDK, so that's not really an excuse (price negotiable; maybe 4999 SEK, but you are welcome to buy a campus-wide CDK version for more). If enough people / companies would do this, we could hire a developer to implement this particular use case. This would be a great way to support the project!

Anyway, the purpose of this series is not to rant about these things, but to describe how file formats might be read with the CDK. Here's a recipe with inline comments (in the Groovy syntax):

reader = new MDLV2000Reader(
  new File("data/azulene4.mol").newReader(),
azulene = Molecule());

// perceive atom types

// add missing hydrogens
adder = CDKHydrogenAdder.getInstance(

// if bond order 4 was present,
// deduce bond orders
dbst = new DeduceBondSystemTool();
azulene = dbst.fixAromaticBondOrders(azulene);

Carry would know this. In fact, Carry would probably have some comments on this recipe. Well, Carry can leave those in the comments of this post.