Sunday, March 13, 2011

Unfortunate (CDK) decisions

Sometimes you make decisions which turn out unlucky. Christoph, Dan, and I made one about 10 years ago (see this photo from Dan with his whiteboard where we wrote down our thoughts). The one I am talking about is not bad in itself, but the implementation is bad; bad in the sense that it makes it easy for other toolkits to make a difference in size and performance (see also Quo vadis, CDK?).

The reason why I bring this up, is that I am answering a question on the CCL mailing list on how to convert a XYZ file into an adjacency matrix, and my Groovy Cheminformatics did not have a section on graph matrices yet. And I just ran into the issue that the CDK XYZReader only accepts an IChemFile. Then I realized the book doesn't explain yet why that is either.

It basically comes down that chemical files can contain a wide variety of chemistry. MDL molfiles can contain a single Molecule, but also a set of molecules. Chemical documents in general, think Jmol, can also contain trajectories for cyclohexane (see this nice animation by Bob), etc, etc. And because the CDK is a general chemistry development toolkit, it has to support all. So, we have the IChemFile concept which corresponds to a chemical file, containing a sequence of one or more models, which may be if a single molecules (e.g. geometry optimizations), or multiple models (e.g. MDL SD files). Similarly, each model can contain a variety of objects: reactions, set of molecules, and crystals. Well, you get the point.

Now, at some point, it was decided that being able to convert file formats from one into another might in fact be interesting (think OpenBabel). That's where it went wrong (and if you start git blaming it would not surprise me if I am the center of all evil here). Just to keep it simple, it made sense to have all IO classes at least support ChemFile, because whatever the content, it could always be wrapped in an ChemFile.

So, the answer to the question on the CCL mailing list is a bit more involved than I wanted:
import org.openscience.cdk.*;
import org.openscience.cdk.config.*;
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.graph.matrix.*;
import org.openscience.cdk.graph.rebond.*;

reader = new XYZReader(
  new File("data/").newReader()
allContent = ChemFileManipulator.getAllAtomContainers( ChemFile())
ethanoicAcid = allContent.get(0)

factory = AtomTypeFactory.getInstance(
for (IAtom atom : ethanoicAcid.atoms()) {
RebondTool rebonder = new RebondTool(2.0, 0.5, 0.5);

int[][] matrix = AdjacencyMatrix.getMatrix(ethanoicAcid)

println "The adjaceny matrix:"
for (row=0;row<ethanoicAcid.getAtomCount();row++) {
  for (col=0;col<ethanoicAcid.getAtomCount();col++) {
    print matrix[row][col] + " "
  println ""
This code gives this output:
The adjaceny matrix:
0 1 1 1 1 0 0 0 
1 0 0 0 0 1 1 0 
1 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 
0 1 0 0 0 0 0 0 
0 1 0 0 0 0 0 1 
0 0 0 0 0 0 1 0 

Note that the RebondTool didn't pick up the bond order, but that's not needed for the adjacency matrix anyway.

Getting back on the unfortunate design decision: what would things have looked liked if we had typed our IChemObject reading and writing classes, allowing us to differentiate IChemObjectReader<IAtomContaienr> from IChemObjectReader<IChemFile>. It would have greatly reduced the size of simple programs, or something like the wished-for 50k JChemPaint applet. Right now, pulling in a single reader, will automatically pull in all IChemObject classes. Rather, I would just pull in IAtomContainer, which in itself is too large too. But that's for another blog post.

The key thing to remember is: don't be afraid to add some complexity if that allows you to make other things optional.


  1. A sortof related problem is reading in atomcontainers (or molecules) with a different storage representation.

    The VFLib port Asad was working on this week really needs an adjacency list representation. So something like a Map> to quickly get the neighbour indices of each atom index.

    Of course, it is possible to have an implementation (IndexedMolecule) that provides a quicker getConnectedAtomsList method, but then you have to go through each atom reference, and look it up in the atom list for its index.

    The problem, then is this : should IAtomContainer have a method "List getConnectedAtomIndices(int atomIndex)"? This then forces other implementations to provide this too.

    I have a feeling that these kinds of problems are why each project has its own Atom class and Molecule class :)

  2. Oh, and more on topic is the problem I've just realised - that you can't do this:

    ISimpleChemObjectReader reader = getReader(...);
    IMolecule molecule = Molecule());

    where 'getReader' is some function returning a reader based on (for example) a string or constant.

    The ChemObjectReader would be excellent in this case. Well, I think.

  3. What about the ReaderFactory, together with a StringReader?

    Regarding your other question... yes, I think we should add such a method, and move these methods to the AtomContainerManipulator or so... to make the AtomContainer class smaller.