Sunday, May 27, 2012

Finding where to put double bonds...

SMILES has a convenient feature: atoms from the organic subset can be written in lower case to mark them as aromatic, which implies a particular hybridization state. The locations of the double bonds are then not given explicitly, reflecting the delocalized nature of those systems. Benzene, for example, can be written as c1ccccc1.

However, there are many situations where you would like to know the positions of those double bonds, or at least one solution from the set of possible combinations, such as C1=CC=CC=C1 for benzene.

Finding the positions of the double bonds is one of the core algorithms in cheminformatics. The CDK has had a few algorithms for this for a long time: one looking at ring systems (DeduceBondSystemTool) and one tackling a more general problem (SaturationChecker). The first was recently found to be slow, caused by its use of the AllRingsFinder (which is slow because of the combinatorial number of ring combinations). The second never really worked that well, because it did not use the CDK atom type perception code.
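To make the problem a bit more concrete, here is a minimal sketch of double bond placement as a backtracking search. This is not the algorithm used by DeduceBondSystemTool, SaturationChecker, or any other CDK class; the class name KekuleSketch, the method assign, and the needsDouble flags are all made up for this illustration. It assumes you already know which atoms must end up with exactly one double bond, which is roughly the kind of information atom typing would have to provide.

```java
import java.util.Arrays;

/** Toy sketch (hypothetical, not CDK code): place double bonds by backtracking. */
class KekuleSketch {

    /**
     * @param atomCount   number of atoms
     * @param bonds       each entry is {atomA, atomB}
     * @param needsDouble true for atoms that must get exactly one double bond
     * @return per-bond flags (true = double bond), or null if no solution exists
     */
    static boolean[] assign(int atomCount, int[][] bonds, boolean[] needsDouble) {
        boolean[] isDouble = new boolean[bonds.length];
        boolean[] matched = new boolean[atomCount];
        return search(0, bonds, needsDouble, matched, isDouble) ? isDouble : null;
    }

    private static boolean search(int atom, int[][] bonds, boolean[] needsDouble,
                                  boolean[] matched, boolean[] isDouble) {
        // Skip atoms that need no double bond or already have one.
        while (atom < matched.length && (!needsDouble[atom] || matched[atom])) atom++;
        if (atom == matched.length) return true; // every atom is satisfied

        // Try each bond of this atom whose other end still needs a double bond.
        for (int i = 0; i < bonds.length; i++) {
            int a = bonds[i][0], b = bonds[i][1];
            if (a != atom && b != atom) continue;
            int other = (a == atom) ? b : a;
            if (!needsDouble[other] || matched[other]) continue;

            matched[atom] = matched[other] = true;
            isDouble[i] = true;
            if (search(atom + 1, bonds, needsDouble, matched, isDouble)) return true;
            // Dead end: undo this choice and try the next bond.
            matched[atom] = matched[other] = false;
            isDouble[i] = false;
        }
        return false;
    }

    public static void main(String[] args) {
        // Benzene, c1ccccc1: six ring atoms, all needing one double bond.
        int[][] benzene = {{0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {5, 0}};
        boolean[] all = new boolean[6];
        Arrays.fill(all, true);
        System.out.println(Arrays.toString(assign(6, benzene, all)));
        // prints [true, false, true, false, true, false]
    }
}
```

The search returns only the first solution it finds, which matches the "at least one solution" use case. Note that for a five-membered ring like pyrrole, the answer depends entirely on flagging the nitrogen as not needing a double bond; with all five atoms flagged, the sketch returns null.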

Recently, Kevin and Klas set off in parallel to develop new implementations, with Kevin focusing on improving the DeduceBondSystemTool and Klas starting from the more general use case.

Kevin's new code was tested by Nina, and found to behave pretty well, with an error rate of well below 1%. Klas' code is still being developed, but I am very much looking forward to his code, as it is not limited to ring systems.

That said, Kevin's code has been merged into the cdk-1.4.x branch, will be part of the next release, and is ready to be used now. The basic use is pretty simple when starting with SMILES:

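  String smiles = "c2ccc3n([H])c1ccccc1c3(c2)";
  // The constructor argument was cut off in the source; SmilesParser takes an
  // IChemObjectBuilder, and DefaultChemObjectBuilder is assumed here.
  SmilesParser smilesParser = new SmilesParser(
    DefaultChemObjectBuilder.getInstance()
  );
  IMolecule molecule = smilesParser.parseSmiles(smiles);
  FixBondOrdersTool fbot = new FixBondOrdersTool();
  molecule = fbot.kekuliseAromaticRings(molecule);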

Thanx to Kevin for this great tool!