Sunday, December 30, 2012

CTR: resolving dependencies in Groovy with Grapes

+Mark Fortner commented on Google+ that my script was missing a @Grab statement. I had seen that mentioned before, but never looked at it. It turns out the be very useful, and it makes Groovy scripts standalone. That is, it will resolve the missing dependencies, using Maven repositories. Fortunately, CDK modules are available from repositories, e.g. the one at Plovdiv University, and I gave it a try. My first attempt went bad, but +Nina Jeliazkova explained me what Maven mistake I was making, and now I got a working setup:

    module='cdk-io', version='1.4.11'
    module='cdk-silent', version='1.4.11' 

So, depending on the exact script, something like the above will remove any need of setting CLASSPATHs. The above Groovy code is for counting heavy atoms.

Saturday, December 29, 2012

CTR #3: Report the similarity between two structures

The next CTR I picked is not particularly hard either, given the functionality provided by the CDK. In fact, the fingerprinting functionality I will use for this CTR is actually one of the most used and oldest features of the CDK. CiteULike has a list of 26 papers using the CDK fingerprinting functionality. The CDK 1.4.x API returns a Java BitSet and we can use the Tanimoto class to calculate the matching similarity values with it:
import org.openscience.cdk.fingerprint.*;
import org.openscience.cdk.smiles.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.similarity.*;

smilesParser = new SmilesParser(
smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"
smiles2 = "COC1=C(C=CC(=C1)C=O)O"
mol1 = smilesParser.parseSmiles(smiles1)
mol2 = smilesParser.parseSmiles(smiles2)
fingerprinter = new HybridizationFingerprinter()
bitset1 = fingerprinter.getFingerprint(mol1)
bitset2 = fingerprinter.getFingerprint(mol2)
tanimoto = Tanimoto.calculate(bitset1, bitset2)
println "Tanimoto: $tanimoto"

CTR #2: Depict a compound as an image

This one was relatively easy, and roughly based on the first CDK-JChemPaint tutorial. Key aspects are the SMILES parsing, 2D coordinate generation with the StructureDiagramGenerator. The solution does not render the structure's title yet. I have do not have a solution for that right now (the CDK may; I am not sure).

import java.util.List;
import java.awt.*;
import java.awt.image.*;
import javax.imageio.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.layout.*;
import org.openscience.cdk.renderer.*;
import org.openscience.cdk.renderer.font.*;
import org.openscience.cdk.renderer.generators.*;
import org.openscience.cdk.renderer.visitor.*;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.templates.*;
import org.openscience.cdk.renderer.generators.BasicSceneGenerator.Margin;
import org.openscience.cdk.renderer.generators.BasicSceneGenerator.ZoomFactor;

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
int WIDTH = 200
int HEIGHT = 250
Rectangle drawArea = new Rectangle(WIDTH, HEIGHT);
Image image = new BufferedImage(
smilesParser = new SmilesParser(
molecule = smilesParser.parseSmiles(smiles)
StructureDiagramGenerator sdg =
  new StructureDiagramGenerator();
molecule = sdg.getMolecule();
List generators =
  new ArrayList();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new BasicAtomGenerator());
AtomContainerRenderer renderer =
  new AtomContainerRenderer(
    generators, new AWTFontManager()
renderer.setup(molecule, drawArea);
model = renderer.getRenderer2DModel();
model.set(ZoomFactor.class, (double)0.9);
Graphics2D g2 = (Graphics2D)image.getGraphics();
g2.fillRect(0, 0, WIDTH, HEIGHT);
renderer.paint(molecule, new AWTDrawVisitor(g2));
  (RenderedImage)image, "PNG",
  new File("CTR2.png")

CTR #1: Heavy atom counts from an SD file

The first Chemistry Toolkit Rosetta task is to count the number of heavy atoms in the structures given in a MDL SD file. This tasks starts with an SD file and counts for each structure in the file the number of heavy atoms (non-hydrogen atoms). Because we simply handle the structures one by one, the solution uses the IteratingMDLReader reader. The input file (benzodiazepine.sdf.gz) is a gziped file, which we handle by using a GZIPInputStream. Because we want to make sure the input file does not have any unexpected content, we use the STRICT mode. The input file turns out to do not have non-standard features, so that we do not have to worry about D and T element symbols.

The solution lists all heavy atom counts:

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.silent.*;

iterator = new IteratingMDLReader(
  new GZIPInputStream(
    new File("benzodiazepine.sdf.gz")
while (iterator.hasNext()) {
  mol =
  heavyAtomCount = 0
  for (atom in mol.atoms()) {
    if (1 == atom.atomicNumber ||
        "H".equals(atom.symbol)) {
      // do not count hydrogens
    } else {
  println heavyAtomCount

Chemistry Toolkit Rosetta: common cheminformatics tasks

The Chemistry Toolkit Rosetta wiki was set up some time ago by Andrew Dalke to demonstrate how certain basic cheminformatics tasks are done in the various cheminformatics toolkits around. I think it is a great idea, but never found enough time to do much with it, unfortunately. But it is holiday now, which is a time to take your mind of your work, and then some random hacking with the CDK is what I like to do.

In a series of blog posts I will attempt to solve all 18 CTR tasks, and use this post to keep track of the coverage. The currently listed tasks are given in the below list, and link to the tasks description wiki page, where I will upload my solution. As I will use the solutions also as a chapter in my Groovy Cheminformatics with the Chemistry Development Kit book, I will use Groovy as the programming language. I will also create one blog post for each solution, giving details.

The CTR tasks:

Tuesday, December 25, 2012

Groovy Cheminformatics with the CDK: 7th edition

Already the 7th edition of my Groovy Cheminformatics with the Chemistry Development Kit book (and PDF eBook). It has been almost two years since the first release and has grown from an initial 72 pages to 212 pages today. There is still a lot I still want to write about, but only during the holidays I have time for it. New content includes:
  • Chapter 6. Reactions
  • Chapter 7. From IChemObject to IChemFile
  • Section 17.1.2. Stereoisomerism (in InChIs)
  • Rewrote Chapter 20. Documentation
  • Appendix D.1. The Readers and Writers (e.g. listing all IO options)
The latter looks like this section:

I also like to stress that all literature references in the book are freely available from CiteULike, and that page looks like:

The group's RSS feed gives a good idea on what I am writing about.