Saturday, December 29, 2012

CTR #3: Report the similarity between two structures

The next CTR I picked is not particularly hard either, given the functionality the CDK provides. In fact, the fingerprinting functionality I use for this CTR is one of the oldest and most used features of the CDK: CiteULike lists 26 papers that use it. In the CDK 1.4.x API, a fingerprinter returns a Java BitSet, and the Tanimoto class calculates the similarity between two such bit sets:
import org.openscience.cdk.fingerprint.*;
import org.openscience.cdk.smiles.*;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.similarity.*;

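// the constructor call was cut off in the original; the silent builder
// (imported above) is assumed to be the argument
smilesParser = new SmilesParser(SilentChemObjectBuilder.getInstance())
// the two structures to compare, as SMILES strings
smiles1 = "CC(C)C=CCCCCC(=O)NCc1ccc(c(c1)OC)O"
smiles2 = "COC1=C(C=CC(=C1)C=O)O"
mol1 = smilesParser.parseSmiles(smiles1)
mol2 = smilesParser.parseSmiles(smiles2)
// in CDK 1.4.x, getFingerprint() returns a java.util.BitSet
fingerprinter = new HybridizationFingerprinter()
bitset1 = fingerprinter.getFingerprint(mol1)
bitset2 = fingerprinter.getFingerprint(mol2)
// Tanimoto coefficient of the two fingerprints
tanimoto = Tanimoto.calculate(bitset1, bitset2)
println "Tanimoto: $tanimoto"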
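
To make clear what the Tanimoto value means for two bit sets, here is a minimal sketch in plain Java that uses only java.util.BitSet: the number of bits set in both fingerprints divided by the number of bits set in either. It is not the CDK's implementation. The class name TanimotoSketch, the toy 8-bit fingerprints, and the choice to return 0.0 for two empty fingerprints are assumptions made for illustration, and the CDK's Tanimoto class may treat such edge cases differently.

```java
import java.util.BitSet;

// Hypothetical helper class, not part of the CDK.
class TanimotoSketch {

    /** Tanimoto coefficient: bits set in both / bits set in either. */
    static double tanimoto(BitSet a, BitSet b) {
        // work on copies so the caller's fingerprints are not modified
        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        int union = either.cardinality();
        // Assumption: two empty fingerprints get similarity 0.0
        if (union == 0) {
            return 0.0;
        }
        return (double) both.cardinality() / union;
    }

    public static void main(String[] args) {
        // toy fingerprints standing in for real molecular fingerprints
        BitSet fp1 = new BitSet(8);
        fp1.set(0);
        fp1.set(2);
        fp1.set(5);
        BitSet fp2 = new BitSet(8);
        fp2.set(2);
        fp2.set(5);
        fp2.set(7);
        // 2 shared bits out of 4 distinct bits
        System.out.println("Tanimoto: " + tanimoto(fp1, fp2));
    }
}
```

The toy fingerprints share two of their four distinct bits, so the sketch reports 0.5. Identical fingerprints give 1.0, and fingerprints with no bits in common give 0.0.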


  1. It might be worth adding that this is a probabilistic similarity, and that to get a 'true' similarity one should use subgraph isomorphism and the smsd module. That would of course be a separate topic altogether, but perhaps it is important to make the distinction.

    1. Absolutely true. Mind you, a 'true' similarity may be nice from the chemical graph perspective, but may not be more relevant to chemical, physical, or biological property prediction.