## Saturday, October 31, 2015

### How to test SMILES strings in Supplementary Information

When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is, in order to correctly calculate descriptors. So, let's start there. It does not matter so much which toolkit you use and I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(
SilentChemObjectBuilder.getInstance()
);
matcher = CDKAtomTypeMatcher.getInstance(
SilentChemObjectBuilder.getInstance()
);

new File("suppinfo.tsv").eachLine { line ->
fields = line.split(/\t/)
id = fields[0]
smiles = fields[1]
if (smiles != "SMILES") { // header line
mol = parser.parseSmiles(smiles)
println "$id ->$smiles";

// check CDK atom types
types = matcher.findMatchingAtomTypes(mol);
types.each { type ->
if (type == null) {
report += "  no CDK atom type\n"
} else {
println "  atom type: " + type.atomTypeName
}
}
}
}

This gives output like:

mo1 -> COC
atom type: C.sp3
atom type: O.sp3
atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
fields = line.split(/\t/)
id = fields[0]
smiles = fields[1]
if (smiles != "SMILES") {
mol = parser.parseSmiles(smiles)
errors = 0
report = ""

// check CDK atom types
types = matcher.findMatchingAtomTypes(mol);
types.each { type ->
if (type == null) {
errors += 1;
report += "  no CDK atom type\n"
}
}

// report
if (errors > 0) {
println "$id ->$smiles";
print report;
}
}
}

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
fields = line.split(/\t/)
id = fields[0]
smiles = fields[1]
if (smiles != "SMILES") {
mol = parser.parseSmiles(smiles)

// check InChI warnings
generator = factory.getInChIGenerator(mol);
if (generator.returnStatus != INCHI_RET.OKAY) {
println "$id ->$smiles";
println generator.message;
}
}
}

The advantage of doing this, is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?