Pages

Saturday, October 31, 2015

So, now you have SMILES that are faulty... visualize them?

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I blogged previously already about using AMBIT for SMILES depiction (which uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May released a CDK-only tool, called CDK Depict. The download section offers a jar file and a war for easy deployment in a Tomcat environment. But for the impatient, there is also this online host where you can give it a try (it may go offline at some point?).


Just copy/paste your shortlist there, and visually see what is wrong with them :) Big HT to John for doing all these awesome things!

How to test SMILES strings in Supplementary Information

Source. License: CC-BY 2.0.
When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is, in order to correctly calculate descriptors. So, let's start there. It does not matter so much which toolkit you use and I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(
  SilentChemObjectBuilder.getInstance()
);
matcher = CDKAtomTypeMatcher.getInstance(
  SilentChemObjectBuilder.getInstance()
);

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)
    println "$id -> $smiles";

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        report += "  no CDK atom type\n"
      } else {
        println "  atom type: " + type.atomTypeName
      }
    }
  }
}

This gives output like:

mo1 -> COC
  atom type: C.sp3
  atom type: O.sp3
  atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)
    errors = 0
    report = ""

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        errors += 1;
        report += "  no CDK atom type\n"
      }
    }

    // report
    if (errors > 0) {
      println "$id -> $smiles";
      print report;
    }
  }
}

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)

    // check InChI warnings
    generator = factory.getInChIGenerator(mol);
    if (generator.returnStatus != INCHI_RET.OKAY) {
      println "$id -> $smiles";
      println generator.message;
    }
  }
}

The advantage of doing this, is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
  Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?