Sunday, December 27, 2015

The quality of SMILES strings in Wikidata

Russian Wikipedia on tungsten hexacarbonyl.
One thing that machine readability adds, is all sorts of machine processing. Validation of data consistency is one. For SMILES strings, one of the things you can do is test of the string parses at all. Wikidata is machine readable, and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
  1. SPARQL for all SMILES strings
  2. process each one of them with the CDK SMILES parser
I can do both easily in Bioclipse with an integrated script:

identifier = "P233" // SMILES
type = "smiles"

sparql = """
PREFIX wdt: <>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
mappings = rdf.sparqlRemote("", sparql)

outFilename = "/Wikidata/badWikidataSMILES.txt"
if (ui.fileExists(outFilename)) ui.remove(outFilename)
fileContent = ""
for (i=1; i<=mappings.rowCount; i++) {
  try {
    wdID = mappings.get(i, "compound")
    smiles = mappings.get(i, "smiles")
    mol = cdk.fromSMILES(smiles)
  } catch (Throwable exception) {
    fileContent += (wdID + "," + smiles + ": " +

                   exception.message + "\n")
  if (i % 1000 == 0) js.say("" + i)
ui.append(outFilename, fileContent)

It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean they are correct, but it does mean the are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunite to work in that Wikipedia instance too :)

At this moment, some 19 SMILES still need fixing (the list will chance over time, so by the time you read this...):