Figure: Russian Wikipedia on tungsten hexacarbonyl.
Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
- SPARQL for all SMILES strings
- process each one of them with the CDK SMILES parser
// Wikidata property to check
identifier = "P233" // SMILES
type = "smiles"

// fetch every compound that has a SMILES statement
sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
"""
mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)

// start with a fresh report file
outFilename = "/Wikidata/badWikidataSMILES.txt"
if (ui.fileExists(outFilename)) ui.remove(outFilename)
fileContent = ""

// try to parse each SMILES with the CDK and record the failures
for (i=1; i<=mappings.rowCount; i++) {
  try {
    wdID = mappings.get(i, "compound")
    smiles = mappings.get(i, "smiles")
    mol = cdk.fromSMILES(smiles)
  } catch (Throwable exception) {
    fileContent += (wdID + "," + smiles + ": " +
                    exception.message + "\n")
  }
  if (i % 1000 == 0) js.say("" + i) // progress report
}
ui.append(outFilename, fileContent)
ui.open(outFilename)
It turns out that of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)
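To illustrate why parsing is only a syntax check, here is a minimal sketch in plain Groovy, outside Bioclipse. It assumes the CDK can be pulled from Maven Central as `org.openscience.cdk:cdk-bundle:2.9` (the version is my assumption, not something this test used), and `parseError` is a hypothetical helper name introduced just for this example.

```groovy
// Assumption: CDK 2.9 from Maven Central; other recent CDK releases should behave similarly.
@Grab('org.openscience.cdk:cdk-bundle:2.9')
import org.openscience.cdk.exception.InvalidSmilesException
import org.openscience.cdk.silent.SilentChemObjectBuilder
import org.openscience.cdk.smiles.SmilesParser

// Hypothetical helper: returns null when the string parses,
// otherwise the message (or class name) of the parser's exception.
String parseError(String smiles) {
    def parser = new SmilesParser(SilentChemObjectBuilder.getInstance())
    try {
        parser.parseSmiles(smiles)
        return null
    } catch (InvalidSmilesException exception) {
        return exception.message ?: exception.class.simpleName
    }
}

// "CCO" is syntactically fine; the parser cannot tell whether it is
// attached to the right Wikidata item.
println "CCO  -> " + (parseError("CCO") ?: "parsed")

// Ring bond 1 is opened but never closed, so the parser rejects this string.
println "C1CC -> " + (parseError("C1CC") ?: "parsed")
```

A string like the first one would pass the test above even if it described the wrong compound, which is why the strings that parse still need a chemical check.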
At this moment, some 19 SMILES still need fixing (the list will change over time, so by the time you read this...):
