Saturday, December 30, 2017

Adding SMILES, InChI, etc to Wikidata alkane pages

Ten alkanes in Wikidata. The ones without a CAS registry
number previously did not have an InChIKey or
PubChem CID. But no more; I added those.

While working on the 'chemical class' aspect for Scholia yesterday, I noticed that the page for alkanes was quite large, with a list of more than 50 long-chain alkanes that have pages in the Japanese Wikipedia but no SMILES, InChI, InChIKey, etc.

So, I dug up my Bioclipse scripts for adding chemicals to Wikidata starting from a SMILES (btw, the script has significantly evolved since) and extended the query of that Scholia aspect to list just the Wikidata Q-code and name. This script starts with one or more SMILES strings and generates QuickStatements (a must-learn).
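
To give an idea of what such output looks like, here is a minimal sketch in plain Groovy (no Bioclipse needed) that turns a Q-code and a few identifiers into QuickStatements lines. It is not the actual Bioclipse script: the helper name `toQuickStatements` is made up for illustration, and the Q-code used is the Wikidata Sandbox item rather than a real alkane entry. The sketch assumes the tab-separated QuickStatements syntax with string values in double quotes, and the Wikidata properties for canonical SMILES (P233), InChI (P234), InChIKey (P235), and PubChem CID (P662).

```groovy
// Hypothetical helper (not from the original script): builds one
// tab-separated QuickStatements line per known identifier.
def toQuickStatements = { String qid, Map<String, String> ids ->
    // Wikidata properties for the identifiers discussed in this post.
    def props = [smiles: "P233", inchi: "P234", inchikey: "P235", pubchemCid: "P662"]
    // Keep only the identifiers we know a property for and that have a value.
    def known = ids.findAll { key, value -> props.containsKey(key) && value }
    // String values are wrapped in double quotes.
    return known.collect { key, value -> "${qid}\t${props[key]}\t\"${value}\"".toString() }
}

// Q4115189 is the Wikidata Sandbox item, used here as a stand-in Q-code.
// The SMILES and InChIKey are those of propane.
def lines = toQuickStatements("Q4115189", [
    smiles  : "CCC",
    inchikey: "ATUOYWHBWRKTHZ-UHFFFAOYSA-N"
])
lines.each { println it }
```

The resulting lines can be pasted into QuickStatements as a batch, one statement per line.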

Because the Wikidata entries also had the English IUPAC name, I could use that to autogenerate the SMILES. Enter the OPSIN (doi:10.1021/ci100384d) plugin for Bioclipse, which in combination with the CDK allowed me to create the matching SMILES, InChI, and InChIKey, and to use the latter to look up the PubChem compound identifier (CID). This is the script I ended up with:

inputFile = "/Wikidata/Alkanes/alkanes.tsv"
// each line of the TSV: Wikidata entity, IUPAC name, and optionally an InChIKey
new File(bioclipse.fullPath(inputFile)).eachLine { line ->
  fields = line.split("\t")
  if (fields[0].startsWith("")) {
    wdid = fields[0].substring("".length())
    name = fields[1]
    if (fields.length > 2) { // skip entities that already have an InChIKey
      inchikey = fields[2]
      // println "Skipping: $wdid $inchikey"
    } else { // ok, consider adding it
      // println "Considering $wdid $name"
      try {
        // convert the IUPAC name into a structure, then into a SMILES
        mol = opsin.parseIUPACName(name)
        smiles = cdk.calculateSMILES(mol)
        //println "  SMILES: $smiles"
        println "${smiles}\t${wdid}"
      } catch (Exception error) {
        //println "Could not parse $name with OPSIN: ${error.message}"
      }
    }
  }
}
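
The script above only prints the SMILES and the Q-code; it does not show the InChIKey-to-CID lookup. One possible way to do that lookup outside Bioclipse is the PubChem PUG REST service, sketched below in plain Groovy. This is an assumption on my side about how it could be done, not how the Bioclipse script does it, and the helper names `pubchemCidUrl` and `lookupCids` are made up for illustration.

```groovy
// Hypothetical helper: builds the PUG REST URL that lists the CIDs
// matching an InChIKey, as plain text with one CID per line.
def pubchemCidUrl = { String inchikey ->
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/${inchikey}/cids/TXT".toString()
}

// Hypothetical helper: fetches that URL and returns the non-empty lines.
// An InChIKey can match more than one CID, so a list is returned.
def lookupCids = { String inchikey ->
    new URL(pubchemCidUrl(inchikey)).text.readLines()
        .collect { it.trim() }
        .findAll { it }
}

// InChIKey of propane, used as example input.
println pubchemCidUrl("ATUOYWHBWRKTHZ-UHFFFAOYSA-N")
// lookupCids(...) needs network access, so it is not called here.
```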

That way, I ended up with changes like this:
