Sunday, July 01, 2018

LIPID MAPS identifiers and endocannabinoids

Maybe I will find some more time later, but for now just a quick note about an open notebook I kept yesterday while adding more LIPID MAPS identifiers to Wikidata. It started with a node in a WikiPathways pathway that did not have an identifier: endocannabinoids.
This is why I am interested in Wikidata, as I can mint entries there myself (see this ICCS 2018 poster). And so I did, but when adding a chemical class, you want specific compounds from that class too. That's where LIPID MAPS comes in, because it has information on specific compounds in that class.

Some time ago I asked about adding more LIPID MAPS identifiers to Wikidata, which has a lot of benefits for both the community and LIPID MAPS. I was informed I could use their REST API to get mappings between InChIKeys and their identifiers, and that is enough for me to add more of their identifiers to Wikidata (an approach similar to the one I used for the EPA CompTox Dashboard and SPLASHes). One advantage is that LIPID MAPS can now easily get the data needed to add links to the PDB and MassBank to their lipid database (and much more).

My advantage is that I can easily query whether a particular compound is a specific endocannabinoid. I created two Bioclipse scripts, one of which looks like this:

// ask permission to use data from their REST API (I did and got it)

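restAPI = "" // URL of the LIPID MAPS REST API download goes here
propID = "P2063" // Wikidata property for the LIPID MAPS ID

// download the full LIPID MAPS export into the Bioclipse workspace
allData = bioclipse.downloadAsFile(
  restAPI, "/LipidMaps/lipidmaps.txt"
)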

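// all Wikidata items with an InChIKey (P235) but without a LIPID MAPS ID
sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (substr(str(?compound),32) as ?wd) ?key ?lmid WHERE {
  ?compound wdt:P235 ?key .
  MINUS { ?compound wdt:${propID} ?lmid . }
}
"""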

if (bioclipse.isOnline()) {
  // the first argument is the URL of the SPARQL endpoint
  results = rdf.sparqlRemote(
    "", sparql
  )
}

// remove an existing file so that output starts from scratch
def renewFile(file) {
  if (ui.fileExists(file)) ui.remove(file)
  return file
}

mappingsFile = "/LipidMaps/mappings.txt"
missingCompoundFile = "/LipidMaps/missing.txt"

// ignore certain Wikidata items, where I don't want the LIPID MAPS ID added
ignores = new java.util.HashSet();
// ignores.add("Q37111097")

// make a map from InChIKey to Wikidata item ID
map = new HashMap()
for (i=1;i<=results.rowCount;i++) {
  rowVals = results.getRow(i)
  map.put(rowVals[1], rowVals[0])
}

batchSize = 500
batchCounter = 0
mappingContent = ""
missingContent = ""
print "Saved a batch"
new File(bioclipse.fullPath("/LipidMaps/lipidmaps.txt")).eachLine{ line ->
  fields = line.split("\t")
  if (fields.length > 15) {
    lmid = fields[1]
    inchikey = fields[15]
    if (inchikey != null && inchikey.length() > 10) {
      batchCounter++ // count every record that has a usable InChIKey
      if (map.containsKey(inchikey)) {
        wdid = map.get(inchikey)
        if (!ignores.contains(wdid)) {
          // QuickStatements line: item, property, value, followed by the reference
          mappingContent += "${wdid}\t${propID}\t\"${lmid}\"\tS143\tQ20968889\tS854\t\"\"\tS813\t+2018-06-30T00:00:00Z/11\n"
        }
      } else {
        missingContent += "${inchikey}\n"
      }
    }
  }
  // write the collected lines to disk once per batch
  if (batchCounter >= batchSize) {
    ui.append(mappingsFile, mappingContent)
    ui.append(missingCompoundFile, missingContent)
    batchCounter = 0
    mappingContent = ""
    missingContent = ""
    print "."
  }
}
// write whatever is left over from the last, incomplete batch
ui.append(mappingsFile, mappingContent)
ui.append(missingCompoundFile, missingContent)
println "\n"
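
The matching step at the heart of that script can also be tried outside Bioclipse. What follows is only a minimal sketch in plain Groovy, not the script I actually ran: the helper names (quickStatement, matchRows, row) are made up for illustration, the identifiers in the sample rows are placeholders rather than real records, and the reference columns of the QuickStatements line are left out. It assumes the same layout as above, with the LIPID MAPS ID in column 1 and the InChIKey in column 15.

```groovy
// Build one QuickStatements line: item, property, quoted value.
// (Hypothetical helper; the reference columns are omitted here.)
String quickStatement(String wdid, String propID, String lmid) {
  return "${wdid}\t${propID}\t\"${lmid}\"".toString()
}

// Split LIPID MAPS rows into statements to add and InChIKeys that
// have no Wikidata item yet. keyToItem maps InChIKey to item ID.
def matchRows(List<String> rows, Map<String, String> keyToItem,
              Set<String> ignores, String propID) {
  def statements = []
  def missing = []
  rows.each { line ->
    def fields = line.split("\t")
    if (fields.length > 15) {
      def lmid = fields[1]
      def inchikey = fields[15]
      if (inchikey != null && inchikey.length() > 10) {
        if (keyToItem.containsKey(inchikey)) {
          def wdid = keyToItem[inchikey]
          if (!ignores.contains(wdid)) {
            statements << quickStatement(wdid, propID, lmid)
          }
        } else {
          missing << inchikey
        }
      }
    }
  }
  return [statements: statements, missing: missing]
}

// Made-up sample rows with 16 tab-separated columns each.
def row = { String lmid, String key ->
  (["x", lmid] + ["-"] * 13 + [key]).join("\t")
}
def rows = [
  row("LM_PLACEHOLDER_1", "AAAAAAAAAAAAAA-AAAAAAAAAA-N"),
  row("LM_PLACEHOLDER_2", "BBBBBBBBBBBBBB-BBBBBBBBBB-N")
]
def known = ["AAAAAAAAAAAAAA-AAAAAAAAAA-N": "Q_PLACEHOLDER"]

def result = matchRows(rows, known, [] as Set, "P2063")
result.statements.each { println it }  // lines ready for QuickStatements
result.missing.each { println it }     // InChIKeys without a Wikidata item
```

Keeping the matching in a function that only takes plain collections makes it easy to check the logic on a handful of rows before running it on the full download.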

With that, I managed to increase the number of LIPID MAPS identifiers from 2333 to 6099, but there are an additional 38 thousand lipids not yet in Wikidata.
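
Numbers like these, and the question of whether a compound belongs to a class, can be checked with SPARQL. The following is a rough sketch in plain Groovy that only builds the query strings; it does not send them anywhere. The function names countQuery and memberQuery are made up, and the second query assumes that class membership is modelled with "instance of" (P31) or "subclass of" (P279), which may not match how every compound is annotated. The class item is passed in as a parameter, because its identifier is not given here.

```groovy
// Count the Wikidata items that carry a given identifier property,
// for example P2063 for the LIPID MAPS ID.
String countQuery(String propID) {
  return """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (COUNT(DISTINCT ?compound) AS ?count) WHERE {
  ?compound wdt:${propID} ?lmid .
}
""".toString()
}

// Ask whether the compound with this InChIKey (P235) is linked to a
// class item. Assumption: the link uses P31 or P279 directly.
String memberQuery(String inchikey, String classQid) {
  return """\
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
ASK {
  ?compound wdt:P235 "${inchikey}" ;
            wdt:P31|wdt:P279 wd:${classQid} .
}
""".toString()
}

println countQuery("P2063")
// Q_CLASS is a placeholder for the item that represents the class.
println memberQuery("AAAAAAAAAAAAAA-AAAAAAAAAA-N", "Q_CLASS")
```

The resulting strings can be pasted into a SPARQL endpoint for Wikidata, or passed to rdf.sparqlRemote() in the same way as the query in the script above.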

Many more details can be found in my notebook, but in the end I got a nice Scholia page for endocannabinoids :)