Monday, May 06, 2013

BridgeDB identity mapping file for metabolites

Based on HMDB 3.0 (January data; see the reference at the end), I have updated the identity mapping file for metabolites for BridgeDB. It can be downloaded here as a pre-release. This post documents how I created the file. I have placed the script online at GitHub so that anyone can inspect the code and build custom BridgeDB Derby files too (for example, large CAS registry number mapping files).

The code is some 100 lines long. It processes the input .zip by iterating over the XML files it contains:


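import java.util.zip.ZipFile

// the file name is an assumption; point this at the downloaded HMDB archive
def zipFile = new ZipFile(new File("hmdb_metabolites.zip"))

zipFile.entries().each { entry ->
  if (!entry.isDirectory()) {
    println entry.name

    // parse one metabolite record
    def inputStream = zipFile.getInputStream(entry)
    def rootNode = new XmlSlurper().parse(inputStream)

    // the HMDB accession identifies the record
    String rootid = rootNode.accession.toString()
  }
}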
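
For readers who want to try the loop without the HMDB download, here is a self-contained sketch of the same pattern. It is not the code from the script: the one-entry archive and its XML content are made up for illustration, and the accession is read with the JDK's DOM parser instead of XmlSlurper so that the sketch does not depend on which Groovy version provides XmlSlurper.

```groovy
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream
import javax.xml.parsers.DocumentBuilderFactory

// Build a tiny stand-in archive; the real input is the HMDB download.
File archive = File.createTempFile("hmdb_sample", ".zip")
archive.deleteOnExit()
def zos = new ZipOutputStream(new FileOutputStream(archive))
zos.putNextEntry(new ZipEntry("HMDB00001.xml"))
zos.write("<metabolite><accession>HMDB00001</accession></metabolite>".getBytes("UTF-8"))
zos.closeEntry()
zos.close()

// Collect the accession found in each XML entry.
def accessions = []
def zipFile = new ZipFile(archive)
try {
  zipFile.entries().each { entry ->
    if (!entry.isDirectory()) {
      def stream = zipFile.getInputStream(entry)
      def doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(stream)
      accessions << doc.getElementsByTagName("accession").item(0).textContent
      stream.close()
    }
  }
} finally {
  // release the file handle even if parsing fails
  zipFile.close()
}
println accessions
```

Closing the ZipFile in a finally block is a design choice of this sketch; for a one-off conversion script it matters little, but it avoids leaking the file handle when an entry fails to parse.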


For each file, the script creates an entry in the BridgeDB database:


  // register the HMDB accession as an entry
  Xref ref = new Xref(rootid, BioDataSource.HMDB);
  error = database.addGene(ref);

  // add the symbol and the synonyms
  addAttribute(
    database, ref, "Symbol",
    rootNode.common_name.toString()
  );
  addAttribute(
    database, ref, "Synonym",
    rootNode.traditional_iupac.toString()
  );
  addAttribute(
    database, ref, "Synonym",
    rootNode.iupac_name.toString()
  );

  // add the SMILES, InChIKey, etc
  addAttribute(
    database, ref, "InChIKey",
    cleanKey(rootNode.inchikey.toString())
  );
  addAttribute(
    database, ref, "SMILES",
    rootNode.smiles.toString()
  );
  addAttribute(
    database, ref, "BrutoFormula",
    rootNode.chemical_formula.toString()
  );
  addAttribute(
    database, ref, "Taxonomy Parent",
    rootNode.direct_parent.toString()
  );
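
The snippet above relies on two helpers, cleanKey and addAttribute, that are not shown in this post. The sketch below is only my guess at the kind of clean-up they might do and should not be read as the script's actual code: it assumes that the InChIKey text may carry an "InChIKey=" prefix and that blank values should not be stored. The name usableValue is hypothetical.

```groovy
// Hypothetical helper: strip surrounding whitespace and an optional "InChIKey=" prefix.
String cleanKey(String raw) {
  String key = raw.trim()
  String prefix = "InChIKey="
  key.startsWith(prefix) ? key.substring(prefix.length()) : key
}

// Hypothetical guard: return the trimmed value, or null when there is nothing to store.
String usableValue(String raw) {
  String value = raw?.trim()
  value ? value : null
}

// The sample value is the InChIKey of water.
println cleanKey("InChIKey=XLYOFNOQVPJJNP-UHFFFAOYSA-N")
println usableValue("   ")
```

A wrapper around database.addAttribute could call usableValue first and skip the insert when it returns null, so that records with missing fields do not produce empty attributes.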


These are some basic properties. Adding the identifier mappings looks like this:


  // add external identifiers
  addXRef(
    database, ref,
    rootNode.accession.toString(), BioDataSource.HMDB
  );
  addXRef(
    database, ref, 
    rootNode.cas_registry_number.toString(), casDS
  );
  addXRef(
    database, ref, 
    rootNode.inchi.toString(), inchiDS
  );
  addXRef(
    database, ref, 
    rootNode.chemspider_id.toString(), chemspiderDS
  );
  addXRef(
    database, ref, 
    rootNode.pubchem_compound_id.toString(), pubchemDS
  );
  addXRef(
    database, ref, 
    rootNode.chebi_id.toString(), chebiDS
  );
  addXRef(
    database, ref, 
    rootNode.kegg_id.toString(), keggDS
  );
  addXRef(
    database, ref, 
    rootNode.wikipedia.toString(), wikipediaDS
  );
  addXRef( 
    database, ref, 
    rootNode.drugbank_id.toString(), drugbankDS
  );
  addXRef(
    database, ref, 
    rootNode.nugowiki.toString(), nugoDS
  );
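
Not every HMDB record has an identifier for every data source, so a helper like addXRef presumably has to cope with empty values. The sketch below illustrates one way to do that, under the assumption that blank identifiers should simply be skipped; it uses a plain map as a stand-in for a parsed record (the values are the identifiers for water) and does not touch BridgeDB at all. The data source objects used above (casDS, chebiDS, and so on) are defined elsewhere in the full script.

```groovy
// Hypothetical stand-in for one parsed record: source label -> raw identifier text.
def record = [
  "CAS"     : "7732-18-5",
  "ChEBI"   : " 15377 ",
  "KEGG"    : "C00001",
  "DrugBank": ""            // no identifier available for this source
]

// Trim every identifier, then keep only the non-blank ones.
def trimmed = record.collectEntries { source, id -> [(source): id.trim()] }
def mappings = trimmed.findAll { source, id -> id }

mappings.each { source, id -> println "${source}: ${id}" }
```

Filtering before insertion keeps the mapping database free of cross-references with empty identifiers, which would otherwise all map to each other.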


You can thus see that this new metabolites identity mapping file has mappings to various data sources, including ChemSpider, PubChem, ChEBI, KEGG, Wikipedia, DrugBank, and a few others. All these mappings are provided by HMDB 3.0.

The Derby database file is the format that PathVisio can read natively. This is the code the script uses to create and finalize (that is, save) this file:


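import java.text.SimpleDateFormat
import org.bridgedb.rdb.construct.DBConnector
import org.bridgedb.rdb.construct.DataDerby
import org.bridgedb.rdb.construct.GdbConstruct
import org.bridgedb.rdb.construct.GdbConstructImpl3

// create the Derby database, replacing any existing one
GdbConstruct database = GdbConstructImpl3.createInstance(
  "hmdb_metabolites", new DataDerby(),
  DBConnector.PROP_RECREATE
);

database.createGdbTables();
database.preInsert();

// record the build metadata
String dateStr = new SimpleDateFormat("yyyyMMdd").format(new Date());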
database.setInfo("BUILDDATE", dateStr);
database.setInfo("DATASOURCENAME", "HMDB3");
database.setInfo(
  "DATASOURCEVERSION",
  "hmdb_metabolites_" + dateStr
);
database.setInfo("DATATYPE", "Metabolite");
database.setInfo("SERIES", "standard_metabolite");

// process the zip file

database.commit();
database.finalize();


There are a few more details, all of which can be found in the full script.

Wishart, D. S. et al. Nucleic Acids Research 2013, 41, D801-D807.