The code is some 100 lines long. The input .zip is processed by iterating over the XML files it contains:
zipFile.entries().each { entry ->
  if (!entry.isDirectory()) {
    println entry.name
    inputStream = zipFile.getInputStream(entry)
    def rootNode = new XmlSlurper().parse(inputStream)
    String rootid = rootNode.accession.toString()
  }
}
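To try the loop without downloading the HMDB archive, here is a minimal, self-contained sketch. The temporary archive, its file names, and its XML content are made up for illustration; only the iteration pattern comes from the script. It assumes Groovy 3 or later, where XmlSlurper is imported from groovy.xml (on Groovy 2.x the class lives in groovy.util and needs no import).

```groovy
import groovy.xml.XmlSlurper
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream

// Build a tiny throwaway archive so the sketch runs on its own.
// The two entries below are invented test data, not real HMDB records.
File tmp = File.createTempFile("hmdb_sketch", ".zip")
tmp.deleteOnExit()
new ZipOutputStream(new FileOutputStream(tmp)).withStream { zos ->
    ["HMDB00001", "HMDB00002"].each { id ->
        zos.putNextEntry(new ZipEntry(id + ".xml"))
        String xml = "<metabolite><accession>" + id + "</accession></metabolite>"
        zos.write(xml.getBytes("UTF-8"))
        zos.closeEntry()
    }
}

// Same pattern as in the script: visit every non-directory entry,
// parse it as XML, and read the accession element.
def accessions = []
def zipFile = new ZipFile(tmp)
try {
    zipFile.entries().each { entry ->
        if (!entry.isDirectory()) {
            zipFile.getInputStream(entry).withStream { inputStream ->
                def rootNode = new XmlSlurper().parse(inputStream)
                accessions << rootNode.accession.toString()
            }
        }
    }
} finally {
    zipFile.close()
}
println accessions
```

Unlike the excerpt above, this sketch closes each input stream and the zip file explicitly, which matters little for a one-off script but avoids leaking file handles on large archives.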
For each file it creates an entry in the BridgeDB data:
Xref ref = new Xref(rootid, BioDataSource.HMDB);
error = database.addGene(ref);
// add the synonyms
addAttribute(
database, ref, "Symbol",
rootNode.common_name.toString()
);
addAttribute(
database, ref, "Synonym",
rootNode.traditional_iupac.toString()
);
addAttribute(
database, ref, "Synonym",
rootNode.iupac_name.toString()
);
// add the SMILES, InChIKey, etc
addAttribute(
database, ref, "InChIKey",
cleanKey(rootNode.inchikey.toString())
);
addAttribute(
database, ref, "SMILES",
rootNode.smiles.toString()
);
addAttribute(
database, ref, "BrutoFormula",
rootNode.chemical_formula.toString()
);
addAttribute(
database, ref, "Taxonomy Parent",
rootNode.direct_parent.toString()
);
These are some basic properties, and adding the mappings looks like:
// add external identifiers
addXRef(
database, ref,
rootNode.accession.toString(), BioDataSource.HMDB
);
addXRef(
database, ref,
rootNode.cas_registry_number.toString(), casDS
);
addXRef(
database, ref,
rootNode.inchi.toString(), inchiDS
);
addXRef(
database, ref,
rootNode.chemspider_id.toString(), chemspiderDS
);
addXRef(
database, ref,
rootNode.pubchem_compound_id.toString(), pubchemDS
);
addXRef(
database, ref,
rootNode.chebi_id.toString(), chebiDS
);
addXRef(
database, ref,
rootNode.kegg_id.toString(), keggDS
);
addXRef(
database, ref,
rootNode.wikipedia.toString(), wikipediaDS
);
addXRef(
database, ref,
rootNode.drugbank_id.toString(), drugbankDS
);
addXRef(
database, ref,
rootNode.nugowiki.toString(), nugoDS
);
You can thus see that this new metabolite identifier mapping file has mappings to various data sources, including ChemSpider, PubChem, ChEBI, KEGG, Wikipedia, DrugBank, and a few others. All these mappings are provided by HMDB 3.0.
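The addAttribute and addXRef helpers themselves are not shown in this post. One thing they presumably have to handle is that XmlSlurper returns an empty string for an element that is missing from a record, so not every metabolite yields every attribute or mapping. The following is only a sketch of that guard, under stated assumptions: FakeStore is a hypothetical in-memory stand-in for the BridgeDb database, the helper signatures are simplified, and the XML values are invented.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical stand-in for the BridgeDb database; it only records
// what would have been stored, so the sketch runs without the library.
class FakeStore {
    List<String> attributes = []
    List<String> links = []
}

// Assumed helper shape: skip values that are empty after trimming.
def addAttribute(FakeStore store, String key, String value) {
    String v = value.trim()
    if (v.length() > 0) store.attributes << (key + "=" + v)
}

def addXRef(FakeStore store, String id, String sourceName) {
    String v = id.trim()
    if (v.length() > 0) store.links << (sourceName + ":" + v)
}

// Invented record: it has a name and a KEGG identifier, but no ChEBI
// identifier and no SMILES.
def rootNode = new XmlSlurper().parseText(
    "<metabolite><accession>HMDB00001</accession>" +
    "<common_name>Example metabolite</common_name>" +
    "<kegg_id>C00000</kegg_id></metabolite>")

def store = new FakeStore()
addAttribute(store, "Symbol", rootNode.common_name.toString())
addAttribute(store, "SMILES", rootNode.smiles.toString())   // missing, skipped
addXRef(store, rootNode.kegg_id.toString(), "KEGG")
addXRef(store, rootNode.chebi_id.toString(), "ChEBI")       // missing, skipped

println store.attributes
println store.links
```

Without such a guard, every missing element would produce an empty attribute or a mapping to an empty identifier.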
A Derby database file is the format that PathVisio can read natively. This is the magic the script uses to create and finalize (that is, save) this file:
GdbConstruct database = GdbConstructImpl3.createInstance(
"hmdb_metabolites", new DataDerby(),
DBConnector.PROP_RECREATE
);
database.createGdbTables();
database.preInsert();
String dateStr = new SimpleDateFormat("yyyyMMdd").format(new Date());
database.setInfo("BUILDDATE", dateStr);
database.setInfo("DATASOURCENAME", "HMDB3");
database.setInfo(
"DATASOURCEVERSION",
"hmdb_metabolites_" + dateStr
);
database.setInfo("DATATYPE", "Metabolite");
database.setInfo("SERIES", "standard_metabolite");
// process the zip file
database.commit();
database.finalize();
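As a small illustration of the metadata values, this sketch shows what the date stamp and the resulting DATASOURCEVERSION string look like. It uses a fixed, arbitrarily chosen date instead of new Date() so the result is reproducible, and it assumes a default locale with the Gregorian calendar.

```groovy
import java.text.SimpleDateFormat

// Fixed example date (15 January 2013), chosen arbitrarily for the sketch;
// the script itself uses the current date.
Date buildDate = new GregorianCalendar(2013, Calendar.JANUARY, 15).time

// Same pattern as in the script: year, month, day without separators.
String dateStr = new SimpleDateFormat("yyyyMMdd").format(buildDate)
String version = "hmdb_metabolites_" + dateStr

println dateStr
println version
```

Because the pattern puts the year first, the version strings of successive builds sort chronologically.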
There are a few more details, but all of them can be found in the full script.