Monday, September 12, 2016

Metabolite identifier mapping databases

Caffeine metabolites. Source: Wikimedia.
If you want to map experimental data to (digital) biological pathways, you need to know which measured datum matches which metabolite in the pathways (that also applies to transcriptomics and proteomics data, of course). However, if a pathway does not use identifiers from a single database, or your analysis platform outputs data with CAS registry numbers, then you need something like identifier mapping. In Maastricht we use BridgeDb for that, and I develop the metabolite identifier mapping databases, which provide the mapping data to BridgeDb, which in turn performs the mapping.
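To make the idea concrete, here is a minimal Python sketch of what identifier mapping boils down to. It is a toy lookup, not the BridgeDb API and not how the mapping databases are stored: the `LINKS` table, the database labels, and the function names are all assumptions made for illustration, and following links transitively is a simplification of mine. The caffeine identifiers are only there as an example.

```python
from collections import defaultdict

# Hypothetical mapping rows: each row links two (database, identifier) pairs
# that denote the same metabolite. Caffeine is used purely as an illustration.
LINKS = [
    (("CAS", "58-08-2"), ("HMDB", "HMDB01847")),
    (("HMDB", "HMDB01847"), ("ChEBI", "CHEBI:27732")),
    (("HMDB", "HMDB01847"), ("PubChem-compound", "2519")),
]


def build_index(links):
    """Index the links in both directions so any identifier can be a start point."""
    index = defaultdict(set)
    for left, right in links:
        index[left].add(right)
        index[right].add(left)
    return index


def map_identifier(index, source, target_db):
    """Return all identifiers in target_db reachable from source.

    Links are followed transitively (an assumption of this sketch), so a CAS
    number can reach a ChEBI identifier via a shared HMDB entry.
    """
    seen = {source}
    stack = [source]
    while stack:
        current = stack.pop()
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return sorted(
        ident for db, ident in seen
        if db == target_db and (db, ident) != source
    )


index = build_index(LINKS)
chebi_ids = map_identifier(index, ("CAS", "58-08-2"), "ChEBI")
```

With the rows above, a measurement reported with the CAS registry number can be matched to a pathway node annotated with a ChEBI identifier, even though no row links the two directly.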

However, identifier mapping for metabolites is non-trivial, and I won't go into details in this post. Instead, I will stick to the data sources behind the mapping databases that I have been releasing under the CCZero waiver on Figshare. When I took over the building of these databases, they used data from the Human Metabolome Database (doi:10.1093/nar/gks1065). They still do. However, I added ChEBI (doi:10.1093/nar/gkv1031) and Wikidata as further data sources. The latter I need to support people working with, for example, KNApSAcK (doi:10.1093/pcp/pct176).

So, this weekend I released a new mapping database, based on HMDB 3.6, ChEBI 142, and data from Wikidata from September 7. Here are the total numbers of identifiers, and the changes compared to the June release, for the supported identifier databases:

Number of ids in Kd (KEGG Drug): 2013 (unchanged)
Number of ids in Cks (KNApSAcK): 4357 (unchanged)
Number of ids in Ik (InChIKey): 52337 (unchanged)
Number of ids in Ch (HMDB): 41520 (6 added, 0 removed -> overall changed +0.0%)
Number of ids in Wd (Wikidata): 22648 (195 added, 10 removed -> overall changed +0.8%)
Number of ids in Cpc (PubChem-compound): 30699 (154 added, 36 removed -> overall changed +0.4%)
Number of ids in Lm (LIPID MAPS): 2611 (unchanged)
Number of ids in Ce (ChEBI): 131580 (4 added, 6 removed -> overall changed -0.0%)
Number of ids in Ck (KEGG Compound): 15968 (unchanged)
Number of ids in Cs (Chemspider): 24948 (10 added, 2 removed -> overall changed +0.0%)
Number of ids in Wi (Wikipedia): 4906 (unchanged)
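A summary line like the ones above can be derived by comparing the identifier sets of two releases. The following Python sketch shows one way to do that; it is not the script used to produce the numbers in this post. The function name is hypothetical, and taking the percentage relative to the previous release's total is an assumption.

```python
def summarize_change(previous_ids, current_ids):
    """Describe how a set of identifiers changed between two releases."""
    added = current_ids - previous_ids
    removed = previous_ids - current_ids
    total = len(current_ids)
    if not added and not removed:
        return f"{total} (unchanged)"
    if not previous_ids:
        # No earlier release to compare against, so no percentage is given.
        return f"{total} ({len(added)} added, 0 removed)"
    # Assumption: the change is expressed relative to the previous total.
    pct = 100.0 * (total - len(previous_ids)) / len(previous_ids)
    return (
        f"{total} ({len(added)} added, {len(removed)} removed"
        f" -> overall changed {pct:+.1f}%)"
    )


# Toy data, not taken from any real release.
previous = {"a", "b", "c", "d"}
current = {"b", "c", "d", "e", "f"}
summary = summarize_change(previous, current)
```

Because the percentage is rounded to one decimal, a small net change on a large database shows up as +0.0% or -0.0%, as it does for HMDB and ChEBI in the list above.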

An overview of recent releases (I'm trying to keep a monthly schedule) can be found here and the version I released this weekend has doi:10.6084/m9.figshare.3817386.v1.