Monday, January 02, 2017

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers on Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mapping files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as a data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I have opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.
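The matching step can be sketched in a few lines of Python. This is not the Bioclipse script itself, just a minimal illustration under a few assumptions: the Wikidata items are already available as (Q-code, InChIKey) pairs, the TSV column positions (`inchikey_col`, `dtxsid_col`) are hypothetical and need to be adjusted to the actual Figshare file, and the output fields are separated by tabs. The function name `quickstatements_lines` is made up for this example; the line format follows the QuickStatements example given below.

```python
import csv
import io


def quickstatements_lines(wikidata_items, tsv_text, inchikey_col=0,
                          dtxsid_col=1, source_item="Q28061352"):
    """Yield one QuickStatements line per exact InChIKey match.

    wikidata_items: iterable of (qid, inchikey) pairs, e.g. from SPARQL.
    tsv_text: content of the mapping file; the column positions are
    assumptions and must match the real file layout.
    """
    # Index the Wikidata items by InChIKey for fast exact-string lookup.
    qid_by_inchikey = {inchikey: qid for qid, inchikey in wikidata_items}

    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) <= max(inchikey_col, dtxsid_col):
            continue  # skip empty or incomplete lines
        inchikey = row[inchikey_col].strip()
        dtxsid = row[dtxsid_col].strip()
        qid = qid_by_inchikey.get(inchikey)
        if qid is not None:
            # P3117 = DSSTOX substance identifier, S248 = "stated in" as source
            yield f'{qid}\tP3117\t"{dtxsid}"\tS248\t{source_item}'
```

Only exact string matches produce a line; InChIKeys that occur in just one of the two resources are silently skipped.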

This output is formatted as QuickStatements instructions, a great tool set up by Magnus Manske. Each line looks like this (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352

The P248 ("stated in") property is used to link the source (hence: S248) information as reference, which points to the Q28061352 item, which is for the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:

I entered about 36 thousand such statements into Wikidata. Thus, the yield is about 5%, calculating from the CompTox Dashboard as starting point with about 720 thousand identifiers. From a Wikidata perspective, the yield is higher. There are about 150 thousand items with an InChIKey, so that 24% could be mapped.
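As a quick check of both percentages, using the rounded counts from the text (the helper name `mapping_yield` is only for illustration):

```python
def mapping_yield(mapped, total):
    """Return the fraction of `total` entries that could be mapped."""
    return mapped / total


MAPPED = 36_000              # statements entered into Wikidata (approx.)
DASHBOARD_IDS = 720_000      # CompTox Dashboard identifiers (approx.)
WIKIDATA_INCHIKEYS = 150_000 # Wikidata items with an InChIKey (approx.)

print(f"From the Dashboard side: {mapping_yield(MAPPED, DASHBOARD_IDS):.0%}")
print(f"From the Wikidata side:  {mapping_yield(MAPPED, WIKIDATA_INCHIKEYS):.0%}")
```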

Based on the constraints defined on the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can only have one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Because of how Wikidata works, however, isolated exceptions can occur. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).
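The two constraints are easy to mimic locally. The sketch below is not how Wikidata implements its constraint reports; it only illustrates the idea on a list of (Q-code, DSSTOX identifier) pairs, and the function name and example data are hypothetical:

```python
from collections import defaultdict


def constraint_violations(statements):
    """Check (qid, dtxsid) pairs against the two constraints.

    Returns two dicts:
    - items that have more than one identifier (single-value violations)
    - identifiers used on more than one item (uniqueness violations)
    """
    ids_per_item = defaultdict(set)
    items_per_id = defaultdict(set)
    for qid, dtxsid in statements:
        ids_per_item[qid].add(dtxsid)
        items_per_id[dtxsid].add(qid)

    single_value = {qid: sorted(ids)
                    for qid, ids in ids_per_item.items() if len(ids) > 1}
    unique_value = {dtxsid: sorted(qids)
                    for dtxsid, qids in items_per_id.items() if len(qids) > 1}
    return single_value, unique_value


# Made-up example data: Q2 has two identifiers, DTXSID3 sits on two items.
example = [("Q1", "DTXSID1"), ("Q2", "DTXSID2"), ("Q2", "DTXSID3"),
           ("Q3", "DTXSID3")]
single, unique = constraint_violations(example)
print(single)
print(unique)
```

With fewer than 25 violations on about 36 thousand statements, the violation rate is below 0.1%, which is where the >99.9% comes from.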

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey. A possible reason for that is that the item does not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of these two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it possible to easily make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, like this one for the dipole moment:

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.
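A query along these lines could look as follows, wrapped in Python so the request URL for the Wikidata SPARQL endpoint can be built. This is a sketch and not necessarily the exact query used for this post: it assumes P2201 is the Wikidata property for the electric dipole moment, and it relies on the `wdt:`, `p:`, `ps:`, `pr:` and `prov:` prefixes that the Wikidata Query Service predefines. The source is picked up from the "stated in" (P248) reference on the dipole moment statement.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Compounds with a DSSTOX substance identifier (P3117) and a dipole moment
# (assumed here to be property P2201), plus the source of that value.
QUERY = """
SELECT ?compound ?dtxsid ?dipole ?source WHERE {
  ?compound wdt:P3117 ?dtxsid ;
            p:P2201 ?statement .
  ?statement ps:P2201 ?dipole .
  OPTIONAL {
    ?statement prov:wasDerivedFrom ?reference .
    ?reference pr:P248 ?source .
  }
}
"""


def query_url(query=QUERY, endpoint=ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


print(query_url())
```

The printed URL can be opened in a browser or fetched with any HTTP client; the query itself can also be pasted directly into the Wikidata Query Service.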