Sunday, January 20, 2019

Updated HMDB identifier scheme #2: Wikidata updated

About a year ago the HMDB changed their identifier scheme: they added two digits to accommodate more metabolites. They had basically run out of identifiers. This weekend I updated the HMDB identifiers in Wikidata, so that they are all in the new format, removing a lot of secondary (old) identifiers. The process was simple, combining Bioclipse, the Wikidata Query Service, and QuickStatements:

  1. use SPARQL to find all HMDB identifiers of length 9
  2. make QuickStatements to remove the old identifier and add the new identifier
  3. run the QuickStatements
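
The first two steps can be sketched in a few lines of Python. This is not the Bioclipse script used for the actual update, just an illustration under some assumptions: that P2057 is the Wikidata property for HMDB IDs, that old identifiers are "HMDB" plus five digits and new ones are the same number zero-padded to seven digits, and that QuickStatements accepts tab-separated lines where a leading "-" removes a statement. The query string, the function names, and the item ID in the usage note are made up for the example.

```python
import re

# Assumption: P2057 is the Wikidata property for HMDB IDs.
HMDB_PROPERTY = "P2057"

# Step 1: a query along these lines could be pasted into the Wikidata
# Query Service to list the old-style identifiers (it is not run here).
OLD_ID_QUERY = """
SELECT ?compound ?hmdb WHERE {
  ?compound wdt:P2057 ?hmdb .
  FILTER(STRLEN(?hmdb) = 9)
}
"""

# Old format: "HMDB" followed by exactly five digits (nine characters).
OLD_ID = re.compile(r"HMDB(\d{5})")


def to_new_hmdb_id(old_id):
    """Pad the numeric part of an old HMDB ID from five to seven digits."""
    match = OLD_ID.fullmatch(old_id)
    if match is None:
        raise ValueError(f"not an old-style HMDB identifier: {old_id!r}")
    return "HMDB" + match.group(1).zfill(7)


def quickstatements_for(qid, old_id):
    """Step 2: one removal line and one addition line for a single item."""
    new_id = to_new_hmdb_id(old_id)
    return [
        f'-{qid}\t{HMDB_PROPERTY}\t"{old_id}"',  # remove the old identifier
        f'{qid}\t{HMDB_PROPERTY}\t"{new_id}"',   # add the new identifier
    ]
```

Calling `quickstatements_for("Q1", "HMDB00123")`, with `Q1` standing in for a real compound item, gives two lines per compound that can be pasted into QuickStatements for step 3.
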

Figure: QuickStatements website with the first 10 statements to update the HMDB identifiers for 5 Wikidata compounds.

I ran the statements in batches, which let me keep track of the progress. Some reflections: quite a few references on the statements got lost. The previous HMDB identifiers were often sourced from ChEBI, but the new identifiers do not come from there; they are sourced from Wikidata, and adding "stated in" "Wikidata" did not make sense to me. Another thought is that it would have been nice to combine the removal and addition in one edit, but since they are executed right after each other, the version control will keep them together anyway.

The Bioclipse script can be found here.

