May
18
Comparing sets of identifiers: the Bioclipse implementation
The problem
That sounds easy: take two collection of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, within one set identifiers from multiple databases. Mind you, I'm not going full monty, though some chemistry will be involved at some point. Instead, this post is really based on identifiers.
The example
Data set 1:
Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.
The goal
Determine the interaction of two collections of identifiers from arbitrary databases, ultimately using scientific lenses.
That sounds easy: take two collection of identifiers, put them in sets, determine the intersection, done. Sadly, each collection uses identifiers from different databases. Worse, within one set identifiers from multiple databases. Mind you, I'm not going full monty, though some chemistry will be involved at some point. Instead, this post is really based on identifiers.
The example
Data set 1:
Data set 2: all metabolites from WikiPathways. This set has many different data sources, and seven provide more than 100 unique identifiers. The full list of metabolite identifiers is here.
The goal
Determine the interaction of two collections of identifiers from arbitrary databases, ultimately using scientific lenses.