chem-bla-ics: BridgeDb NWO grant update #3: Pandora's box

Overview of BridgeDb git repositories.

We (Denise, Martina, Tooba (funded by the FAIRplus project), Helena, and I) really picked up momentum with the NWO Open Science grant for BridgeDb. On the right you can see the activity in the BridgeDb Organisation on GitHub: at least nine repositories have seen activity in the past two weeks. Some updates:

Tooba released new gene/protein ID mapping databases, e.g. for Human
Helena has been working on the BridgeDb Java code base, refactoring more code for maintainability
there is a new Bioconductor release with an updated BridgeDbR package
we worked on secondary identifiers

That said, I have not been able to make a new release of the BridgeDb Java library. I want to give the upcoming BridgeDb Java 3.0.13 version a bit more testing. For example, Marvin and I have not been able to get Docker working with this version yet.

Secondary identifiers

Some years ago we had a Google Summer of Code student learning about open source development on a project to develop an API for secondary identifiers. Many databases do critical curation work on knowledge from literature, where journal articles often contain incomplete or even wrong data. When a database corrects the scientific record, this often means updating the record and, when that is done right, creating a new database identifier. There are other reasons too. For example, a database sometimes runs out of identifiers and then even changes the full identifier scheme. So, we need a mechanism to say that some identifier is outdated (for some reason) and, optionally, which identifier should likely be used instead.

Tooba has been working on this, and last week we made a good step forward: we updated the BridgeDb Java library so that it works nicely together with the BridgeDbR package (development release), which now adds information about whether an identifier is primary or secondary:

> library(BridgeDbR)
> hgnc = BridgeDbR::loadDatabase("hgncSymbol.bridge")
> BridgeDbR::map(hgnc, "H", "TIC1")
          source identifier target mapping isPrimary
H:TIC1:F        H       TIC1      H    TIC1         F
H:SPOCK1:T      H       TIC1      H SPOCK1         T
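
With a result shaped like this, picking out the primary identifier could look roughly like the sketch below. It does not call BridgeDbR: the data frame is rebuilt by hand from the output above, and it assumes that isPrimary holds the text values "T" and "F", which may differ in the development release.

```r
# Hand-built copy of the mapping result printed above; in practice this
# would come from BridgeDbR::map(). Assumption: isPrimary is "T"/"F" text.
mappings <- data.frame(
  source     = c("H", "H"),
  identifier = c("TIC1", "TIC1"),
  target     = c("H", "H"),
  mapping    = c("TIC1", "SPOCK1"),
  isPrimary  = c("F", "T"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose mapped identifier is flagged as primary.
primary <- mappings[mappings$isPrimary == "T", ]

# The identifier we queried with is secondary if it is not among them.
queried_is_secondary <- !("TIC1" %in% primary$mapping)

print(primary$mapping)       # "SPOCK1"
print(queried_is_secondary)  # TRUE
```

So for the outdated symbol TIC1, the row flagged as primary points to SPOCK1 as the identifier to use instead.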

BridgeDb Hackathon

This development allows Denise and me to finish this for metabolite identifiers too (ChEBI has a lot of secondary ID information). And this brings me to an end for now. We're slowly heading towards the holiday and conference season. Upcoming for this project is a two-day hackathon with the full team and more. It's scheduled for July 7/8 in or around Maastricht. Details will follow asap. The focus will be on this NWO grant's goals, but if other NWO Open Science projects are interested in joining, please email me.

Oh, what about Pandora's box? Well, every step we take now highlights things we want to fix or improve. We continuously run into design decisions that we want to change. For example, BridgeDb supports attributes (like names, SMILES, etc.) but they are only linked to the central identifier, and mapping is not currently applied before looking up the attributes: only when you use the source database identifier do you get attributes returned. Well, this is what happens when you're the Dr. Who of a project: there are a lot of unspoken features in scientific code. Fortunately, Open Science is ideal here and you have all the gory details of the code base when you start dissecting it. Some original design decisions simply don't work anymore. The design was not bad; it's just that the expectations have changed.
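
To make the attribute example concrete, here is a toy sketch in plain R. It does not use the BridgeDb API: the two tables, the identifiers (C1, X7, Y3), the attribute value and both helper functions are made up for illustration only. It contrasts a direct lookup with one that first maps to the central identifier.

```r
# Toy tables; every identifier and name here is hypothetical.
links <- data.frame(
  central = c("C1", "C1"),
  other   = c("X7", "Y3"),
  stringsAsFactors = FALSE
)
attributes_tbl <- data.frame(
  central = "C1",
  name    = "example compound",
  stringsAsFactors = FALSE
)

# Behaviour as described above: attributes are looked up directly,
# so only the central identifier returns anything.
lookup_direct <- function(id) {
  attributes_tbl$name[attributes_tbl$central == id]
}

# Possible alternative: map to the central identifier first, then look up.
lookup_mapped <- function(id) {
  central <- unique(c(id[id %in% attributes_tbl$central],
                      links$central[links$other == id]))
  attributes_tbl$name[attributes_tbl$central %in% central]
}

print(lookup_direct("X7"))  # character(0)
print(lookup_mapped("X7"))  # "example compound"
```

Whether mapping should happen implicitly like this, or stay an explicit step for the user, is exactly the kind of design decision we keep running into.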

This highlights the importance of NWO Open Science grants: open research output needs maintenance. Ask a librarian, they know how this works too.

Sunday, May 15, 2022
