Sunday, January 24, 2021

new: "A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses"

Figure 1 from the BMC Biology paper.
A quick bit of history first: 1. I have been contributing to and using Wikidata for some time now, e.g. to support the BridgeDb and WikiPathways projects, and using projects like the Chemistry Development Kit and Bioclipse; 2. I have been working from home more or less since March 6th (some periods quarantined); 3. I was freaking out, like many others; and 4. I tried to figure out how to contribute to our response to SARS-CoV-2, leading to a draft WP4846, which you can read about in this paper. I recently blogged more details of this history.

One of the things we needed when talking about SARS-CoV-2 was identifiers. They are a key aspect of interoperability and reuse, and they could then be used to annotate literature. It turned out that UniProt did not have identifiers for the virus proteins yet (understandable), so I turned to Wikidata. I quickly noticed I was not the only one with these ideas:

Tiago Lubiano (a PhD candidate to follow) had already done some work. Now, Andra Waagmeester has been writing Wikidata bots for some years (leading also to the above-linked eLife paper), and I asked him about a bot. One thing led to another, and Andra saw the opportunity to work out something he had been arguing with me about for a long time: we need a clear protocol for adding information to Wikidata. Of course, he insisted on using shape expressions (ShEx).

By the time of the spring ELIXIR Virtual COVID19 BioHackathon (April), the prototype was ready (April 7): in just two weeks we had basic shapes, a bot, and the first new protein identifiers in Wikidata with references (where possible, also thanks to a pre-release of UniProt with planned virus protein identifiers). And not just for SARS-CoV-2, but for all human coronaviruses: that's the power of FAIR data and automation. It scales up. Nothing prevents us from adding protein and gene identifiers to Wikidata for whole families of viruses, with the shapes providing automated peer review.
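To make the idea of shapes as automated peer review a bit more concrete, here is a minimal sketch in plain Python. It is not ShEx and not the code from the paper: the `PROTEIN_SHAPE` dictionary, the `check_item` function, and both example items are hypothetical and only illustrate the principle of checking every item against the same explicit expectations. The only real identifiers are the Wikidata properties P31 (instance of), P703 (found in taxon), and P352 (UniProt protein ID), and the item Q8054 (protein).

```python
# Hypothetical, heavily simplified "shape": each required property maps to a
# predicate that every value of that property must satisfy. Real ShEx is far
# more expressive; this only mimics the idea.
PROTEIN_SHAPE = {
    "P31": lambda v: v == "Q8054",                        # instance of: protein
    "P703": lambda v: v.startswith("Q"),                  # found in taxon: some item
    "P352": lambda v: v.isalnum() and len(v) in (6, 10),  # rough UniProt accession check
}


def check_item(claims, shape=PROTEIN_SHAPE):
    """Return a list of problems; an empty list means the item conforms."""
    problems = []
    for prop, is_valid in shape.items():
        values = claims.get(prop, [])
        if not values:
            problems.append(f"{prop}: missing")
            continue
        for value in values:
            if not is_valid(value):
                problems.append(f"{prop}: unexpected value {value!r}")
    return problems


# Illustrative items only: the taxon QID and the accessions are placeholders.
good = {"P31": ["Q8054"], "P703": ["Q0000"], "P352": ["P12345"]}
bad = {"P31": ["Q8054"], "P352": ["not-an-id"]}

print(check_item(good))
print(check_item(bad))
```

Because the same check runs on every item a bot creates, adding a whole family of viruses costs no extra review effort, which is the scaling argument made above.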

Over the summer we tweaked things; after the reviews from BMC Biology we improved and fine-tuned things even further, and the result can now be read online. Now it is up to me to start applying this protocol to my other work that adds information to Wikidata too.
