Sunday, November 01, 2020

CiTO updates #2: annotation migration to Wikidata and first Scholia patch

Around the time of the editorial about the Journal of Cheminformatics Citation Typing Ontology (CiTO) Pilot, I had already worked out a model to add CiTO annotation to Wikidata. This is what it looks like for the first research article with annotations:

Screenshot of CiTO annotation of references in Wikidata:

At the time I also wrote some SPARQL queries against Wikidata to summarize the current use. There are, for example, at this moment 128 CiTO annotations in Wikidata (using the above model). The citation intention "uses method in cited work" is currently the most common, and 20 journals now have one or more articles with CiTO annotation, with the Journal of Cheminformatics having the most (no surprise).
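For illustration, such a summary query can be assembled in Python. The "cites work" property (P2860) is real, but the qualifier property ID used here for the citation intention is a placeholder, not the actual one from the model:

```python
# Sketch of a SPARQL query summarizing CiTO annotation use in Wikidata.
# P2860 is Wikidata's "cites work" property; CITO_QUALIFIER is a
# placeholder for the qualifier property that records the citation
# intention (the real property ID is not reproduced here).

CITES_WORK = "P2860"
CITO_QUALIFIER = "P0000"  # placeholder: the actual intention qualifier

def intention_count_query(cites=CITES_WORK, qualifier=CITO_QUALIFIER):
    """Return a SPARQL query counting citations per CiTO intention."""
    return f"""
SELECT ?intention ?intentionLabel (COUNT(*) AS ?count) WHERE {{
  ?citing p:{cites} ?statement .
  ?statement ps:{cites} ?cited ;
             pq:{qualifier} ?intention .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
GROUP BY ?intention ?intentionLabel
ORDER BY DESC(?count)
"""

print(intention_count_query())
```

A real run would send this string to the Wikidata Query Service endpoint; here it only shows the shape of the query.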


Next up is to enrich Scholia. This may be a bit tricky, with the annotation not yet being very abundant. However, I have started a patch (work in progress) to show CiTO information. The first step is an extension to the venue aspect, here in action (locally) for the Journal of Cheminformatics:

Scholia page being developed that shows which CiTO types are being used for J. Cheminform. at this moment, with 'updates the cited work' as the most common annotation of articles citing J. Cheminform. articles.

What we learn from this bubble graph is that, at this moment, 'updates the cited work' is the most common annotation of articles citing J. Cheminform. articles. Similar pages will have to be developed for works, authors, etc.

This Scholia work, by the way, was funded by the Alfred P. Sloan Foundation under grant number G-2019-11458.

CiTO updates #1: first research paper in the Journal of Cheminformatics with CiTO annotation published

After a period of exploring technical needs, ideas, and plans, the Journal of Cheminformatics launched its Citation Typing Ontology (CiTO) Pilot this summer (doi:10.1186/s13321-020-00448-1). I am very excited about this, because CiTO tells us why we are citing literature. We are a very long way from publishing industry adoption, but we have to start somewhere. A few weeks ago, Laeeq Ahmed et al. published the first research article with CiTO annotation of its references ("Predicting target profiles with confidence as a service using docking scores")!

Title, author list, and part of the abstract of the article by Ahmed et al.


Of course, I also have to show a screenshot of what the annotation actually looks like, so here goes:

References 21-23 from the article, showing the CiTO annotation.

Thanks to the authors for adding these annotations!

Saturday, October 31, 2020

SARS-CoV-2, COVID-19, and Open Science

The pathway WP4846 that I started on March 16; it will see a massive overhaul in the next weeks.
Voices are getting stronger about how important Open Science is. Insiders have known the advantages for decades. We also know the issues in the transition, but the transition has been steady. Contributing to Open Science is simple: there are plenty of projects where you can contribute without jeopardizing your own research (funding or prestige). My own small contributions have been made without funding too. But I needed to do something. I have been mostly self-quarantined since March 6, with only very few exceptions. And I'm so done with it, like so many other people. But that won't stop me from wearing masks when I go shopping (etc.).

Reflecting on the past eight months, the last two in particular have been tough. It's easier to sit at home and in the garden when it is warm and light outside. But for another 7 weeks or so, the days will only get darker. The past two months were also so busy with grant reporting that I did not get around to much else, even with an uncommonly long stretch of long working weeks: about 8 weeks of 70-80 hours of active work in that period. In fact, in the past two weeks, with most of the deadlines past, I had a physical reset and was happy if I made 40 hours a week.

So, where is my COVID-19 work now, and where is it going?

Molecular Pathways

First, what did we achieve? Leveraging the Open Science community I am involved in, I started collaborating: with old friends, and making new friends. I was delighted to see I was not the only one. In fact, somewhere in May/June I had to give up following all Open Science around COVID-19, because there was too much.

For example, I was not the only one wanting to describe our slowly developing molecular knowledge of the SARS-CoV-2 virus. While my pathway focused specifically on the confirmed processes for SARS-CoV-2, my colleague Freddie digitized a recent review about other coronaviruses. Check out her work: WP4863, WP4864, WP4877, WP4880, and WP4912. In fact, so much was done by so many people in such a short time that the WikiPathways COVID-19 Portal was set up.

Further reading:
  • Ostaszewski M. COVID-19 Disease Map, a computational knowledge repository of SARS-CoV-2 virus-host interaction mechanisms. bioRxiv. 2020 Oct 28. doi:10.1101/2020.10.26.356014v1 (and unversioned doi:10.1101/2020.10.26.356014)
  • Ostaszewski M, Mazein A, Gillespie ME, Kuperstein I, Niarakis A, Hermjakob H, et al. COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms. Sci Data. 2020 May 5;7(1):136. doi:10.1038/s41597-020-0477-8
Interoperability with Wikidata

Because I see an essential role for Wikidata in Open Science, and because regular databases did not provide identifiers for the molecular building blocks, we created them in Wikidata. This was essential, because I wanted to use Scholia (see screenshot on the right) to track the research output (something that has by now become quite a challenge; by the way, check out Lauren's tutorial on this). This too was still in March. However, because Scholia itself is a general tool, I needed shortlists of all SARS-CoV-2 genes, all proteins, etc. So, I created this book. It's autogenerated and auto-updated by taking advantage of SPARQL queries against Wikidata. And I am so excited the book has been translated into Japanese, Portuguese, and Spanish. The i18n work is thanks to the virtual BioHackathon in April, where Yayamamo bootstrapped the framework to localize the content.
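To give an idea of the auto-generation, here is a minimal Python sketch that renders query results as Markdown. The example row and its QID are placeholders; a real run would fetch the rows with a SPARQL query against the Wikidata endpoint:

```python
# Minimal sketch of the auto-generation idea: turn (label, Wikidata QID)
# results from a SPARQL query into a Markdown section for the book.

def to_markdown(title, rows):
    """Render query results as a Markdown bullet list linking to Wikidata."""
    lines = [f"## {title}", ""]
    for row in rows:
        lines.append(
            f"* [{row['label']}](https://www.wikidata.org/wiki/{row['qid']})"
        )
    return "\n".join(lines)

# Hypothetical stand-in for real query results (label and QID are fake).
proteins = [
    {"label": "example protein", "qid": "Q0000"},
]
print(to_markdown("SARS-CoV-2 proteins", proteins))
```

Because the input is a SPARQL result set, rerunning the script keeps the book in sync with Wikidata.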

Also during that BioHackathon, we started a collaboration with Complex Portal's Birgit, because the next step was to have identifiers for (bio)molecular complexes. This work is still ongoing, but using a workaround we developed for WikiPathways (because complexes in GPML currently cannot have identifiers), we can now link out to Complex Portal, as visible in this screenshot:

The autophagy initiation complex has the CPX-373 identifier in Complex Portal.

Joining the Wikidata effort is simple: just visit Wikidata:WikiProject_COVID-19 and find your thing of interest. Because the past two months have been so crowded, I still did not get around to exploring the kg-covid-19 project, but it sounds very interesting too!

Further reading:
  • Waagmeester A, Stupp G, Burgstaller-Muehlbacher S, Good BM, Griffith M, Griffith OL, et al. Wikidata as a knowledge graph for the life sciences. eLife. 2020 Mar 17;9:e52614. doi:10.7554/eLife.52614
  • Waagmeester A, Willighagen EL, Su AI, Kutmon M, Labra Gayo JE, Fernández-Álvarez D, et al. A protocol for adding knowledge to Wikidata, a case report. bioRxiv. 2020 Apr 7. doi:10.1101/2020.04.05.026336
Computer-assisted data curation

For some years now, I have been working on computer-assisted data curation of WikiPathways, but also of Wikidata (for chemical compounds). Once your biological knowledge is machine readable, you can teach machines to recognize common mistakes. Some are basically simple checks, like missing information. But it gets exciting if we take advantage of linked data, so that machines can check consistency between two or more resources. The better our annotation, the more powerful this computer-assisted data curation becomes. Chris has been urging me to publish this, but I haven't gotten around to it yet.

As part of my COVID-19 work, I have started making curation reports for specific WikiPathways pathways. To enable this, I worked out how to reuse the tests without JUnit, allowing them to be used as a library. That allows creating the reports, but in the future it will also allow use directly in PathVisio. A second improvement to the testing stack is that tests are now more easily annotated, which allows specifying that tests should only run for a certain WikiPathways portal.
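The general shape of that testing stack can be sketched in Python (the real tests are Java); the decorator name, portal tag, and report format here are made up for illustration:

```python
# Sketch of tests-as-a-library with portal annotations: tests register
# themselves in a plain list, so any caller (report generator, PathVisio)
# can run them without a test runner.

REGISTRY = []

def portal_test(portals):
    """Register a test function, annotated with the portals it applies to."""
    def wrap(fn):
        REGISTRY.append((fn, set(portals)))
        return fn
    return wrap

@portal_test(["covid19"])
def missing_literature(pathway):
    """Flag pathways without any literature reference."""
    return [] if pathway.get("references") else ["no literature references"]

def run_tests(pathway, portal):
    """Run all registered tests that apply to the given portal."""
    report = []
    for fn, portals in REGISTRY:
        if portal in portals:
            report.extend(fn(pathway))
    return report

print(run_tests({"id": "WP4846", "references": []}, "covid19"))
# → ['no literature references']
```

Because the registry is just data, the same tests can feed a markdown curation report or an interactive editor.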

But a lot remains to be done. I think I have migrated only some 5% of all tests at this moment. So, this is very much on my "what is next?" list.

What is next?

There is a lot I need, want, and should do. Here are some ideas. Maybe you want to beat me to it. Really, I don't mind being scooped when it comes to public health. Here goes:
  1. file SARS-CoV-2 book translation update requests for some recent updates
  2. update the SARS-CoV-2 book with a list of important SNPs
  3. add Bioschemas annotation to the SARS-CoV-2 book for individual proteins, genes, etc
  4. update WP4846 with recent literature
  5. have another 'main subject' annotation round for SARS-CoV-2 proteins
  6. migrate more pathways tests from JUnit into the testing library
  7. write a new test to detect preprints in pathway literature lists and check for journal article versions
  8. finish the Dutch translation of the SARS-CoV-2 book
  9. write a tool to recognize WikiPathways complexes with matches in Complex Portal
  10. write a tool to generate markdown for any WikiPathways with curation suggestions based on content in other resources
  11. develop a few HTML+JavaScript pages to summarize WikiPathways COVID-19 Portal content
Am I missing anything? Tweet me or leave a comment here.
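One of these, detecting preprints in literature lists (item 7), could start as simply as a DOI-prefix check. A minimal Python sketch, using reference strings from this post as sample input (10.1101 is the DOI prefix used by bioRxiv and medRxiv):

```python
import re

# Spot likely preprints in a literature list by their DOI prefix.
PREPRINT_DOI = re.compile(r"\b10\.1101/[^\s;,]+")

def find_preprints(references):
    """Return the DOIs in the reference strings that look like preprints."""
    hits = []
    for ref in references:
        hits.extend(PREPRINT_DOI.findall(ref))
    return hits

refs = [
    "Ostaszewski M. COVID-19 Disease Map. bioRxiv. 2020; 10.1101/2020.10.26.356014",
    "Waagmeester A, et al. eLife. 2020;9:e52614. 10.7554/eLife.52614",
]
print(find_preprints(refs))  # → ['10.1101/2020.10.26.356014']
```

Each hit would then need a lookup (e.g. via Crossref or Wikidata) to check whether a journal article version exists.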

Saturday, October 24, 2020

new paper: "A Semi-Automated Workflow for FAIR Maturity Indicators in the Life Sciences"

Figure 1 from the Nanomaterials article.
In a collaboration orchestrated via the NanoSafety Cluster (NSC) Work Group F on Data Management (WGF was formerly known as WG4) between a few H2020 projects (NanoCommons, Gov4Nano, RiskGONE, and NanoSolveIT), we just published an article about using Jupyter Notebooks to assess how FAIR (see doi:10.1162/dint_r_00024) several databases are.

It is important to realize this is not meant to judge these databases, but to provide them with a map of how they can make their databases more FAIR. After all, the notebook explains in detail how the level of FAIRness was assessed and what the databases can do to become more "mature". This is what the maturity indicators are about. In doing so, we also discovered that existing sets of maturity indicators do not always benefit the community, often because they currently focus more on the F and the A than on the I and the R (see What is wrong with FAIR today). Another really neat feature is the visual representation of the map, proposed by Serena (now at Transparent Musculoskeletal Research) in this paper (shown at the top right).

I would like to thank everyone involved in the project, the NSC projects involved in the discussions (Joris Quik, Martine Bakker, Dieter Maier, Iseult Lynch), Serena for starting this work (see this preprint), Laurent for reviewing the notebook, and Ammar and Jeaphianne for their hard work on finishing the paper into this now published revision (the original paper was rejected).

Navigating the academic system. Or, will I face an evolution or a revolution?

Image from the time of the Batavian Revolution.
Img: Rijksmuseum, public domain.
When I started this blog, I was still a PhD student myself. Now I am an assistant professor ("universitair docent" in Dutch), tenured, and I have acquired funding to fund other researchers. I have not gotten used to this yet. I don't like this hierarchical system, but I am forced to get used to it. So, when I write "new paper", the paper is no longer "mine". I'm more like a sponsor, attempting to give the researchers who work "for" me advice, but also the opportunity to do their own thing.

Of course, this is complicated when the funding comes from grants, where the person I start collaborating with is greatly reduced in their academic freedom. Talking about navigating: balancing scientific impact and project deliverables requires a good amount of flexibility and creativity.

There is a lot happening. The system is broken. Basically, there is no upper limit, and as long as selection is based on volume and not quality (#blasphemy), more is better. So, what do I tell and teach the people working on my grants, those who are under my supervision? Do I teach them how to succeed in the academic system, or do I teach them how to do good research? Ideally, those would be the same. But they are not. The system is broken.

The example of this is well known: publishing in journals with a high impact factor. That has been seen for at least two decades as a measure of quality and has therefore long been used in assessments, at many levels. Of course, the quality of any paper is not a function of the place where it gets published. It can be argued that research is done better for higher impact journals, but I have yet to see the data that confirms that. But there is a far more worrying thing, which exactly proves my point that volume is not the same as quality: apparently, it is acceptable to submit inferior work to journals with a lower impact factor. The system is broken.

The system is breaking down. People are looking for solutions. Some solutions are questionable. Some good solutions are not taken, because they are problematic in the short term. But one way or another, the system is broken and must be fixed. I hope it can go via evolution, but when the establishment is fighting the evolution, a revolution may be the only option left.

Here are some possible evolutions and some revolutions people are talking about.


Possible evolutions:
  1. all research output will get recognized and rewarded
  2. people will stop using fake measures of success like the journal impact factor
  3. journal articles will actually describe the experiment in detail (reproducibility)

Possible revolutions:
  1. the journal will stop to exist, and the venue will be the research output itself
  2. we start recognizing and rewarding research output instead of researchers
  3. research will no longer be a competition for funding


Now, will either happen? In the whole discussion about, for example, cOAlition S, people are hiding behind "but then ..., because in X ...". For example, they argue that careers are damaged if people do not publish in high impact journals. This argument values researchers over research (output). As if this person would not find a job without that article. As if the research itself has no value. Sadly, this is partly true. Following this reasoning, doing a PhD in The Netherlands damages your career too; you would be better off doing it in the USA (or Oxford, or Cambridge) in the first place.

So, what better place to give this evolution or revolution shape than in a country like The Netherlands, where the research quality (at least on average) is high and only volume is lacking to compete with the Harvards of the world. Without billions in extra funding, that volume is not going to happen. So far we have managed to compete by working 60 hours a week instead of 40. Adding another 20 hours a week is not going to happen (not without deaths; if you think that's an exaggeration, just check the facts, please).

Fortunately, The Netherlands has a good track record with revolutions, and a good track record in highly impactful research. The country has shown how strong our research is: with the National Plan Open Science and strong involvement in cOAlition S, evolution is very clearly being tried. Not all solutions work equally well, and there is strong opposition from some. Some years ago, some believed it inconceivable that Nature would start allowing Open Access, but the change is coming. Willingly? Well, with the APC they are going to charge, I cannot really say they do so willingly. Evolution, not revolution.

The real evolution/revolution we need, however, is not about open access. It is about fair and open science.