Monday, November 30, 2020

CiTO updates #3: third paper in the collection and updated Scholia patch

Last week the third paper got published in the Citation Typing Ontology Collection and this weekend I finished adding the citation annotations to Wikidata.

While the number of papers in the Journal of Cheminformatics is only slowly growing, the number of journals receiving annotated citations is growing faster: there are now 70.
The Scholia patch needed for this updated table is not online yet.

Saturday, November 28, 2020

new paper: "WikiPathways: connecting communities"

The number of revisions and contributors
for all pathways in the human pathway
analysis collection.
The last WikiPathways paper was already 3 years ago, an often used frequency for Nucleic Acids Research updates. So, time for an update, and what an update we had: WikiPathways: connecting communities (doi:10.1093/nar/gkaa1024). This update focuses on the open, collaborative nature of WikiPathways and on the growing role of the portals, like the lipids portal, the AOP portal, the nanomaterials portal, and the inborn errors of metabolism (IEM) portal. There is also a lot happening in the background, to make our tools better (much needed), our curation support better (in the future available in multiple ways), our data model better, and our dissemination even better (e.g. with Scholia/Toolforge and nanopublications). A huge thanks to Marvin and Tina for getting everything together. Finally, if you haven't recently checked the WikiPathways SPARQL endpoint, read the paper :)

Sunday, November 01, 2020

CiTO updates #2: annotation migration to Wikidata and first Scholia patch

During the time of the editorial about the Journal of Cheminformatics Citation Typing Ontology (CiTO) Pilot I already worked out a model to add CiTO annotation in Wikidata. It looks like this for the first research article with annotation:

Screenshot of CiTO annotation of references in Wikidata:

At the time I also wrote some SPARQL queries against Wikidata to summarize the current use. There are, for example, at this moment 128 CiTO annotations in Wikidata (with the above model). The citation intention "uses method in cited work" is currently the most common. And 20 journals now have one or more articles with CiTO annotation, with the Journal of Cheminformatics having the most (no surprise).
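For illustration, a minimal sketch of such a summarizing query, assuming the model above where the intention is a qualifier on the 'cites work' (P2860) statement; note that the qualifier property ID below is a placeholder, not the one actually used:

```shell
# Sketch: count CiTO annotations in Wikidata, grouped by citation intention.
# The qualifier property (pq:P0000) is a PLACEHOLDER -- substitute the
# property the model actually uses before running the query.
cat > cito-summary.rq <<'EOF'
SELECT ?intention (COUNT(?citation) AS ?citations) WHERE {
  ?citing p:P2860 ?citation .       # cites work
  ?citation pq:P0000 ?intention .   # placeholder: CiTO intention qualifier
}
GROUP BY ?intention
ORDER BY DESC(?citations)
EOF

# To run it against the Wikidata Query Service (needs network access):
# curl -sG https://query.wikidata.org/sparql \
#      --data-urlencode query@cito-summary.rq \
#      -H 'Accept: text/csv'
wc -l cito-summary.rq
```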


Next up is to enrich Scholia. This may be a bit tricky at the moment, with the annotations not yet being very abundant. However, I have started a patch (WIP, work in progress) to show CiTO information. The first step is an extension to the venue aspect, here in action (locally) for the Journal of Cheminformatics:

Scholia page being developed that shows which CiTO types are being used in J.Cheminform. at this moment, with 'updates the cited work' as the most common annotation of articles citing J.Cheminform. articles.

What we learn from this bubble graph is that, at this moment, 'updates the cited work' is the most common annotation of articles citing J.Cheminform. articles. Similar pages will have to be developed for works, authors, etc.

This Scholia work, btw, was funded by the Alfred P. Sloan Foundation under grant number G-2019-11458.

CiTO updates #1: first research paper in the Journal of Cheminformatics with CiTO annotation published

After a time of exploring technical needs, ideas, and plans, the Journal of Cheminformatics launched its Citation Typing Ontology (CiTO) Pilot this summer (doi:10.1186/s13321-020-00448-1). I am very excited about this, because the CiTO tells us why we are citing literature. We are a very long way away from publishing industry adoption, but we have to start somewhere. A few weeks ago, Laeeq Ahmed et al. published the first research article with CiTO annotation of references ("Predicting target profiles with confidence as a service using docking scores")!

Title, author list, and part of the abstract of the article by Ahmed et al.

Of course, I also have to show a screenshot of what the annotation actually looks like, so here goes:

References 21-23 from the article, showing the CiTO annotation.

Thanks to the authors for adding these annotations!

Saturday, October 31, 2020

SARS-CoV-2, COVID-19, and Open Science

WP4846, which I started on March 16. It will
see a massive overhaul in the coming weeks.
Voices are getting stronger over how important Open Science is. Insiders have known the advantages for decades. We also know the issues in the transition, but the transition has been steady. Contributing to Open Science is simple: there are plenty of projects where you can contribute without jeopardizing your own research (funding or prestige). My own small contributions have been done without funding too. But I needed to do something. I have been mostly self-quarantined since March 6, with only very few exceptions. And I'm so done with it. Like so many other people. It won't stop me from wearing masks when I go shopping (etc).

Reflecting on the past eight months, particularly the last two months have been tough. It's easier to sit at home and in the garden when it is warm and light outside. But for another 7 weeks or so, the days will only get darker. The past two months were also so busy with grant reporting that I did not get around to much else, even with an uncommonly long stretch of long working weeks, about 8 weeks of 70-80 hrs of active work in that period. In fact, in the past two weeks, with most of the deadlines past, I had a physical reset, and was happy if I made 40 hrs a week.

So, where is my COVID-19 work now, where is it going?

Molecular Pathways

First, what did we achieve? Leveraging the Open Science community I am involved in, I started collaborating: with old friends, and making new friends. I was delighted to see I was not the only one. In fact, somewhere in May/June I had to give up following all Open Science around COVID-19, because there was too much.

For example, I was not the only one wanting to describe our slowly developing molecular knowledge of the SARS-CoV-2 virus. While my pathway focused specifically on the confirmed processes for SARS-CoV-2, my colleague Freddie digitized a recent review about other coronaviruses. Check out her work: WP4863, WP4864, WP4877, WP4880, and WP4912. In fact, so much was done by so many people in such a short time that the WikiPathways COVID-19 Portal was set up.

Further reading:
  • Ostaszewski M. COVID-19 Disease Map, a computational knowledge repository of SARS-CoV-2 virus-host interaction mechanisms. bioRxiv. 2020 Oct 28; 10.1101/2020.10.26.356014v1 (and unversioned 10.1101/2020.10.26.356014)
  • Ostaszewski M, Mazein A, Gillespie ME, Kuperstein I, Niarakis A, Hermjakob H, et al. COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms. Sci Data. 2020 May 5;7(1):136 10.1038/s41597-020-0477-8
Interoperability with Wikidata

Because I see an essential role for Wikidata in Open Science, and because regular databases did not provide identifiers for the molecular building blocks, we created them in Wikidata. This was essential, because I wanted to use Scholia (see screenshot on the right) to track the research output (something that by now has become quite a challenge; btw, check out Lauren's tutorial on this). This too was still in March. However, because Scholia itself is a general tool, I needed shortlists of all SARS-CoV-2 genes, all proteins, etc. So, I created this book. It's autogenerated and auto-updated by taking advantage of SPARQL queries against Wikidata. And I am so excited the book has been translated into Japanese, Portuguese, and Spanish. The i18n work is thanks to the virtual BioHackathon in April, where Yayamamo bootstrapped the framework to localize the content.
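As a sketch of how such a shortlist query looks (assuming the items are modelled with 'instance of' (P31) protein (Q8054) and 'found in taxon' (P703) pointing at SARS-CoV-2 (Q82069695); verify against the queries the book actually uses):

```shell
# Sketch: a shortlist of SARS-CoV-2 proteins from Wikidata, based on the
# assumed model described above.
cat > sarscov2-proteins.rq <<'EOF'
SELECT ?protein ?proteinLabel WHERE {
  ?protein wdt:P31 wd:Q8054 ;        # instance of: protein
           wdt:P703 wd:Q82069695 .   # found in taxon: SARS-CoV-2
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
EOF

# To run it against the Wikidata Query Service (needs network access):
# curl -sG https://query.wikidata.org/sparql \
#      --data-urlencode query@sarscov2-proteins.rq \
#      -H 'Accept: text/csv'
wc -l sarscov2-proteins.rq
```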

Also during that BioHackathon, we started a collaboration with Complex Portal's Birgit, because the next step was to have identifiers for (bio)molecular complexes. This work is still ongoing, but using a workaround we developed for WikiPathways (because complexes in GPML currently cannot have identifiers), we can now link out to Complex Portal, as visible in this screenshot:

The autophagy initiation complex has the CPX-373 identifier in Complex Portal.

Joining the Wikidata effort is simple. Just visit Wikidata:WikiProject_COVID-19 and find your thing of interest. Because the past two months have been so crowded, I still did not get around to exploring the kg-covid-19 project, but it sounds very interesting too!

Further reading:
  • Waagmeester A, Stupp G, Burgstaller-Muehlbacher S, Good BM, Griffith M, Griffith OL, et al. Wikidata as a knowledge graph for the life sciences. eLife. 2020 Mar 17;9:e52614. 10.7554/eLife.52614
  • Waagmeester A, Willighagen EL, Su AI, Kutmon M, Labra Gayo JE, Fernández-Álvarez D, et al. A protocol for adding knowledge to Wikidata, a case report. bioRxiv [Internet]. 2020 Apr 7 [cited 2020 Apr 17]; 10.1101/2020.04.05.026336
Computer-assisted data curation

For some years now, I have been working on computer-assisted data curation of WikiPathways, but also of Wikidata (for chemical compounds). Once your biological knowledge is machine readable, you can teach machines to recognize common mistakes. Some are basically simple checks, like missing information. But it gets exciting if we take advantage of linked data, and we can have machines check consistency between two or more resources. The better our annotation, the more powerful this computer-assisted data curation becomes. Chris has been urging me to publish this, but I haven't gotten around to it yet.

As part of my COVID-19 work, I have started making curation reports for specific WikiPathways. To enable this, I worked out how to reuse the tests without JUnit, allowing them to be used as a library. That allows creating the reports, and in the future it will also allow use directly in PathVisio. A second improvement to the testing stack is that tests are now more easily annotated, which allows specifying that certain tests only run for a certain WikiPathways portal.

But a lot remains to be done. I think at this moment I only migrated, perhaps, some 5% of all tests. So, this is very much on my "what is next?" list.

What is next?

There is a lot I need, want, and should do. Here are some ideas. Maybe you want to beat me to it. Really, I don't mind being scooped, when it comes to public health. Here goes:
  1. file SARS-CoV-2 book translation update requests for some recent updates
  2. update the SARS-CoV-2 book with a list of important SNPs
  3. add Bioschemas annotation to the SARS-CoV-2 book for individual proteins, genes, etc
  4. update WP4846 with recent literature
  5. have another 'main subject' annotation round for SARS-CoV-2 proteins
  6. migrate more pathways tests from JUnit into the testing library
  7. write a new test to detect preprints in pathway literature lists and check for journal article versions
  8. finish the Dutch translation of the SARS-CoV-2 book
  9. write a tool to recognize WikiPathways complexes with matches in Complex Portal
  10. write a tool to generate markdown for any WikiPathways with curation suggestions based on content in other resources
  11. develop a few HTML+JavaScript pages to summarize WikiPathways COVID-19 Portal content
Am I missing anything? Tweet me or leave a comment here.

Saturday, October 24, 2020

new paper: "A Semi-Automated Workflow for FAIR Maturity Indicators in the Life Sciences"

Figure 1 from the Nanomaterials article.
In a collaboration orchestrated via the NanoSafety Cluster (NSC) Work Group F on Data Management (WGF, formerly known as WG4) between a few H2020 projects (NanoCommons, Gov4Nano, RiskGONE, and NanoSolveIT), we just published an article about using Jupyter Notebooks to assess how FAIR (see doi:10.1162/dint_r_00024) several databases are.

It is important to realize this is not meant to judge these databases, but to provide them with a map of how they can make their databases more FAIR. After all, the notebook explains in detail how the level of FAIRness was assessed and what the database can do to become more "mature". This is what the maturity indicators are about. In doing so, we also discovered that existing sets of maturity indicators do not always benefit the community, often because they currently focus more on the F and the A than on the I and the R (see What is wrong with FAIR today.). Another really neat feature is the visual representation of the map, proposed by Serena (now at Transparent Musculoskeletal Research) in this paper (shown on the top right).

I would like to thank everyone involved in the project, the NSC projects involved in the discussions (Joris Quik, Martine Bakker, Dieter Maier, Iseult Lynch), Serena for starting this work (see this preprint), Laurent for reviewing the notebook, and Ammar and Jeaphianne for their hard work on finishing the paper into this, now published, revision (the original paper was rejected).

Navigating the academic system. Or, will I face an evolution or a revolution?

Image from the time of the Batavian Revolution.
Img: Rijksmuseum, public domain.
When I started this blog, I was still a PhD student myself. Now, I am an assistant professor ("universitair docent" in Dutch), tenured, and I've acquired funding to fund other researchers. I have not gotten used to this yet. I don't like this hierarchical system, but am forced to get used to it. So, when I write "new paper", this paper is no longer "mine". I'm more like a sponsor, attempting to give the researchers that work "for" me advice but also the opportunity to do their own thing.

Of course, this is complicated when the funding comes from grants, where the person I start collaborating with, is greatly reduced in their academic freedom. Talking about navigation: balancing scientific impact and project deliverables requires a good amount of flexibility and creativity.

There is a lot happening. The system is broken. Basically, there is no upper limit, and as long as selection is based on volume and not quality (#blasphemy), more is better. So, what do I tell and teach the people working on my grants, those that are under my supervision? Do I teach them how to succeed in the academic system, or do I teach them how to do good research? Ideally, those would be the same. But they are not. The system is broken.

The example of this is well-known: publishing in journals with a high impact factor. That has for at least two decades been seen as a measure of quality and has therefore long been used in assessments, at many levels. Of course, the quality of any paper is not a function of the place where it gets published. It can be argued that research is done better for higher impact journals, but I still need to see the data that confirms that. But there is a far more worrying thing, that exactly proves my point that volume is not the same as quality: apparently, it is acceptable to submit inferior work to journals (with a lower impact factor). The system is broken.

The system is breaking down. People are looking for solutions. Some solutions are questionable. Some good solutions are not taken, because of being problematic in the short term. But one way or another, the system is broken and must be fixed. I hope it can go via evolution, but when the establishment is fighting the evolution, a revolution may be the only option left.

Here are some possible evolutions and some revolutions people are talking about.


Evolutions:
  1. all research output will get recognized and rewarded
  2. people will stop using fake measures of success like the journal impact factor
  3. journal articles will actually describe the experiment in detail (reproducibility)
Revolutions:
  1. the journal will cease to exist and the venue will be the research output itself
  2. we start recognizing and rewarding research output instead of researchers
  3. research is no longer a competition for funding


Now, will either happen? In the whole discussion about, for example, cOAlition S, people are hiding behind "but then ..., because in X ...". For example, they argue that careers are damaged if they do not publish in high impact journals. This argument values researchers over research (output). As if this person would not find a job without that article. As if the research itself has no value. Sadly, this is partly true. So, following this reasoning, doing a PhD in The Netherlands damages your career too. You had better do it in the USA (or Oxford, Cambridge) in the first place.

So, what better place to give this evolution or revolution shape than in a country like The Netherlands, where the research quality (at least on average) is high and only volume is lacking to compete with the Harvards of this world. Without billions in extra funding, that volume is not going to happen. So far, we managed by working 60 hours/week instead of 40 hours/week. Adding another 20 hours a week is not going to happen (not without deaths; if you think that's an exaggeration, just check the facts, please).

Fortunately, The Netherlands has a good track record with revolutions. A good track record in highly impactful research too. The country has shown how strong our research is, and with the National Plan Open Science and strong involvement in cOAlition S, evolution is very clearly being tried. Not all solutions work equally well, and there is strong opposition from some. Some years ago, some believed it inconceivable that Nature would start allowing Open Access, but the change is coming. Willingly? Well, with the APC they are going to charge, I cannot really say they do so willingly. Evolution, not revolution.

The real evolution/revolution we need, however, is not about open access. It is about fair and open science.

Saturday, October 17, 2020

Posh Publishing and why Recognition and Reward must be without

File:Posh.jpg from WikiCommons.
Earlier this week I read an article in Nature about the high APC being a problem. Yes, the publishing system is very expensive, but we also know that some publishers increase(d) the APC based on demand. Yes, the publishers do not differentiate prices between regions. Yes, the waiver concept, just by the name, is a problem. Publishing in high-impact journals is posh publishing (there is no evidence your article actually becomes more scientifically sound).

Posh publishing is a direct result of human behavior. We like posh. We learn to dress posh, to act posh. This is strongly embedded in Western culture. It's gender independent. We all like posh. The posh-fetish goes deep. Very deep.

Why do we want to be posh? Well, that answer is given in the comment: Why does a high-impact publication matter so much for a career in research? As long as we keep seeing posh as better for your career, plenty of people will be more than happy to pay for it. We're all humans. We're suckers for pain.

Therefore, in the VSNU Recognition and Reward we must compensate for this human behavior. Is that weird? Not at all. Many academic habits are important to overcome human nature. For example, we have to force ourselves to overcome our flaws. What can we do?

  1. recognize and reward all research output (software, data, standards, policies, advice, grant deliverables, open standards)
  2. recognize and reward particularly activities that focus on removing the posh from the science
  3. learn to recognize your flaws, your biases, your poshness
Happy Saturday!

Saturday, August 29, 2020

What is wrong with FAIR today.

Image of kevlar that has nothing to do with this blog post,
except that it is Openly licensed. Source: cacyle 2005, GFDL.

In the past year, we have been working in the NanoSafety Cluster on FAIR research output (for our group, via NanoCommons, RiskGONE, NanoSolveIT, collaborating with other projects, such as ACENano and particularly Gov4Nano), analyzing resources, deciding where the next steps are. Of course, in the context of GO FAIR (e.g. via the Chemistry Implementation Network), ELIXIR, RDA, EOSC, etc.

But something seems to be going wrong. For example, some Open Science communities are making FAIR the highest priority (but formally FAIR != Open; arguably it should be: O'FAIR), and there is the strong positioning of the data steward who will make research data FAIR. I never felt too comfortable with this, and we're about to submit an article discussing it. I would like to stress that this is not about how to interpret the guidance. The original paper defines the principles pretty well, and the recent interpretations and implementation considerations give a lot of context.

What is wrong with FAIR today is that we are losing focus. The real aim is reuse of data. Therefore, FAIR data without an Open license is silly. Therefore, data that cannot be found does not help. Therefore, we want clear access, allowing us to explain our Methods sections properly. Therefore, interoperability, because data that cannot be understood (in enough detail) is useless.

On data stewardship

So, when EOSC presents essential skills, I find it worrying that data stewardship is separated from research. I vehemently disagree with that separation. Data stewardship is a core activity of doing research, and something is seriously wrong if it is left to others. Otherwise, we will have p-hacking discussions for the next 100 years.

On the open license manager

A second major problem is one important missing skill: an open license manager. For this one, I'm perfectly fine leaving it to specialists. The license, after all, does not affect the research. But not having this explicitly in the diagram violates our Open Science ideas (e.g. inclusiveness, collaboration, etc).

Not having the open license at the core of Open Science is just the development of more paywalled Open Science. Look, there is a time and place for closed data, but that is totally irrelevant here. Bringing up that argument is a fallacy. (If you are a scholar and disagree, you just created an argument that open license management should be a core task of a researcher.)

Computable figures. eLife's Executable Research Article

When I did my PhD and wrote my articles, Ron Wehrens introduced me to R as an open source alternative to MatLab, which was otherwise the standard in the research group. At some point, I got tired of making new plots, and I started saving the R code to make the plot. I still have this under version control (not public; it will be released 70 years after my death; I mean, that's still the industry standard at this moment </sarcasm>).

Anyway, I'm delighted that the publisher behind eLife keeps on innovating and introduced their Executable Research Article. The idea of live figures still excites me very much, and you can find many examples of that in my blog; we also actively use it in our Scholia project (full proposal). In fact, I still teach Maastricht University students this too, in the Maastricht Science Programme PRA3006 course.

I really wish we had something like this at BMC too, because I'm sure a good number of Journal of Cheminformatics authors would be excited with such functionality. This is their workflow:

Workflow of publishing an ERA in eLife. Image license: CC-BY, source.

One of the tools they mention is Stencila which I really need to look at in detail. It is the kind of Open Science infrastructure that universities should embrace. I'm also excited to see that citation.js is mentioned in the source code, one of the projects Lars Willighagen has been working on, see this publication.

Monday, August 17, 2020

Research line: Interactions and effects of nanomaterials and living matter


Because it is hard to get funded
for interdisciplinary work, in a domain
that does not regularly publish in glossy
journals, I found my funding not with
national funders, but with the European Commission.
As a multidisciplinary researcher I had to wait long to become an expert. Effectively, you have to be an expert in more than one field. And, as I am experiencing right now, staying an expert is a whole other dimension. But with a solid chemistry, computing science, cheminformatics, and chemometrics education, I found myself versatile enough that I at some point landed that grant proposal (the first few failed). Though, I have had my share of travel grants and Google Summer of Code projects (microfunding).

So, while I am trying to establish a research line in metabolomics (also with some microfunding) and one PhD candidate (own funding), my main research line is nanosafety. My background fits in well, and while the data quality for predictive toxicology leaves something to be desired, there is a lot of work we can do here, to make the most of what is being measured.

Indeed, there are many interesting biological, chemical, and chemo-/bioinformatics research questions here (just to name a few):

  • does the mechanism of cell entry differ for different engineered nanomaterials?
  • does it differ from how "natural" nanomaterials enter the cell?
  • does the chemical composition of the nanomaterial change when it comes into contact with living matter? (yes, but how? is it stable?)
  • how do we represent the life cycle of nanomaterials in a meaningful way?
  • does each cell type respond in the same way to the same material? is this predominantly defined by the cell's epigenetics or by the chemical nature of the material?
  • given the sparseness of physicochemical and biological characterization of nanomaterials, what is the most appropriate representation of a material: based on physicochemical description, ontological description, or chemical graph theory?
  • can ontologies help us group data from different studies to give an overview of the whole process from molecular initiating event to adverse outcome?
  • can these insights be used to reliably and transparently advise the European people about risk?
We try to define answers to these questions in a series of FP7/H2020 projects using an Open Science approach, allowing our analyses to be updated frequently when new data or new knowledge comes in. These are the funded projects for which I am (was) PI:
  • eNanoMapper (EC FP7, ended, but since this project developed Open solutions, it is reused a lot)
  • NanoCommons (EC H2020, our work focussing on continuing the common ontology)
  • RiskGONE (EC H2020, focusing on regulatory advising based on scientific facts)
  • NanoSolveIT (EC H2020, computational nanosafety)
  • Sbd4Nano (EC H2020, disseminating nanosafety research to the EU industry)
In these projects (two PhD candidates, one postdoc), open science has been important to what we do. And while not all partners in all projects use Open Science approaches, our group tries to be as Open as possible. Several open science projects are involved.
If you want to read more about these projects in the scientific literature, check the project websites, which often have a page with publications and deliverables. Or check my Google Scholar profile. And for an overview of our group, see this page.

Friday, July 31, 2020

New Editorial: "Adoption of the Citation Typing Ontology by the Journal of Cheminformatics"

My first blog post about the Citation Typing Ontology was already more than 10 years ago. I have been fascinated with finally being able to add some semantics to why we cite a certain article. For years, I had been tracking why people were citing the Chemistry Development Kit articles. Some were citing the article because the Chemistry Development Kit was an important thing to mention, while other articles cited it because they actually used the Chemistry Development Kit. I also started using CiTO predicates in RDF models, and you might find them in various ongoing semantic web projects.

Unfortunately, scholarly publishers did not show much interest. One project that did was CiteULike. I had posted it as a feature request and it was picked up by CiteULike, something I am still grateful for. CiteULike no longer exists either, but I had a lot of fun with it while it existed:
  1. CiteULike CiTO Use Case #1: Wordles
  2. CiTO / CiteULike: publishing innovation
But I like to also stress it has more serious roles in our scientific dissemination workflow:
  1. "What You're Doing Is Rather Desperate"
So, I am delighted that we are now starting a pilot with the Journal of Cheminformatics to use CiTO annotation at the journal side. You can read it in this new editorial.

It is a first step of a second attempt to get CiTO off the ground. Had CiteULike still existed, this would have been a wonderful mashup, but Wikidata might be a good alternative. In fact, I already trialed a data model and developed several SPARQL queries. Support in Scholia is a next step on this front.

Now, citation networks in general have received a lot of attention. And with projects like OpenCitations we increasingly have access to this information. That allows visualisation, for example, with Scholia, here for the 2010 paper:

More soon!

For now, if you like to see the CiTO community growing too, please tweet, blog, message your peers about our new editorial:

Willighagen, E. Adoption of the Citation Typing Ontology by the Journal of Cheminformatics. J Cheminform 12, 47 (2020).

Tuesday, July 28, 2020

new paper: "Risk Governance of Emerging Technologies Demonstrated in Terms of its Applicability to Nanomaterials"

Design of the Council and the
processes around it.
In April I reported about a paper outlining NanoSolveIT, and now another paper outlining plans has come out, this time detailing the Risk Governance Council which the European Commission asked three H2020 projects to set up. One is RiskGONE, in which our group is involved; the other two are Gov4Nano and NANORIGO:

Isigonis P, Afantitis A, Antunes D, Bartonova A, Beitollahi A, Bohmer N, et al. Risk Governance of Emerging Technologies Demonstrated in Terms of its Applicability to Nanomaterials. Small. 2020 Jul 23;2003303. 10.1002/smll.202003303

What I personally hope we will achieve with this Council is that all our governance is linked clearly, via FAIR data, provenance, and reproducible research, to the underlying experiments. This requires many FAIR approaches, something that RiskGONE and Gov4Nano have already been working on closely together in the NanoSafety Cluster WGF.

Of course, all this requires good data (think NanoCommons) and good computation (think NanoSolveIT).

Sunday, July 12, 2020

Journal performance, individual articles, assessment and ranking, and researchers

Sign here. Image: CC-BY-SA.
Okay, there it is: journal performance, individual articles, assessment and ranking, and researchers. It has it all. Yes, it is journal impact factor season.

Most scholars by now know when and when not to use the impact factor. But old habits die slowly and the journal impact factor (JIF or IF) is still used a lot to rank journals, rank universities, rank articles, rank researchers.

I signed DORA, but that does not mean I do not know that the (change year over year of the) IF hints at how a journal is doing. Yes, a median is better than an average. A citation count distribution is even better. After all, a stellar IF still means that tens of percent of the articles in the same period are cited only once or not at all.

One striking voice was angry that the Journal of Cheminformatics tweeted its new IF. We did not do so without internal discussion and deliberation. Readers of the journal know we do not mention the IF on our front page (as many journals do). We are working on displaying the citation distribution on a subpage of the website. And we want authors to submit to our journal because we value Open Science and have reviewers that value that too. We want articles in our journal to be easily reproduced.

But I know reality. I know many researchers are still expected to report IFs along with their articles. I am one of them (in the past 8 years, articles in a journal with IF>5 were "better"). I've been objecting against it for many years, and fortunately there is a path away from them in The Netherlands. If you must rank articles and researchers, then rank them according to their own work, and not based on the work of others. So, I decided that I had no objection against tweeting the J. Cheminform. IF.

Interestingly, if you really want to push this, you should also not mention journal names in your publication list. Let the scholars ranting against the IF but still cheering a Nature, Cell, or Science article rethink their reasoning.

So, what should we do? How should we move forward? Of course I have some ideas about this. Just (re)read my blog. Progress is slow. But I ask everyone who rants about the IF to not just propose better solutions, but to actively disseminate them. Implement those solutions and get other people to use them. For example, send your journal an open Letter to the Editor to make a clear statement against the use of the IF as a reason to publish in that journal.

If that is too much for you, at least sign DORA and ask your peers to do so too.

Thursday, July 02, 2020

Bioclipse git experiences #2: Create patches for individual plugins/features

Carrying around git patches is hard work.
Source: Auckland War Memorial Museum, CC-SA.
This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring version control repositories while actually keeping the history. For example, you may want to copy or move code from one repository to another. A second use case is a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but I regularly apply the approach in other situations too.

This second post talks about how to migrate code from one repository to another.

Create patches for individual plugins/features

While the filter-branch approach from the companion post works pretty well, a good alternative, in situations where you only need a repository-with-history for a few plugins, is to use patch sets.
  • first, initialize a new git repository, e.g. bioclipse.rdf:
 mkdir bioclipse.rdf
 cd bioclipse.rdf
 git init
 nano README
 git add README
 git commit -m "Added README with some basic info about the new repository"
  • then, for each plugin you need, discover the commit where the plugin was first committed, using the git-svn repository created earlier:
 cd your.gitsvn.checkout
 git log --pretty=oneline externals/com.hp.hpl.jena/ | tail -1
  • then create patches starting from the parent of that first commit, by appending '^1' to the commit hash. For example, the first commit of the Jena libraries was 06d0eb0542377f958d06892860ea3363e3316389, so I type:
 rm 00*.patch
 git format-patch 06d0eb0542377f958d06892860ea3363e3316389^1 -- externals/com.hp.hpl.jena
(tune the filter when removing old patches if there are more than 99!)
The previous two steps can be combined into a Perl script:
use diagnostics;
use strict;

my $plugin = $ARGV[0];

if (!$plugin) {
  print "Syntax: gfp <plugin|feature>\n";
  exit(1);
}

die "Cannot find plugin or feature $plugin !" if (!(-e $plugin));

`rm -f *.patch`;
my $hash = `git log --follow --pretty=oneline $plugin | tail -1 | cut -d' ' -f1`;
$hash =~ s/\n|\r//g;

print "Plugin: $plugin \n";
print "Hash: $hash \n";
`git format-patch $hash^1 -- $plugin`;
  • move these patches into your new repository:
 mv 00*.patch ../bioclipse.rdf
(tune the filter when moving the patches if there are more than 99! Also customize the target folder name to match your situation)
  • apply the new patches in your new git repository:
 cd ../bioclipse.rdf
 git am 00*.patch
(You're on your own if that fails... and you may have to fall back on the other alternative then)
  • repeat those two steps for all plugins you want in your new repository
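The whole flow above can be replayed as a self-contained sketch on throw-away repositories (all repository and plugin names below are made up for the demo):

```shell
# Demo of the patch-based migration; all paths and names are invented.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A source repository with a plugin folder whose history we want to keep.
git init -q source
cd source
git -c user.name=demo -c user.email=demo@example.org commit -q --allow-empty -m "root"
mkdir -p plugins/demo.plugin
echo "code" > plugins/demo.plugin/Main.java
git add plugins
git -c user.name=demo -c user.email=demo@example.org commit -q -m "Added demo.plugin"

# Find the first commit touching the plugin, then create patches from its parent.
first=$(git log --pretty=oneline -- plugins/demo.plugin | tail -n 1 | cut -d' ' -f1)
git format-patch "$first"^1 -- plugins/demo.plugin > /dev/null

# A fresh target repository: applying the patches preserves the commit history.
cd "$tmp"
git init -q target
cd target
git -c user.name=demo -c user.email=demo@example.org am -q ../source/00*.patch
```

After the `git am`, the target repository contains the plugin files together with their original commit messages and authorship.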

Bioclipse git experiences #1: Strip away unwanted plugins

This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring version control repositories while actually keeping the history. For example, you may want to copy or move code from one repository to another. A second use case is a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but I regularly apply the approach in other situations too.

For this first post, think of a plugin as a subfolder, though it even applies to individual files.

Strip away unwanted plugins

  • first, make a bare clone of your repository:
 git clone --bare --no-hardlinks old.local.clone/ new.local.clone/
then remove everything you do not want in your new git repository with:
 git filter-branch --index-filter 'git rm -r -q --cached --ignore-unmatch plugins/net.bioclipse.actionHistory plugins/net.bioclipse.analysis' HEAD
It often happens that you need to run the above command several times, in cases when there are many subdirectories to be removed.
When you have removed all the bits you wanted gone, you can clean up the repository and reduce its size considerably with:
 git repack -ad; git prune
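As a self-contained sketch, the same strip-and-repack sequence can be tried on a throw-away repository (the plugin names are made up; the FILTER_BRANCH_SQUELCH_WARNING variable just silences the warning newer git versions print when running filter-branch):

```shell
# Demo of stripping one plugin from history; all names are invented.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
mkdir -p plugins/net.bioclipse.keep plugins/net.bioclipse.drop
echo keep > plugins/net.bioclipse.keep/file.txt
echo drop > plugins/net.bioclipse.drop/file.txt
git add plugins
git -c user.name=demo -c user.email=demo@example.org commit -q -m "two plugins"

# Rewrite history so the unwanted plugin never existed.
git filter-branch --index-filter \
  'git rm -r -q --cached --ignore-unmatch plugins/net.bioclipse.drop' HEAD

# Clean up and shrink the rewritten repository.
git repack -adq && git prune
```

Note that filter-branch keeps a backup under refs/original, so the old objects are not fully gone until that backup is removed too.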

Thursday, May 07, 2020

new project: "COVID-19 Disease Maps"

Project logo by Marek Ostaszewski.
Although it already started a few weeks ago, the COVID-19 Disease Maps project now has a sketch published, outlining the ambitions: COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms (doi:10.1038/s41597-020-0477-8).

I've been focusing on the experimental knowledge we have about the components of the SARS-CoV-2 virion and how they interact with the human cell. I'm at least two weeks behind on reading literature, but hope to catch up a bit this week. The following diagram shows one of the pathways on the WikiPathways COVID-19 Portal:

wikipathways:WP4846, CC0
This has led to collaborations with Andra Waagmeester, Jasper Koehorst and others, resulting in this preprint that needs some tweaking before submission, to an awesome collaboration with Birgit Meldal around the Complex Portal (preprint pending), and to a Japanese translation of a book with a number of search queries against Wikidata (CC-BY/CC0). The latter two were started at the recent online BioHackathon.

Oh, boy, do I love Open Science.

Monday, April 27, 2020

new paper: "NanoSolveIT Project: Driving nanoinformatics research to develop innovative and integrated tools for in silico nanosafety assessment"

Fig. 1. Schematic overview of the workflow for toxicogenomics
modelling and how these models feed into the subsequent
materials modelling and IATA. Open Access.
NanoSolveIT is a H2020 project that started last year. Our BiGCaT group is involved in the data integration to support systems biology part of the Integrated Approaches to Testing and Assessment (IATA) for engineered nanomaterials in Work Package 1. This paper gives an overview of the project, the work, and the goals.

Of course, doing this is not trivial at all. And we have to bridge a lot of different research data, concepts, etc. As such, it is clear how it relates to the other nanosafety projects we have been involved in, such as eNanoMapper, NanoCommons, and RiskGONE.

Afantitis A, Melagraki G, Isigonis P, Tsoumanis A, Danai Varsou D, Valsami-Jones E, et al. NanoSolveIT Project: Driving Nanoinformatics research to develop innovative and integrated tools for in silico nanosafety assessment. Computational and Structural Biotechnology Journal. 2020 Mar;S2001037019305112. 

Sunday, March 29, 2020

Tackling SARS-CoV-2 with big data

This blog post contains a translation I made of this short "our story" Coronavirus te lijf met big data on the MUMC+ website, written by André Leblanc. The Maastricht University Medical Center+ (MUMC+) is a collaboration involving our Maastricht University Faculty of Health, Medicine and Life Sciences, of which our BiGCaT research group is part.

Wikidata is a community project and I only use and contribute to it. Scholia is a project started by Finn Nielsen (Technical University of Denmark - DTU), and now has funding from the Alfred P. Sloan Foundation, coordinated by Daniel Mietchen and Lane Rasberry (University of Virginia). Further acknowledgements to Andra Waagmeester (Micelio) and Jasper Koehorst (Wageningen University) for a great collaboration on coronavirus information (see also Wikidata:WikiProject_COVID-19), and to WikiPathways colleagues including, of course, Prof. Chris Evelo and Dr. Martina Kutmon in Maastricht, but also Dr. Alex Pico and others in San Francisco. For me, WikiPathways was one of the selling points of the research group when I joined in 2012.

Tackling the corona virus with big data

Scholars around the world are working relentlessly on the development of a vaccine against the new SARS-CoV-2 coronavirus. Chemist and assistant professor Egon Willighagen contributes in collaboration with colleagues at the BiGCaT Department of Bioinformatics in Maastricht to make data and knowledge easier to find for other scholars. How does that work?

Big data is the new buzzword in the scholarly community. Think, for example, of collecting worldwide data around the treatment of cancer, and extracting from that the best personal, unique treatment. In the case of the new coronavirus there is a more general need to just have access to data. Since the virus outbreak in Wuhan, China, there has been an explosion of new research articles on COVID-19 and the SARS-CoV-2 virus that causes it. The total number of scientific publications about coronaviruses has reached some 29 thousand. These are not only about the new virus, but also about the coronaviruses that roamed the world before, like SARS and MERS. Either way, this makes it practically impossible to read all these articles. Instead, access to this literature has to be provided in a different way, allowing researchers to find the knowledge and data they need for their research.

Willighagen does this by organizing scientific literature, linking information, and filtering the collection of data and publications, making it searchable for scholars. He annotates publications with search terms and author names, and uses unique, global identifiers (like personal identification numbers) to support this. This is not unlike the use of phone numbers or dictionaries.

Various tools

Wikidata is the database used by Willighagen to link the information resources, along with Scholia to visualize the results. For example, Wikidata organizes data around the new virus with a dedicated entry. Willighagen uses these two tools to visualize what this database knows about specific topics.

Researchers can also take advantage of a new open access resource edited by Willighagen. Social media are used as well: Twitter serves to increase awareness and mobilize people. Willighagen: "That is from a personal motivation. I tweet articles that show important changes. Or if they emphasize aspects that show how unique and urgent the situation is". And finally there is WikiPathways, a project initiated by colleagues of Willighagen, to collect even more specific knowledge about COVID-19. Here's the pathway about the SARS-CoV-2 virion:

Thursday, March 19, 2020

new paper: "Wikidata as a knowledge graph for the life sciences"

A figure from the article, outlining the idea
of using SPARQL queries to extract data
from the open knowledge base.
As a reader of my blog, you know I have been doing quite some research where Wikidata has some role. I am preparing a paper on the work I have done around chemicals in Wikidata, based on what I presented at the ICCS with a poster. So, I was delighted when Andra and Andrew asked me to contribute to a paper outlining the importance of Wikidata to the life sciences. The paper was published in eLife, which I'm excited about too, as they do a significant amount of publishing innovation.

I'll keep this post brief, as I have plenty of work to do, among which is SARS-CoV-2 data in Wikidata. Join this project, after you read the paper: Wikidata as a knowledge graph for the life sciences (doi:10.7554/eLife.52614, or in Scholia):

I'll write up some more queries for this eBook now: Wikidata Queries around the SARS-CoV-2 virus and pandemic.

Sunday, March 15, 2020

SARS-CoV-2, stuck at home, flu, and snowstorms

Scholia linking articles about the COVID19 disease.
Okay, okay, the snowstorm was ten years ago, when we were living in Sweden. We had two snowstorms, each time stuck at home, unable to leave our house. That was okay. We knew the next days the streets were cleaned, and we could continue living our lives.

Now it's different. I've been in 'social distancing' mode since the evening of Friday the 6th, so a bit over a week now. Because I have a flu. Presumably. Testing for SARS-CoV-2 is not routinely done and saved for risk groups and patients with severe COVID19 symptoms.

But the current situation is once in a lifetime. In the bad way. My generation has not had a situation like this yet. A real national emergency. But The Netherlands is coping. The data is scary. The situation in North Italy shows that humans are humans, and the virus doesn't care where it is surviving. It is how each country deals with it. And let me make clear, we must be learning from the countries that have been in the fire line already.

(North) Italy has a health care system in the top 5% according to OECD guidelines. Still, they were taken by surprise. But even the warned countries have been hesitant. The discussion is complex. A smaller economy (a 1% shrink is estimated right now) also means (as a Dutch professor pointed out 2, 3 days ago) there is less tax money to spend on the health care system.

Sad fact is, we are no longer talking about how to stop SARS-CoV-2. We are now talking about minimizing the number of casualties. A storm it is.

Keep safe, keep electronically in contact with the people around you (mental health), and foremost, wash your hands and practice social distancing. Let the storm not grow much further. This storm is not over the next morning. We're in for a rough ride.

Saturday, January 25, 2020

MetaboEU2020 in Toulouse and the ELIXIR Metabolomics Community assemblies

This week I attended the European RFMF Metabomeeting 2020, aka #MetaboEU2020, held in Toulouse. Originally, I had hoped to travel there by train, but that turned out to be unfeasible. Co-located with this meeting were ELIXIR Metabolomics Community meetings. We're involved in two implementation studies, together amounting to less than a month of work. But both this community and the conference are great places to talk about WikiPathways, BridgeDb (our website is still disconnected from the internet), and cheminformatics.

Toulouse was generally great. It comes with its big-city issues, like fairly expensive hotels, but also a very frequent public transport system. It also had a great food market where we had our "gala dinner". Toulouse is also home to Airbus, so it was hard to miss the Beluga:

The MetaboEU2020 conference itself had some 400 participants, of course with a lot of wet lab metabolomics. As a chemist, with a good pile of training in analytical chemistry, it's great to see the progress. From a data analysis perspective, though, the community still has a long way to go. We're still talking about known knowns, unknown knowns, and unknown unknowns. The posters were often cryptic, e.g. stating they found 35 interesting metabolites without actually listing them. The talks were also really interesting.

Now, if you read this, there is a good chance you were not at the meeting. You can check the above linked hashtag for coverage on Twitter, but we can do better. I loved Lanyrd, but their business model was not scalable and the service no longer exists. But Scholia (see doi:10.3897/rio.5.e35820) could fill the gap (it uses the Wikidata RDF and SPARQL queries). I followed Finn's steps, created a page for the meeting, and started associating speakers (I've done this in the past for other meetings too):

Finn also created proceedings pages in the past, which I also followed. So, I asked people on Twitter to post their slidedeck and posters on Figshare or Zenodo, and so far we ended up with 10 "proceedings" (thanks to everyone who did!!!):

As you can see, there is an RSS feed which you can follow (e.g. with Feedly) to get updates when more material appears online! I wish all conferences did this!

Thursday, January 16, 2020

Help! Digital Object Identifiers: Usability reduced if given at the bottom of the page

The (for J. Cheminform.) new SpringerNature article template has the Digital Object Identifier (DOI) at the bottom of the article page. So, every time I want to use the DOI I have to scroll all the way down the page. That might be fine for abstracts, but it is totally unusable for Open Access articles.

So, after our J. Cheminform. editors telcon this Monday, I started a Twitter poll:

Where do I want the DOI? At the top, with the other metadata:
Recent article in the Journal of Cheminformatics.
If you agree, please vote. With enough votes, we can engage with upper SpringerNature management to have journals choose where they want the DOI to be shown.

(Of course, the DOI as semantic data in the HTML is also important, but there is quite good annotation of that in the HTML <head>. A link out to RDF about the article is still missing, I think.)
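For reference, such head annotation typically looks like the hypothetical snippet below, using the widely adopted Highwire-style citation_* meta tags (the Dublin Core dc.identifier variant is also common); the title and DOI values here are made up for illustration:

```html
<head>
  <!-- Hypothetical example of machine-readable article metadata in the HTML head.
       citation_doi is the common convention; the DOI below is invented. -->
  <meta name="citation_title" content="Some cheminformatics article" />
  <meta name="citation_doi" content="10.1186/example-0000-0" />
  <meta name="dc.identifier" content="doi:10.1186/example-0000-0" />
</head>
```

Tags like these are what reference managers and scholarly search engines scrape, independently of where the DOI is shown visually on the page.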