Saturday, August 29, 2020

What is wrong with FAIR today.

Image of kevlar that has nothing to do with this blog post,
except that it is Openly licensed. Source: cacyle 2005, GFDL.

In the past year, we have been working in the NanoSafety Cluster on FAIR research output (for our group, via NanoCommons, RiskGONE, NanoSolveIT, collaborating with other projects, such as ACENano and particularly Gov4Nano), analyzing resources, deciding where the next steps are. Of course, in the context of GO FAIR (e.g. via the Chemistry Implementation Network), ELIXIR, RDA, EOSC, etc.

But something seems to be going wrong. For example, some Open Science communities have made FAIR their highest priority (even though formally FAIR != Open; arguably it should be: O'FAIR), and there is the strong positioning of the data steward as the person who will make research data FAIR. I never felt too comfortable with this, and we are about to submit an article discussing it. I'd like to stress that this is not about how to interpret the guidance: the original paper defines the principles pretty well, and the recent interpretations and implementation considerations give a lot of context.

What is wrong with FAIR today is that we are losing focus. The real aim is reuse of data. Therefore, FAIR data without an Open license is silly. Therefore, data that cannot be found does not help. Therefore, we want clear access, allowing us to explain our Methods sections properly. Therefore, interoperability, because data that cannot be understood (in enough detail) is useless.

On the Data stewardship

So, when EOSC presents essential skills, I find it worrying that data stewardship is separated from research. I vehemently disagree with that separation. Data stewardship is a core activity of doing research and something is seriously wrong if it is left to others. Otherwise we will be having p-hacking-style discussions for the next 100 years.

On the Open license manager

A second major problem is one important missing skill: the open license manager. For this one, I'm perfectly fine leaving it to specialists. The license, after all, does not affect the research itself. But not having this explicitly in the diagram violates our Open Science ideas (e.g. inclusiveness, collaboration, etc.).

Not having open licenses at the core of Open Science just develops more paywalled Open Science. Look, there is a time and place for closed data, but that is totally irrelevant here. Bringing up that argument is a fallacy. (If you are a scholar and disagree, you just created an argument that open license management should be a core task of a researcher.)

Computable figures. eLife's Executable Research Article

When I did my PhD and wrote my articles, Ron Wehrens introduced me to R as an open source alternative to MatLab, which was otherwise the standard in the research group. At some point, I got tired of making new plots, and I started saving the R code used to make each plot. I still have this under version control (not public; it will be released 70 years after my death; I mean, that's still the industry standard at this moment </sarcasm>).

Anyway, I'm delighted that the publisher behind eLife keeps on innovating and introduced their Executable Research Article. The idea of live figures still excites me very much; you can find many examples of it in my blog, and we actively use it in our Scholia project (full proposal). In fact, I still teach this to Maastricht University students, in the Maastricht Science Programme PRA3006 course.

I really wish we had something like this at BMC too, because I'm sure a good number of Journal of Cheminformatics authors would be excited by such functionality. This is their workflow:

Workflow of publishing an ERA in eLife. Image license: CC-BY, source.

One of the tools they mention is Stencila which I really need to look at in detail. It is the kind of Open Science infrastructure that universities should embrace. I'm also excited to see that citation.js is mentioned in the source code, one of the projects Lars Willighagen has been working on, see this publication.

Monday, August 17, 2020

Research line: Interactions and effects of nanomaterials and living matter


Because it is hard to get funded for interdisciplinary work, in a domain that does not regularly publish in glossy journals, I found my funding not with national agencies, but with the European Commission.
As a multidisciplinary researcher I had to wait a long time to become an expert: effectively, you have to be an expert in more than one field. And, as I am experiencing right now, staying an expert is a whole other dimension. But with a solid chemistry, computing science, cheminformatics, and chemometrics education, I found myself versatile enough that at some point I landed that grant proposal (the first few failed). Though I have had my share of travel grants and Google Summer of Code projects (microfunding) along the way.

So, while I am trying to establish a research line in metabolomics (also with some microfunding) and one PhD candidate (own funding), my main research line is nanosafety. My background fits in well, and while data quality for predictive toxicology leaves much to be desired, there is a lot of work we can do here to make the most of what is being measured.

Indeed, there are many interesting biological, chemical, and chemo-/bioinformatics research questions here (just to name a few):

  • does the mechanism of cell entry differ for different engineered nanomaterials?
  • does it differ from how "natural" nanomaterials enter the cell?
  • does the chemical composition of the nanomaterial change when it comes into contact with living matter? (yes, but how? is it stable?)
  • how do we represent the life cycle of nanomaterials in a meaningful way?
  • does each cell type respond in the same way to the same material? is this predominantly defined by the cell's epigenetics or by the chemical nature of the material?
  • given the sparseness of physicochemical and biological characterization of nanomaterials, what is the most appropriate representation of a material: based on physicochemical description, ontological description, or chemical graph theory?
  • can ontologies help us group data from different studies to give an overview of the whole process from molecular initiating event to adverse outcome?
  • can these insights be used to reliably and transparently advise the European people about risk?
We try to define answers to these questions in a series of FP7/H2020 projects using an Open Science approach, allowing our analyses to be updated frequently when new data or new knowledge comes in. These are the funded projects for which I am (was) PI:
  • eNanoMapper (EC FP7, ended, but since this project developed Open solutions, it is reused a lot)
  • NanoCommons (EC H2020, our work focussing on continuing the common ontology)
  • RiskGONE (EC H2020, focusing on regulatory advising based on scientific facts)
  • NanoSolveIT (EC H2020, computational nanosafety)
  • Sbd4Nano (EC H2020, disseminating nanosafety research to the EU industry)
In these projects (two PhD candidates, one postdoc), open science has been important to what we do. And while not all partners in all projects use Open Science approaches, what our group does tries to be as Open as possible. Some open science projects involved:
If you want to read more about these projects in the scientific literature, check the project websites, which often have a page with publications and deliverables. Or check my Google Scholar profile. And for an overview of our group, see this page.

Friday, July 31, 2020

New Editorial: "Adoption of the Citation Typing Ontology by the Journal of Cheminformatics"

My first blog post about the Citation Typing Ontology was already more than 10 years ago. I have been fascinated with finally being able to add some semantics to why we cite a certain article. For years, I had been tracking why people were citing the Chemistry Development Kit articles. Some were citing the article because the Chemistry Development Kit was an important thing to mention, while other articles cited it because they actually used the Chemistry Development Kit. I also started using CiTO predicates in RDF models, and you might find them in various ongoing semantic web projects.

Unfortunately, scholarly publishers did not show much interest. One project that did was CiteULike. I had posted it as a feature request and it was picked up by CiteULike, something I am still grateful for. CiteULike no longer exists, but I had a lot of fun with it while it existed:
  1. CiteULike CiTO Use Case #1: Wordles
  2. CiTO / CiteULike: publishing innovation
But I would also like to stress that it has more serious roles in our scientific dissemination workflow:
  1. "What You're Doing Is Rather Desperate"
So, I am delighted that we are now starting a pilot with the Journal of Cheminformatics to use CiTO annotation at the journal side. You can read it in this new editorial.

It is a first step of a second attempt to get CiTO off the ground. Had CiteULike still existed, this would have been a wonderful mashup, but Wikidata might be a good alternative. In fact, I already trialed a data model and developed several SPARQL queries. Support in Scholia is a next step on this front.

Now, citation networks in general have received a lot of attention, and with projects like OpenCitations we increasingly have access to this information. That allows visualisation, for example with Scholia, here for the 2010 paper:
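As a rough illustration (my sketch, not part of the original post): such a citation lookup can be expressed as a SPARQL query against the Wikidata endpoint, using the "cites work" property (P2860). The item identifier below is a placeholder, not the real QID of the 2010 paper.

```python
# Sketch: build a SPARQL query listing all Wikidata items that cite a
# given article, via the "cites work" property (P2860). The QID passed
# in below is a placeholder; look up the real item on wikidata.org.

def build_citing_query(qid: str) -> str:
    """Return a SPARQL query for all items citing the item `qid`."""
    return (
        "SELECT ?citing ?citingLabel WHERE {\n"
        f"  ?citing wdt:P2860 wd:{qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
    )

query = build_citing_query("Q12345678")  # placeholder QID
```

The resulting string can be sent to the public endpoint at https://query.wikidata.org/sparql; Scholia essentially runs queries of this shape under the hood.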

More soon!

For now, if you like to see the CiTO community growing too, please tweet, blog, message your peers about our new editorial:

Willighagen, E. Adoption of the Citation Typing Ontology by the Journal of Cheminformatics. J Cheminform 12, 47 (2020).

Tuesday, July 28, 2020

new paper: "Risk Governance of Emerging Technologies Demonstrated in Terms of its Applicability to Nanomaterials"

Design of the Council and the
processes around it.
In April I reported about a paper outlining NanoSolveIT; now another paper outlining plans has come out, this time detailing the Risk Governance Council which the European Commission asked three H2020 projects to set up. One is RiskGONE, in which our group is involved; the other two are Gov4Nano and NANORIGO:

Isigonis P, Afantitis A, Antunes D, Bartonova A, Beitollahi A, Bohmer N, et al. Risk Governance of Emerging Technologies Demonstrated in Terms of its Applicability to Nanomaterials. Small. 2020 Jul 23;2003303. 10.1002/smll.202003303

What I personally hope we will achieve with this Council is that all our governance is clearly linked, via FAIR data, provenance, and reproducible research, to the underlying experiments. This requires many FAIR approaches, something RiskGONE and Gov4Nano have already been working on closely together in the NanoSafety Cluster WGF.

Of course, all this requires good data (think NanoCommons) and good computation (think NanoSolveIT).

Sunday, July 12, 2020

Journals performance, individual articles, assessment and ranking, and researchers

Sign here. Image: CC-BY-SA.
Okay, there it is: journal performance, individual articles, assessment and ranking, and researchers. It has it all. Yes, it is journal impact factor season.

Most scholars now know when and when not to use the impact factor. But old habits die slowly and the journal impact factor (JIF or IF) is still used a lot to rank journals, universities, articles, and researchers.

I signed DORA, but that does not mean I do not know that the (change year over year of the) IF hints at how a journal is doing. Yes, a median is better than an average. A citation count distribution is even better. After all, a stellar IF still means that tens of percent of the articles in the same period are cited just once or not at all.
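To illustrate why the median and the distribution tell you more than the IF-style average, here is a toy calculation with invented citation counts (not real journal data): a single highly cited paper pulls the average far above what a typical article gets.

```python
# Hypothetical citation counts for the articles of one journal in the
# IF window; one highly cited outlier dominates, as is typical for
# citation data.
from statistics import mean, median

citations = [0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 8, 120]

average = mean(citations)    # IF-style average, pulled up by the outlier: about 11.9
typical = median(citations)  # what a typical article gets: 1.5
```

With these numbers the average is almost eight times the median, even though most articles were cited once or not at all.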

One striking voice was angry that the Journal of Cheminformatics tweeted its new IF. We did not do so without internal discussion and deliberation. Readers of the journal know we do not mention the IF on our front page (as many journals do). We are working on displaying the citation distribution on a subpage of the website. And we want authors to submit to our journal because we value Open Science and have reviewers that value that too. We want articles in our journal to be easily reproduced.

But I know reality. I know many researchers are still expected to report IFs along with their articles. I am one of them (in the past 8 years, articles in a journal with IF>5 were "better"). I have been objecting to this for many years, and fortunately there is a path away from it in The Netherlands. If you must rank articles and researchers, then rank them according to their own work, and not based on the work of others. So, I decided that I had no objection against tweeting the J. Cheminform. IF.

Interestingly, if you really want to push this, you should also not mention journal names in your publication list. Let the scholars ranting against the IF but still cheering a Nature, Cell, Science (etc) article rethink their reasoning.

So, what should we do? How should we move forward? Of course I have some ideas about this; just (re)read my blog. Progress is slow. But I ask everyone who rants about the IF to not just propose better solutions, but to actively disseminate them: implement those solutions and get other people to use them. For example, send your journal an open Letter to the Editor to make a clear statement against the use of the IF as a reason to publish in that journal.

If that is too much for you, at least sign DORA and ask your peers to do so too.

Thursday, July 02, 2020

Bioclipse git experiences #2: Create patches for individual plugins/features

Carrying around git patches is hard work.
Source: Auckland War Memorial Museum, CC-SA.
This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring of version control repositories, while actually keeping the history. For example, you may want to copy or move code from one repository to another. A second use case can be a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but the approach is one I regularly apply in other situations.

This second post talks about how to migrate code from one repository to another.

Create patches for individual plugins/features

While the above works pretty well, a good alternative in situations where you only need to get a repository-with-history for a few plugins, is to use patch sets.
  • first, initialize a new git repository, e.g. bioclipse.rdf:
 mkdir bioclipse.rdf
 cd bioclipse.rdf
 git init
 nano README
 git commit -m "Added README with some basic info about the new repository" README
  • then, for each plugin you need, discover the commit where that plugin was first committed, using the git-svn repository created earlier:
 cd your.gitsvn.checkout
 git log --pretty=oneline externals/com.hp.hpl.jena/ | tail -1
  • then create patches starting from the tree just before that first commit, by appending '^1' to the commit hash. For example, the first commit of the Jena libraries was 06d0eb0542377f958d06892860ea3363e3316389, so I type:
 rm 00*.patch
 git format-patch 06d0eb0542377f958d06892860ea3363e3316389^1 -- externals/com.hp.hpl.jena
(tune the filter when removing old patches if there are more than 99!)
The previous two steps can be combined into a Perl script:
use diagnostics;
use strict;

my $plugin = $ARGV[0];

if (!$plugin) {
  print "Syntax: gfp <plugin|feature>\n";
  exit 1;
}

die "Cannot find plugin or feature $plugin !" if (!(-e $plugin));

`rm -f *.patch`;
my $hash = `git log --follow --pretty=oneline $plugin | tail -1 | cut -d' ' -f1`;
$hash =~ s/\n|\r//g;

print "Plugin: $plugin \n";
print "Hash: $hash \n";
`git format-patch $hash^1 -- $plugin`;
  • move these patches into your new repository:
 mv 00*.patch ../bioclipse.rdf
(tune the filter when moving the patches if there are more than 99! Also customize the target folder name to match your situation)
  • apply the new patches in your new git repository:
 cd ../bioclipse.rdf
 git am 00*.patch
(You're on your own if that fails... and you may have to default to the other alternative then)
  • repeat those two steps for all plugins you want in your new repository

Bioclipse git experiences #1: Strip away unwanted plugins

This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring of version control repositories, while actually keeping the history. For example, you may want to copy or move code from one repository to another. A second use case can be a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but the approach is one I regularly apply in other situations.

For this first post, think of a plugin as a subfolder, though it even applies to files.

Strip away unwanted plugins

  • first make a bare clone, and then remove everything you do not want in your new git repository. Do:
 git clone --bare --no-hardlinks old.local.clone/ new.local.clone/
then use:
 git filter-branch --index-filter 'git rm -r -q --cached --ignore-unmatch plugins/net.bioclipse.actionHistory plugins/net.bioclipse.analysis' HEAD
It often happens that you need to run the above command several times, in cases when there are many subdirectories to be removed.
When you removed all the bits you need removed, you can clean up the repository and reduce the size considerably with:
 git repack -ad; git prune

Thursday, May 07, 2020

new project: "COVID-19 Disease Maps"

Project logo by Marek Ostaszewski.
Already started a few weeks ago, but the COVID-19 Disease Maps project now has a sketch published, outlining the ambitions: COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms (doi:10.1038/s41597-020-0477-8).

I've been focusing on the experimental knowledge we have about the components of the SARS-CoV-2 virion and how they interact with the human cell. I'm at least two weeks behind on reading literature, but hope to catch up a bit this week. The following diagram shows one of the pathways on the WikiPathways COVID-19 Portal:

wikipathways:WP4846, CC0
This has led to collaborations with Andra Waagmeester, Jasper Koehorst and others, resulting in this preprint that needs some tweaking before submission; to an awesome collaboration with Birgit Meldal around the Complex Portal (preprint pending); and to a Japanese translation of a book around a number of search queries against Wikidata (CC-BY/CC0). The latter two were started at the recent online BioHackathon.

Oh, boy, do I love Open Science.

Monday, April 27, 2020

new paper: "NanoSolveIT Project: Driving nanoinformatics research to develop innovative and integrated tools for in silico nanosafety assessment"

Fig. 1. Schematic overview of the workflow for toxicogenomics
modelling and how these models feed into the subsequent
materials modelling and IATA. Open Access.
NanoSolveIT is a H2020 project that started last year. Our BiGCaT group is involved in the data integration to support systems biology part of the Integrated Approaches to Testing and Assessment (IATA) for engineered nanomaterials in Work Package 1. This paper gives an overview of the project, the work, and the goals.

Of course, doing this is not trivial at all. And we have to bridge a lot of different research data, concepts, etc. As such, it is clear how it relates to the other nanosafety projects we have been involved in, such as eNanoMapper, NanoCommons, and RiskGONE.

Afantitis A, Melagraki G, Isigonis P, Tsoumanis A, Danai Varsou D, Valsami-Jones E, et al. NanoSolveIT Project: Driving Nanoinformatics research to develop innovative and integrated tools for in silico nanosafety assessment. Computational and Structural Biotechnology Journal. 2020 Mar;S2001037019305112. 

Sunday, March 29, 2020

Tackling SARS-CoV-2 with big data

This blog post contains a translation I made of the short "our story" Coronavirus te lijf met big data on the MUMC+ website, written by André Leblanc. The Maastricht University Medical Center+ (MUMC+) is a collaboration involving our Maastricht University Faculty of Health, Medicine, and Life Science, of which our BiGCaT research group is part.

Wikidata is a community project and I only use and contribute to it. Scholia is a project started by Finn Nielsen (Technical University of Denmark - DTU), and now has funding from the Alfred P. Sloan Foundation, coordinated by Daniel Mietchen and Lane Rasberry (University of Virginia). Further acknowledgements to Andra Waagmeester (Micelio) and Jasper Koehorst (Wageningen University) for a great collaboration on coronavirus information (see also Wikidata:WikiProject_COVID-19), and to WikiPathways colleagues including, of course, Prof. Chris Evelo and Dr. Martina Kutmon in Maastricht, but also Dr. Alex Pico and others in San Francisco. For me it was one of the selling points of the research group when I joined in 2012.

Tackling the corona virus with big data

Scholars around the world are working relentlessly on the development of a vaccine against the new SARS-CoV-2 coronavirus. Chemist and assistant professor Egon Willighagen contributes in collaboration with colleagues at the BiGCaT Department of Bioinformatics in Maastricht to make data and knowledge easier to find for other scholars. How does that work?

Big data is the new buzzword in the scholarly community. For example: collecting worldwide data around the treatment of cancer, and extracting from that the best personal, unique treatment. In the case of the new coronavirus there is a more general need to simply have access to data. Since the virus outbreak in Wuhan, China, there has been an explosion of new research articles on COVID-19 and the SARS-CoV-2 virus that causes it. The total number of scientific publications about coronaviruses has reached some 29 thousand. These are not only about the new virus, but also about the coronaviruses that roamed the world before, like SARS and MERS. Either way, this makes it practically impossible to read all these articles. Instead, access to this literature has to be provided in a different way, allowing researchers to find the knowledge and data they need for their research.

Willighagen does this by organizing scientific literature, linking information, and filtering the collection of data and publications, making it searchable for scholars. He annotates publications with search terms and author names, and uses unique, global identifiers (like personal identification numbers) to support this. This is not unlike the use of phone numbers or dictionaries.
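As a small aside (my example, not from the original interview): "unique, global identifiers" like the DOI have a recognizable shape, close to the pattern Crossref recommends for matching modern DOIs.

```python
import re

# Pattern close to Crossref's recommendation for modern DOIs:
# "10.", a 4-9 digit registrant code, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def looks_like_doi(text: str) -> bool:
    """Rough check whether a string is shaped like a DOI."""
    return bool(DOI_PATTERN.match(text))

print(looks_like_doi("10.7554/eLife.52614"))  # True
print(looks_like_doi("not-an-identifier"))    # False
```

A check like this only tests the shape, of course; whether the identifier actually resolves is a question for doi.org.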

Various tools

Wikidata is the database used by Willighagen to link the information resources, along with Scholia to visualize the results. For example, Wikidata organizes data around the new virus in a dedicated entry. Willighagen uses these two tools to visualize what this database knows about specific topics.

Researchers can take advantage of a new open access resource edited by Willighagen. Social media are used as well: Twitter is used to increase awareness and mobilize people. Willighagen: "That is from a personal motivation. I tweet articles that show important changes, or that emphasize aspects which show how unique and urgent the situation is." And finally there is WikiPathways, a project initiated by colleagues of Willighagen, to collect even more specific knowledge about COVID-19. Here's the pathway about the SARS-CoV-2 virion:

Thursday, March 19, 2020

new paper: "Wikidata as a knowledge graph for the life sciences"

A figure from the article, outlining the idea
of using SPARQL queries to extract data
from the open knowledge base.
As a reader of my blog, you know I have been doing quite some research in which Wikidata has a role. I am preparing a paper on the work I have done around chemicals in Wikidata, based on what I presented at the ICCS with a poster. So, I was delighted when Andra and Andrew asked me to contribute to a paper outlining the importance of Wikidata to the life sciences. The paper was published in eLife, which I'm excited about too, as they do a significant amount of publishing innovation.

I'll keep this post brief, as I have plenty of work to do, among which is SARS-CoV-2 data in Wikidata. Join this project, after you read the paper: Wikidata as a knowledge graph for the life sciences (doi:10.7554/eLife.52614, or in Scholia):

I'll write up some more queries for this eBook now: Wikidata Queries around the SARS-CoV-2 virus and pandemic.

Sunday, March 15, 2020

SARS-CoV-2, stuck at home, flu, and snowstorms

Scholia linking articles about the COVID19 disease.
Okay, okay, the snowstorm was ten years ago, when we were living in Sweden. We had two snowstorms, each time stuck at home, unable to leave our house. That was okay. We knew that in the next days the streets would be cleaned, and we could continue living our lives.

Now it's different. I've been in 'social distancing' mode since the evening of Friday the 6th, so a bit over a week now. Because I have a flu. Presumably. Testing for SARS-CoV-2 is not routinely done and is reserved for risk groups and patients with severe COVID-19 symptoms.

But the current situation is once in a lifetime. In the bad way. My generation has not had a situation like this yet: a real national emergency. But The Netherlands is coping. The data is scary. The situation in North Italy shows that humans are humans, and the virus doesn't care where it is surviving; what matters is how each country deals with it. And let me make clear, we must learn from the countries that have been in the line of fire already.

(North) Italy has a health care system in the top 5% according to OECD guidelines. Still, they were taken by surprise. But even the forewarned countries have been hesitant. The discussion is complex: a smaller economy (a 1% shrink is estimated right now) also means (as a Dutch professor pointed out 2, 3 days ago) there is less tax money to spend on the health care system.

The sad fact is, we are no longer talking about how to stop SARS-CoV-2. We are now talking about minimizing the number of casualties. A storm it is.

Keep safe, keep electronically in contact with the people around you (mental health), and foremost, wash your hands and practice social distancing. Let the storm not grow much further. This storm will not be over by the next morning. We're in for a rough ride.

Saturday, January 25, 2020

MetaboEU2020 in Toulouse and the ELIXIR Metabolomics Community assemblies

This week I attended the European RFMF Metabomeeting 2020, aka #MetaboEU2020, held in Toulouse. Originally, I had hoped to travel there by train, but that turned out unfeasible. Co-located with this meeting were ELIXIR Metabolomics Community meetings. We're involved in two implementation studies, together for less than a month of work. But both this community and the conference are great places to talk about WikiPathways, BridgeDb (our website is still disconnected from the internet), and cheminformatics.

Toulouse was generally great. It comes with its big city issues, like fairly expensive hotels, but also a very frequent public transport system. It also had a great food market where we had our "gala dinner". Toulouse is also home to Airbus, so it was hard to miss the Beluga:

The MetaboEU2020 conference itself had some 400 participants, of course with a lot of wet lab metabolomics. As a chemist with a good pile of training in analytical chemistry, it's great to see the progress. From a data analysis perspective, the community still has a long way to go. We're still talking about known knowns, unknown knowns, and unknown unknowns. The posters were often cryptic, e.g. stating they found 35 interesting metabolites without actually listing them. The talks were also really interesting.

Now, if you read this, there is a good chance you were not at the meeting. You can check the above linked hashtag for coverage on Twitter, but we can do better. I loved Lanyrd, but their business model was not scalable and the service no longer exists. But Scholia (see doi:10.3897/rio.5.e35820) could fill the gap (it uses the Wikidata RDF and SPARQL queries). I followed Finn's steps, created a page for the meeting, and started associating speakers with it (I've done this in the past for other meetings too):

Finn also created proceedings pages in the past, which I also followed. So, I asked people on Twitter to post their slidedeck and posters on Figshare or Zenodo, and so far we ended up with 10 "proceedings" (thanks to everyone who did!!!):

As you can see, there is an RSS feed which you can follow (e.g. with Feedly) to get updates if more material appears online! I wish all conferences did this!

Thursday, January 16, 2020

Help! Digital Object Identifiers: Usability reduced if given at the bottom of the page

The (for J. Cheminform.) new SpringerNature article template has the Digital Object Identifier (DOI) at the bottom of the article page. So, every time I want to use the DOI I have to scroll all the way down the page. That could be fine for abstracts, but it is totally unusable for Open Access articles.

So, after our J. Cheminform. editors telcon this Monday, I started a Twitter poll:

Where do I want the DOI? At the top, with the other metadata:
Recent article in the Journal of Cheminformatics.
If you agree, please vote. With enough votes, we can engage with upper SpringerNature management to have journals choose where they want the DOI to be shown.

(Of course, the DOI as semantic data in the HTML is also important, but there is quite good annotation of that in the HTML <head>. A link out to RDF about the article is still missing, I think.)
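To illustrate why the <head> annotation matters (the HTML snippet and DOI below are invented for illustration): the Highwire/Google Scholar style citation_doi meta tag that many journal pages carry can be read programmatically, regardless of where the visible DOI sits on the page.

```python
from html.parser import HTMLParser

class DoiMetaParser(HTMLParser):
    """Collect the content of a <meta name="citation_doi"> tag."""
    def __init__(self):
        super().__init__()
        self.doi = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") == "citation_doi":
                self.doi = attributes.get("content")

# Invented snippet in the style journal article pages use;
# the DOI is a placeholder, not a real identifier.
html = ('<html><head>'
        '<meta name="citation_doi" content="10.1234/example.doi">'
        '</head><body>...</body></html>')
parser = DoiMetaParser()
parser.feed(html)
print(parser.doi)  # 10.1234/example.doi
```

So machine readers are fine either way; the scrolling problem is purely one for the human reader.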