Sunday, December 29, 2019

Groovy Cheminformatics for CDK 2.3

Last week I worked on my Groovy Cheminformatics book and made a first release for CDK 2.3 (doi:10.5281/zenodo.3590374). One of the new features is that the Groovy code runs on the command line without having to download or install something first:

The CreateAtom3.groovy example with @Grab instruction.
There is still plenty to migrate from the older paper version; for example, the Depiction chapter is still missing.

Anyways, I hope you like it :)

WikidataCon, Google Summer of Code Mentor Summit, OpenRiskNet, and the Beilstein Open Science meeting

OpenRiskNet workshop material.
Autumn has been hard. Due to an unfortunate combination of events, the workload was hardly bearable. By mid-December it started to slow down, but not enough. That said, I had a pleasant two weeks in October, when I attended four meetings. Busy as well, but inspiring. Normally, I blog about meetings I attend, but I haven't had time for that yet. In fact, that has become more common in the past few years.

Beilstein Open Science
The first in the row was the Beilstein Open Science Symposium. It was the second time I attended this meeting. It had a nice line-up, but I would not mind if it had a hackathon component too. I spoke at this meeting about Wikidata and Scholia, of course (my slides are available at doi:10.5281/zenodo.3492008). It was good to see updates on various projects; see this abstract book.

Google Summer of Code Mentor Summit
After the meeting I walked from the venue down to the train station through the vineyards. Very enjoyable. The train to Frankfurt Hbf takes long, and the train to Muenchen was delayed by an hour. So, I missed the formal opening of the Mentor Summit, but I was in time for the goodies. The meeting was awesome: 15 years of open source projects, with plenty of science projects. I was there as mentor of the NRNB organization, but I proudly wore my first GSoC t-shirt from 2007, impressing even the local organization :) Back then, Alexander worked on chemistry search support for the KDE desktop.

OpenRiskNet Final Workshop
Next up was the OpenRiskNet Final Workshop, held in Amsterdam. Just far enough from home to need a hotel. Because of overlap with the next meeting, I even had to skip the last afternoon session. Running from one meeting to another, I actually had just one day to prepare, which was tight, as I was going to give a workshop. Fortunately, I had plenty of material, and it just came down to putting it in the right order. I decided to try something new, and use Markdown (git: OpenRiskNet/workshop/OntologyWorkshop). I copied the idea of the prev/next buttons from one of the other workshops in the same git repository, but I am particularly happy with the Answer button, which toggles the answer (see above screenshot).

The final meeting in this series was WikidataCon (or this Scholia page) in Berlin. My abstract around Wikidata and Scholia was accepted and I joined in the informal hack sessions. The latter resulted in an externally developed Scholia feature (see doi:10.3897/rio.5.e35820), so I cannot complain. In case you missed it, Wikidata is booming. In fact, the bigger problem with Wikidata right now is its growing pains, repeatedly visible in systems not being able to keep up with the effort. Besides my blog, I can recommend this preprint, if you want to read up on Wikidata. My talk, Cheminformatics to improve Wikidata on chemical compounds, was actually recorded, and you can watch the video here.

Thursday, November 21, 2019

new preprint: "Wikidata as a FAIR knowledge graph for the life sciences"

Entity diagram of life sciences data in Wikidata.
I do not want to make it a habit to send out blog messages about preprints (you see them on our BiGCaT website) and prefer to wait with blogging until the article is formally published, but since I have mentioned this preprint to so many people now, it's likely worth checking out.

I'm happy to have been able to contribute to this story by Andrew Su's team and the many other people involved, as I do think Wikidata is a game changer. Our work lies in the corner of small compounds in Wikidata, as you will have been able to see from various presentations at conferences.

Some further posts about what I have been doing in Wikidata, related to this preprint:

Or just generally search for Wikidata in my blog, because there is a lot more to check up on.

Friday, November 01, 2019

Google Scholar has become a channel of spam

Apparently, some culprits have found a way to escape the otherwise impressive spam detection by Google, and managed to spam Google Scholar:

Now, Google Scholar has a long history of learning what a serious paper looks like, and I'm sure they will learn to recognize this kind of spam too. If not, it will be the end of Google Scholar, I'm afraid.

Monday, October 14, 2019

ChemCuration 2019 Poster Conference: Call for Posters

Twitter profile.
It giet oan! That is a Frisian phrase announcing that something unlikely is going to happen, and it is particularly associated with the Elfstedentocht.

ChemCuration 2019 is a go. The website is online, the Twitter account and hashtag are ready, we got a poster prize, and here is the call for posters!

    On December 3 the first ChemCuration conference will take place. ChemCuration 2019 is a one day, online-only conference around data curation and curated data in the chemistry domain. During the entire conference day, you can participate by tweeting about the poster that you uploaded, along with the meeting hashtag, and responding to questions about your poster in the 24 hours of the conference day. The poster must be available in an online repository (e.g. Zenodo or Figshare) under the CCZero, CC-BY or CC-BY-SA license prior to the conference.

    This is the meeting scope: anything around data curation and curated data of open science data in chemistry. This includes but is not limited to: 1. a new release of curated open data; 2. FAIR metadata around open data; and 3. open source tools for data curation.

    How do I participate in ChemCuration?
    You can participate in this online poster conference by presenting your poster on Twitter
    during the conference day. You do this by first archiving your poster via Figshare or Zenodo,
    with an open license (e.g. CCZero or CC-BY). Then, during the day you tweet an image of
    (part of) your digital poster with the #chemcur2019 hashtag, a short summary, and a link to
    your online poster with its DOI. The archived poster should be a regular A0 poster (WxH =
    841 x 1189 mm or 33.1 x 46.8 in).

    Do I need to register?
    Registration is not obligatory to participate. However, if you would like to be eligible for a poster prize, then registration is required, by Nov. 30th, 2019. The registration form is found at

    More information can be found on the website and on Twitter.

Wednesday, October 09, 2019

ChemCuration: a small trick to fix the SMILES of glucuronides

Glucuronide functional group.
With the ChemCuration 2019 online poster conference nearing, my upcoming talks about chemistry in Wikidata (also needing curation), and the much longer process of curating metabolite(-like) structures in WikiPathways, I decided that something I tweeted earlier this week is actually quite useful, and therefore something I should really write up in my lab notebook.

Glucuronide is an example of a (biological) functional group. And there are several databases that do not always represent its stereochemistry correctly. That is an interoperability (and thus FAIR) problem. Correcting this is not trivial, particularly if you have to redraw the same glucuronide group again and again.

So, not looking forward to that, I invested a bit of time to find a SMILES trick. What if I had a SMILES snippet that I could easily copy/paste and attach to the SMILES of the chemical structure it is attached to? Here goes.


I just realized that the original 3 I used can better be a 9, which is less likely to occur in the SMILES of the rest of the molecule. The period at the end is also deliberate. That way, I can just copy/paste the SMILES of the rest directly after that period. Then I get a disconnected structure, but I only have to put a 9 next to the atom that is binding to the glucuronide. So, let's say the R group is methane, I get:


Now, next stop: CoA and other common biological tags.
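
The copy/paste trick above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: the function name attachTag is made up, and the glucuronide fragment used here is a simplified one without stereochemistry, so it is not the curated snippet itself. It relies on the SMILES rule that a ring-closure digit may pair two atoms on either side of a period, which turns the "disconnected" structure into a bonded one.

```javascript
// Hypothetical helper (not from any library): joins a reusable SMILES "tag"
// to the rest of a molecule with a ring-closure bond across the period.

// Simplified glucuronide fragment WITHOUT stereochemistry, for illustration only.
// The 9 after the first O is the open end of the bond; the trailing period is deliberate.
const GLUCURONIDE_TAG = "O9C1OC(C(O)=O)C(O)C(O)C1O.";

// tag:     SMILES fragment that contains `label` once and ends with a period
// rest:    SMILES of the rest of the molecule
// atomEnd: index in `rest` just after the atom that binds to the tag
// label:   ring-closure digit shared by tag and rest (9 by default)
function attachTag(tag, rest, atomEnd, label = "9") {
  if (!tag.endsWith(".")) {
    throw new Error("the tag is expected to end with a period");
  }
  if (!tag.includes(label)) {
    throw new Error(`the tag does not contain the label ${label}`);
  }
  if (rest.includes(label)) {
    // Conservative check: the digit may already be in use in the rest of the molecule.
    throw new Error(`label ${label} already occurs in the rest of the molecule`);
  }
  if (atomEnd < 1 || atomEnd > rest.length) {
    throw new RangeError("atomEnd must point just after an atom in rest");
  }
  // Paste the rest after the period and put the label next to the binding atom.
  return tag + rest.slice(0, atomEnd) + label + rest.slice(atomEnd);
}

// A single carbon as the rest: the 9 goes right after the C.
console.log(attachTag(GLUCURONIDE_TAG, "C", 1));
```

The check on the label is why a rarely used digit such as 9 is convenient: if the rest of the molecule already uses it, the helper refuses rather than silently creating a wrong bond.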

Sunday, September 29, 2019

Newspaper "De Volkskrant" trashtalks the Dutch research funder NWO

Headline: "This astronomer took a photo
of a black hole, got famous, and has no
research funding now."
Update: there are two articles, one a news item and the other the interview. I uphold my comments here. They are clearly co-published and should be seen as one event (another intended pun).

I find it very disturbing what role newspapers have in the selection of research to fund. Of course, most of this is indirect, but claiming that a newspaper decides what is fundable and what is not, is crossing a line. It all started with this interview: Waarom de Nijmeegse astronoom die een foto maakte van een zwart gat nu zonder geld voor onderzoek zit ("Why the Nijmegen astronomer who took a photo of a black hole is now without money for research"), or Deze astronoom maakte een foto van een zwart gat, werd wereldberoemd en zit nu zonder onderzoeksgeld ("This astronomer took a photo of a black hole, became world famous, and now has no research funding"). Interestingly, De Volkskrant frequently changes headlines for, I can only assume at this moment, clickbait purposes. This article had at least two headlines, something I will come back to later.

At first, I ignored the article. It is just an interview and the science corner of De Volkskrant frequently has opinions, columns, etc, and the amount of science news is too low anyway (IMHO).

But then there was apparently an uproar on Twitter, which I had missed, but a tweet from Daniël Lakens caught my eye (update: this tweet replies to a tweet about the interview, not the news item; although both were published in parallel and are closely linked, I disagree with the argument that the two are not connected because they are of different genres):

As said, it's not a news item, but an interview. By putting it in the Science corner of the newspaper, it is upgraded to news. And Daniël is quite right: not getting a grant is not news. And then we get to the gist of the uproar: why is this rejection so special that it deserves to be so prominently published in a national newspaper? Apparently, the research is too big to fail. The chance of success too high. The research internationally too glossy to not get funded.

And that puts De Volkskrant in a new role: not covering news, but lobbying for a certain Dutch research agenda. Of course, it's an open secret that this is how it works, but you can be honest about it. Falcke's research is world famous because the newspaper made it.

Now, the reason why this prominent place of this interview is disturbing is a bit complex. The issue at hand has many angles, and the aforementioned positioning of an interview as news is one of them. I will try to cover a few more of them.

The grant model
First, Falcke is not the first to see "excellent" (sic, see this) proposals rejected. It is common knowledge that this is how it works. There is not enough research funding (The Netherlands has been underspending for years now, compared to international agreements, though this problem is bigger). Research is not done efficiently. Etc, etc. But just an excellent research proposal is not enough anymore. Only a percentage of excellent proposals get funded (I guess about ~10-15% overall, while at least ~20% (rough guess) of all proposals are not significantly different from the top-ranked proposal). That's a fact, and I leave it as an exercise to the reader to look up the appropriate primary literature. De Volkskrant could have done that. I'm looking forward to their investigative journalism article about that.

In this respect, the interview makes a superb sensational story of what research life is nowadays. Not just for Falcke, but for basically every researcher in The Netherlands. A recent study showed that most researchers work overtime. Personally, I have been working as much as a full professor while only having an assistant professor position. To be fair, part of that goes into outreach, but how else do you get "headhunted" by a newspaper as being, well, what, important?

Red flags
The interview has a number of red flags for me. Things that trigger some frowning about the necessity of this interview. I'll put in some quotes, translate them to English, and comment.

".. qua werkdruk en stress de heftigste periode uit mijn loopbaan"

English: ".. in terms of workload and stress the toughest period in my career". This is not news. This is a known problem, and the reason why we have #WOInActie in The Netherlands.

"Want zonder geld om – bijvoorbeeld – jonge onderzoekers in dienst te nemen, kun je geen wetenschap beoefenen."

English: "Because without funding to - for example - hire young researchers, you cannot do research." Also not new, but that's not why I bring this up. Because this is sad and something where we are currently in a crappy situation: everything seems to be based on grant proposals. Why does the interview seem to trash the Dutch research funder NWO, when it could just as well have trashed the Radboud University for not providing the group with funding and making it dependent on low-chance external funding? This too is a long story, but when I did my chemistry degree, and after that my PhD in chemistry, at this same university, each group always had (at least, as far as I could see) a PhD candidate and post-doc. You could do research without grant funding, but simply not as much.

"de vorige drie aanvragen die we voor dit project hadden ingediend, waren door NWO ook afgewezen."

English: "the past three proposals for this project were also rejected by NWO." Now, that's something. But again, not new. Ever since I started research, more than 20 years ago, this has been the world I have been living in. Some research topics are hot, others are not. The fact that black holes were hot in the past (no pun intended), does not entitle any researcher to unlimited funding in the future. Hotness of topics comes and goes. It is disturbing that De Volkskrant seems to claim it can decide what is hot and what is not. In fact, of course, they do: put something on the front page, and it becomes hot. But De Volkskrant has no reason to complain if a research funder disagrees.

That is disturbing, a newspaper that decides what is important and what is not. At best, the newspaper presents cases in a proper context, and leaves the conclusion to the reader. Pointing fingers at a funder because your pet project did not get funding this time. Very disturbing. I only have to look at the UK to see where that ultimately leads.

"Waarom zou je de bal dan afleggen op iemand die (nog) niet in scoringspositie staat?"

Okay, I'll translate this to the meaning, instead of literally. Translate the full paragraph to see where this context came from. English: "Why would one fund research that does not (yet) have a chance of succeeding?" This quote comes from the interviewed researcher directly. This is very naive of this seasoned professor, if you ask me. It degrades him to a whining boy, because the teacher in kindergarten said that now it was time for someone else to play with the toy too.

But there is a second dimension to this, one that has been discussed for years and years, and it is unbelievable the interviewer cannot respond with something better than "Can you explain?". Of course, what I'm hinting at is the "winner takes all" approach which greatly reduces the diversity of research. There is no sound evidence this is how you innovate or benefit society. We all know that many research breakthroughs were not predicted, resulted from chance, etc. Very complex question.

However, the bottom line, as any serious funder like NWO knows, is that you also need to fund promising research. In fact, the European Research Council was set up specifically for that. The newspaper knows that and Falcke knows that. This schwalbe is also disturbing and very bad journalism (pun intended).

"de koevoet die we nodig hadden om serieus te worden genomen door onze Amerikaanse collega’s"

English: "the lever we needed to be taken seriously by our American colleagues". I don't think I have to explain the problems with that. Again, there is zero news here. Yes, international collaboration is important. Yes, many Dutch researchers with international collaborations also see grants get rejected. No, the prestige of your collaborators is a thing but does not make your research more important. Etc, etc. This line of trash talk against NWO goes on (how dare they not fund this international collaboration):

"We zitten aan tafel met topuniversiteiten als MIT in de VS"

English: "We are sitting at the table with top universities like the MIT in the U.S.A.". Again, nothing new. Many Dutch researchers collaborate with people at top universities. But it is totally irrelevant. There are so many angles here, that that perhaps explains why a nonsense quote like that made it into this article. First, these top universities. "Top" here is subjective (enough literature about that). Partly, they are "top" because they are "large". Then you see the problem: working with the largest comes easy: there are simply many researchers there to work with, and all of them are eager to work with you, because it adds to their prestige (it makes them even larger).

Size matters, you can argue. This argument has been frustrating research funding for many years now. Grants are rejected because your IF scores are not large enough, or because your list of publications is not inflated by having many people working in your group. Etc, etc. This is a can of worms so large, YOU CANNOT SPELL RED FLAG BIG ENOUGH.

And that recursion. Oh, sigh. Now, just for a second, couple this argument (no, not really) to the previous one. It effectively says: if big it should get bigger and if it is not big (yet), why should it get bigger. Well, there is enough written in economics about diversification, and I am not an economist, but I hope you see my point.

Okay, there is so much more, but I have a grant proposal to write on my Sunday, just like every other researcher in The Netherlands. One last one, related to this:

"Een goed voorstel schrijven voor de EU is iets waarmee je twee maanden fulltime bezig bent."

English: "Writing a good EU proposal takes two months fulltime." I cannot disagree with that. Been there, done that. With a success rate of about 15%, you need on average some six or seven proposals per funded grant, which at two months each means about a year of full-time writing for one grant. Falcke should consider himself lucky he has a position where he can take this risk. Most ECRs that are fighting for funding to continue their groundbreaking research do not get the opportunity to free up their schedules. That explains why research money results in more research money (see that recent study on the NWO grant system).

Nothing to see here. Keep calm, and move on.

Oh, George van Hal, I hope to meet you at the next #WOInActie strike. I can introduce you to a few friends who also regularly work hard, on Sundays, and also see proposals rejected. Van Hal, maybe you found this interview news, but I find it hard to say "even goede vrienden", because this framing of a rejected grant hurts science, and therefore you hurt me. I hope this post explains a bit why that is the case. I'd love to continue talking about it, even if I am not a hotshot black hole photographer.

Tuesday, September 24, 2019

new paper: "The metaRbolomics Toolbox in Bioconductor and beyond"

Forget about Python being the prime data analysis platform: there are plenty of alternatives and R has been one of them. With CRAN, rOpenSci, and Bioconductor (doi:10.1186/gb-2004-5-10-r80) the platform has three efforts where you can publish your R work. I think of them as scholarly journals: the peer review is strong with them. Anyways, over the years I did my share of R coding (a good bit of my PhD is written in R) and contributed to a few R packages. Nowadays I don't do a lot of R coding anymore. (Sorry, genalg users: I know this package needs some serious love, and a huge thank you to those (like Michel Ballings) who have picked up the package!!)

But regarding the packaging, I still contribute my bits. For example, with rWikiPathways and BridgeDbR. So, I happily accepted the invitation to contribute to a paper that was published this week and outlines a ton of R packages that are used in the data analysis of metabolomics data: The metaRbolomics Toolbox in Bioconductor and beyond (doi:10.3390/metabo9100200), led by Jan Stanstrup and Steffen Neumann. And many R packages it discusses indeed! The paper is like an atlas, showing you around in an adventurous world of metabolomics, as is clear from this dependency graph of Figure 2:

CC-BY. Figure 2 from the article.

But there is more ongoing. The article, being CC-BY, is being rewritten as a book, and I have some work left to do to add BioSchemas to Bioconductor R package web pages, get more packages to use BioSchemas in their package vignettes (so the ELIXIR TeSS can automatically pick them up), and there is some more awesomeness being discussed. Well, that's not there yet, but you can start reading this metaRbolomics bible.

Thanks to everyone involved!

Sunday, August 25, 2019

Finding potential reviewers using Scholia

First, if you like to learn more about Scholia, check this list of previous posts.

Now, yesterday I had to invite reviewers for a submission to the Journal of Cheminformatics. This can be hard, and is harder when more authors are involved, from multiple institutes. Existing tools by publishers (including SpringerNature) do not excel at detecting possible CoIs. In fact, they already have trouble finding authors with expert knowledge. This is where I come in. But it's easy to overlook a possible CoI. Anecdotally, I once sent out a review request by accident to a reviewer sitting in the same corridor.

So, I want safety checks. The more, the better. Same institute/city? Better not. Published together in the past three years? Maybe. Currently collaborating? No one checks joint grants. Seriously, we rely on honesty from the reviewers (though open peer review would encourage that honesty even a bit more). But FAIR data can help us here. This is, for example, one reason why I am happy the journal now requires ORCIDs for all authors of a manuscript (see doi:10.1186/s13321-019-0365-4).

Finding potential reviewers using Scholia: a recipe
(originally published as this twitter thread)

So, I have a set of author ORCIDs of a submitted manuscript, and a list of potential reviewers. How do I know if any two on the two lists have recently worked or published together? First, I can make a WDQS query like this to get the items for the ORCIDs (for a published article, not the submission):
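
A minimal sketch of what such a lookup could look like, not the query from the original thread. It assumes the ORCID iDs are available as plain strings and uses the Wikidata property P496 (ORCID iD); the helper name orcidsToItemsQuery and the placeholder identifiers are made up for illustration. The function only builds the SPARQL text, which can then be pasted into the Wikidata Query Service.

```javascript
// Hypothetical helper: builds a SPARQL query for the Wikidata Query Service (WDQS)
// that looks up the Wikidata items of authors by their ORCID iD (property P496).
function orcidsToItemsQuery(orcids) {
  // ORCID iDs are four blocks of four characters; the last one may be an X.
  const pattern = /^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$/;
  for (const orcid of orcids) {
    if (!pattern.test(orcid)) {
      throw new Error(`not an ORCID iD: ${orcid}`);
    }
  }
  // One quoted literal per ORCID iD, for the VALUES clause.
  const values = orcids.map((orcid) => `"${orcid}"`).join(" ");
  return [
    "SELECT ?author ?authorLabel ?orcid WHERE {",
    `  VALUES ?orcid { ${values} }`,
    "  ?author wdt:P496 ?orcid .",
    '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }',
    "}",
  ].join("\n");
}

// Placeholder ORCID iDs; replace them with those of the manuscript's authors.
console.log(orcidsToItemsQuery(["0000-0000-0000-0001", "0000-0000-0000-0002"]));
```

Authors without a Wikidata item simply do not show up in the result, which is itself a useful signal that the later co-authorship check will be incomplete.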

I can extend this query to look up and summarize these authors in #Scholia with this query,

This Scholia link shows this page:

This Scholia page immediately shows me which articles these authors wrote together. I can now just add the Wikidata QID of the prospective reviewer and see what and when they co-authored. Let's say I have Noel O'Boyle in mind as reviewer: I add ",Q28540731" to the Scholia URL, so that its list of QIDs ends in ,Q32565639,Q57415846,Q28540731:

I immediately see that Noel has not published together with the authors of this manuscript. Of course, I have to realize that Wikidata/Wikicite is not complete, but it at least gives me some extra safety check. Second, this also does not take into account if they work at the same institute, or have an academic history, as @Ben_C_J mentioned. It also ignores that Noel works at a company collaborating with PubChem, the project of the authors. For that, a different query approach is needed.

A final note, everyone can check if they are in Wikidata with this Scholia URL pattern:${ORCID} where you replace the last bit with some ORCID, e.g. for me:
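
The URL part of the recipe above can be sketched as follows. The base URL, the /authors/ and /orcid/ paths, and the function names are assumptions for illustration (Scholia has moved hosts over the years), so check them against the Scholia instance you actually use. The QIDs are the ones mentioned in the recipe; the ORCID iD is a placeholder.

```javascript
// Assumption: base URL of the Scholia deployment; adjust it if Scholia is hosted elsewhere.
const SCHOLIA_BASE = "https://scholia.toolforge.org";

// URL of the Scholia page that summarizes several authors together:
// the Wikidata QIDs are simply joined with commas.
function scholiaAuthorsUrl(qids, base = SCHOLIA_BASE) {
  for (const qid of qids) {
    if (!/^Q[1-9]\d*$/.test(qid)) {
      throw new Error(`not a Wikidata QID: ${qid}`);
    }
  }
  return `${base}/authors/${qids.join(",")}`;
}

// URL that looks up a researcher in Scholia by ORCID iD.
function scholiaOrcidUrl(orcid, base = SCHOLIA_BASE) {
  return `${base}/orcid/${orcid}`;
}

// Two of the manuscript authors' QIDs, plus the prospective reviewer appended at the end.
const manuscriptAuthors = ["Q32565639", "Q57415846"];
console.log(scholiaAuthorsUrl([...manuscriptAuthors, "Q28540731"]));

// Placeholder ORCID iD; use your own to check whether you are in Wikidata.
console.log(scholiaOrcidUrl("0000-0000-0000-0001"));
```

Keeping the reviewer's QID last makes it easy to swap in another candidate while leaving the author list untouched.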

Sunday, August 18, 2019

References, citations, and bibliographies. Oh, and tools and formats and APIs.

Give an explorer some tools, and they will study things, they will find new (or better) answers. Every scientist, boy/girl scout, teacher knows this. Give a kid a ball, and they will invent a game. Give them a magnifier, and they will explore a new world.

In the first few hundred years of science, the instruments were physical things, and often the instrument was merely the human brain. In the past 30 years, electronic brains (aka software) have become an increasingly important instrument in science. It's not judgmental or biased but, of course, only as good as the source code. So, in 1994 I got a new instrument: the Internet (yes, with a capital at the time). One of the things I did at the time was play with new instruments. For example, I played with DocBook. But DocBook did not have BibTeX. So, I wrote BibTeX for DocBook. I called it JReferences. It worked for me.

Give an explorer some tools, and they will study things. I got educated and became a scholar.

Now, one thing I love is to show people new instruments (which I do with this blog, for example) and to educate people in the tricks of doing research and being a scholar (~0.5 FTE of my day job). When Lars found an interesting topic, I only had to give him the tools and he would use them. And with time, he started developing new tools, new instruments. Now, to be fair, he's more dedicated than me, and the tool I want to blog about is so much better done than my JReferences :)

Top half of the first PDF page of the article.
So, at some point I realized that it was worth writing it up, and I advised that. And he did. All I had to do was give him the instruments and explain some of the scholarly tricks, and he applied them very well, resulting in this PeerJ Computer Science publication: Citation.js: a format-independent, modular bibliography tool for the browser and command line (doi:10.7717/peerj-cs.214).

Give an explorer some tools, and they will study things, and they will improve our world.

So, Lars gave me a new instrument: citation.js. In the more than two years the tool now exists, I have used it for two things: first, I used it on my website to give references of typical literature. Second, I use it for the Groovy Cheminformatics with the Chemistry Development Kit and A lot of Bioclipse Scripting Language examples books, as explained in this blog post.

Now, Lars had already implemented a number of feature requests I put in. The Altmetric logo was one of them, but also an ORCID plugin, that will create a bibliography with just a short snippet of JavaScript and your ORCID identifier (oh, and a populated ORCID profile, of course).

He told me to use his template tool, and I gave it a try. I think I was an early adopter and the amount of documentation has improved since Friday, but with his help I wrote a plugin for PubMed identifiers. So, you can now simply put references in your webpages by just listing their PubMed identifiers (I used this tool to create a custom citation.js bundle with DOI, PubMed, and CSL support):

<html>
<head>
  <script src="./citation.js" type="text/javascript"></script>
  <script type="text/javascript">
    // the custom bundle makes require() available in the browser
    const { Cite } = require('citation-js')
    async function main (pmid) {
        // resolve the PubMed identifier into citation data
        let example = await Cite.async(pmid)
        let output = example.format('bibliography', {
            format: 'html',
            template: 'vancouver',
            lang: 'en-US',
            append ({DOI}) { return `doi:${DOI}` }
        })
        document.getElementById("placeholder").innerHTML = output
    }
  </script>
</head>
<body onload="main('pmid:31281945')">
  <div id="placeholder"></div>
</body>
</html>

Awesome! Give me some instrument, and I will try to find time to use it to study things. I think I'll be using citation.js in many projects in the coming years :) Note that the append() functionality can be used to add Altmetrics buttons or links to, say, EuropePMC. Well, just read his paper.
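
As a rough sketch of that last idea, an append callback that links to Europe PMC instead of printing the DOI could look like this. It assumes that the entry passed to append() is a CSL-JSON record with a PMID field and that the Europe PMC article URL follows the pattern shown; the function name appendEuropePmcLink is made up for illustration and is not part of citation.js.

```javascript
// Hypothetical append callback: receives one bibliography entry and returns
// the HTML to add after that entry.
// Assumptions: the entry carries a PMID field, and the Europe PMC URL pattern below.
function appendEuropePmcLink(entry) {
  if (!entry || !entry.PMID) {
    return ""; // nothing to link to, so add nothing
  }
  const pmid = encodeURIComponent(entry.PMID);
  return ` <a href="https://europepmc.org/article/MED/${pmid}">Europe PMC</a>`;
}

// Stand-alone check with a minimal entry, using the PubMed identifier from the example above.
console.log(appendEuropePmcLink({ PMID: "31281945" }));
```

In the page above it would be passed as `append: appendEuropePmcLink` in the options of example.format(), replacing the DOI-printing callback.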

Give someone a kid, and they will be proud.

Sunday, August 11, 2019

Structure of colibactin elucidated

Structure of colibactin.
Structure elucidation is still a thing. C&EN reported yesterday that a team has published the structure of colibactin (doi:10.1126/science.aax2685), previously not known, despite the major human health impact (cancer). Now, since the article did not seem to have a SMILES, InChI, InChIKey, or even an IUPAC name, I hope I redrew it correctly (see right). The manuscript and supplementary information is, btw, massive in experimental data. Sadly, little of that is FAIR :(

And because there is no open source IUPAC name generator, I cannot provide that either. But I've submitted the structure to PubChem, so hopefully we have the IUPAC name soon.

In the past I would have provided this info in my blog, but we now have Wikidata and Scholia. So, I created a new Wikidata item for the structure, with some initial info, like SMILES, InChI, and InChIKey (using Bacting, of course):

The new publication does not seem to provide experimental physchem properties of colibactin, but before reading the article in detail, I get the impression they simply do not get to synthesize enough of the compound to do such measurements. They do provide NMR and MS data, though. A lot.

Colibactin is one of those compounds for which a lot was already known about the biology, and there are some 42 articles in Wikidata that discuss the compound and its biological properties. I linked them to the new item for the compound, and did some additional annotation, giving this nice Scholia page with this topic graph:

Sunday, August 04, 2019

Contributing to Climate Research?

As a chemist/biologist, my day-to-day work is not really related to climate research. Yet, the effects of the crisis are, of course. I have been pondering how I could contribute my small bits. And after some weeks, I realized that I could repurpose the Zika Corpus idea developed by Daniel Mietchen. And, of course, then there is our Scholia project, where annotations of research articles are visualized. So, given that the climate crisis is a truly global problem, I continued what others had started before me: annotating climate research articles with the region or location they are associated with. That way, you can look up the effects of the climate crisis in your own region.

Mind you, most literature is not annotated with main subject yet, let alone country. But that's at least something I can do (along with taking the train as often as possible, to replace the airplane). And you can join: here's the list of climate change articles without (additional) subject annotation. Another interesting annotation you can do: species.


Africa (part of it; it's a huge continent!)


Nanoinformatics page in Wikipedia

This spring I contributed to a joint project, coordinated by the NanoWG, to write a Wikipedia article about nanoinformatics (funded by NanoCommons). I dived into digging up the history of the term nanoinformatics, and isolated a few early sources where the term was first used, coined if you like. At the same time, the page needed to give an encyclopedic summary of the research field. Thanks to everyone who contributed, in particular John, Mark, and Fred!

I think we succeeded quite well, and the page has become a rich source of literature, though far from exhaustive. If you want a longer list of nanoinformatics literature, then perhaps check out the Scholia page about nanoinformatics (and notice the RSS feed, to get informed about new nanoinformatics articles):

Saturday, July 13, 2019

Standing on the shoulders: but the shoulders are 200 years old

"Houston, we have a problem. We're standing on the shoulders of old scholars, but it feels a bit shaky."

Well, no wonder. While rocket science has clear foundations, the physical laws of nature, for many other research fields it's trickier. We rely on hundreds of years of knowledge and assume (not trust) that work to be true. And that knowledge is seemingly disappearing very fast (remember my graveyard of chemical literature observation). Published literature, generally, is too hard to reproduce to be seen as an accurate capture of research history. In other words, these shoulders are 200 years old, and our support is failing. 

Open Science attempts to overcome these issues. It defines an environment where all research output is important, where everyone has access to shoulders, and trust can be replaced by reproducibility. This is a huge transition, ongoing for some 20 years now.

With my work as one of the two Editors-in-Chief of the Journal of Cheminformatics, I try to contribute to making this happen, sooner rather than later. It's not been an easy ride, and there is so much left to do. And I do not always agree with the effort put in by Springer Nature here, as is clear from this reply.

Figure 1 from the latest editorial.
But I am happy to work with Rajarshi, Nina, Matthew, and Samuel to support the Open Science community in chemistry, for example, by allowing publications that describe a piece of open source cheminformatics software (Software article type). We're limited by what BioMedCentral can offer us, but within that context try to make a change.

The journal has now existed for 10 years, as marked by our latest editorial. In it, we describe our adoption of GitHub as a free, extra service, where we fork source code published in our journal, and announce our adoption of the obligatory ORCID for all authors.

These things bring me back to those shoulders. The full adoption of the ORCID allows research to be more easily found (more FAIR) and the copying of the source code aims at making the shoulders on which future cheminformatics stands more solid. Minor steps. But even minor steps matter.

Let's see where our journal takes open science cheminformatics.

Oh, and since you are reading this, I would love to see the American Chemical Society be more open to Open Science too. Please join me in requesting them to join the Initiative for Open Citations.

Saturday, June 22, 2019

Bacting: Bioclipse from the command line

Source. Wikipedia. Public Domain.
Because more and more of the cheminformatics I do is with Bioclipse scripts (see doi:10.1186/1471-2105-10-397), and because Bioclipse is currently unmaintained and has become hard to install, I decided to take the plunge and rewrite some stuff so that I could run the scripts from the command line. I wrote up the first release back in April.

Today, I release Bacting 0.0.5 (doi:10.5281/zenodo.3252486) which is the first release you can download from one of the main Maven repositories. I'm still far from a Maven or Grapes expert, but at least you can use Bacting now like this without actually having to download and compile the source code locally first:


workspaceRoot = "."
cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

println cdk.fromSMILES("CCO")

If you have been using Bacting before, then please note the change in groupId. If you want to check out all functionality, have a look at the changelogs of the releases.

If you want to cite Bacting, please cite the Bioclipse 2 paper and for the version release, follow the instructions on Zenodo. Pending an article. The Journal of Open Source Software? Sounds like a good idea!

Sunday, June 16, 2019

National scholarly societies. Why?

Plan S has caused quite some discussion about what knowledge dissemination is. When it was announced, I was hesitant. But very quickly the opposition to Plan S convinced me that apparently something like Plan S is needed. I think Plan S focuses way too much on journal-channeled publishing, whereas I had rather seen it focus on Open Science (it partly does). We argued as much with cOAlition S recently (doi:10.5281/zenodo.2560200):

The risks brought forward by Plan S opponents are real. I don't always agree with the arguments, or simply just don't understand them. With some I agree, but disagree on the alternative. This has been a difficult position to follow, as some discussions taught me. For example, some claimed that I am in favor of article processing costs. Only in a toxic, black-white world does not being against them equal being in favor of them.

Journal articles have proven to be an expensive exercise of knowledge dissemination. It was the right solution, certainly 200 years ago. The cost has to be paid by someone. Via subscription (the "old" model), via package deals with nations, universities, etc (upcoming), via a friendly funder (some wealthy foundation), or via the authors. Not accepting that publishing costs money is utopian, if you ask me.

However, what is essential, and what too few people talk about, is the open license of the research output. If you cannot share research output without paying again and again (instead of once), we inhibit innovation. If I cannot share literature with students, I cannot properly train them for their job.

So, it feels kinda awkward that I am considered to be doing something wrong, if I ensure my work is available under a CC-BY license. Check my fail rate at ImpactStory (e.g. a series of poster abstracts in Tox Sci).

Anyway, there are two topics I want to clarify. First, APC should be as low as possible. That means the infrastructure should be efficient, reducing the amount of work. Open infrastructures likely have an important role here. Why do we not have open source article submission platforms? Why don't we have open standard XML formats with matching editors so that we can submit articles in that format, rather than LaTeX or Word? Etc.

Every cent I spend on APC, I cannot spend on other research tasks. One obvious answer then, IMHO, is to return to publishing less in journals, and sharing more via other, better channels, such as open databases. I find it hard to reconcile complaining about the cost of publishing with insisting on expensive business models.

So, I wondered what the APCs are for CC-BY publishing in the journals I published in. And I started adding this data to Wikidata (#opennotebookscience), with a zero APC:

I did not always pay this. There are reductions, sometimes a co-author pays, etc. But I have no problem paying for services rendered. And when I paid, it was always part of my job, and my employer (or project) paid. Now, there are rumors that scholars sometimes have to pay on their own account, as if it is a representation cost. I'm appalled by this. I think the employers are bullying their scholars in an unacceptable way. There was a lot of discussion about academic freedom, but your employer forcing you this way into publishing in certain journals sounds like an example of violating it. We can discuss who is responsible for this: the funder or the employer. I know my answer.

Scholarly societies
Two other aspects in the discussions are "what about poor countries" and "what about scholarly societies". I like to combine these. I welcome scholarly societies to pick up knowledge dissemination, in an open science way. I wish all scholarly societies would do that. But I am not sure why that necessarily has to be coupled to sponsoring society activities. That feels particularly awkward given that we tend to have national societies. Why?

Why should an African scholar have to fund educational activities held in the United States or Europe via publishing in their journals? What is wrong with me paying a scholarly society APC so that everyone in the world can read my literature? What is wrong with wanting them to have access to all literature?

What is wrong with me wanting to be able to read all literature? Despite The Netherlands not being a poor country, Maastricht University is far from a rich university, and I regularly run into paywalls myself.

Yes, asking the Global South, or anyone (like a small SME) to pay 5000 euro is a lot (hell, for me it is; I'm happy that that is rare). Most publishers are not doing that. There is price differentiation and the Global South doesn't pay the European prices (tho publishers must do better in being transparent about this), which in response, some see as patronizing or even colonial (dividing the world in economic zones is quite common; is it unethical? well, there are more aspects of our economic systems I am not happy about).

I think the bigger problem is why Western scholars (the Global North?) are not publishing in journals published in/by the Global South. Why is that?

If we want a scholarly community to be internationally inclusive, why do we still have national scholarly societies? Maybe we can stop with that, please? What if I was not a member of the Dutch chemical society, KNCV, but a member of the Chemical Society, a scholarly society independent from continent or country?

Now, I am happy to see others are thinking in this direction too. For example, the Metabolomics Society takes this approach and a growing group of universities is rebooting the idea of a university publisher, but not limited to one university or even country (e.g. University Journals, HT Jeroen and Erik).

Because if we keep insisting on publishing in Global North (or western-led) journals (e.g. journals of Global North societies), I think we have a bigger problem than APCs, with respect to the North/South divide (and there certainly is a problem!).

I'm looking forward to reading your thoughts on how we can really reform open science knowledge dissemination.

Monday, June 10, 2019

Preprint servers. Why?

Recent preprints from researchers in the BiGCaT group.
Henry Rzepa asked us the following recently: ChemRxiv. Why? I strongly recommend reading his pondering. I agree with a number of his points, particularly the following one. To follow the narrative of the joke: "how many article versions does it take for knowledge to disseminate?", the answer sometimes seems to be: "at least three, to make enough money off the system".

Now, I tend to believe preprints are a good thing (see also my interview in Preprintservers, doen of niet?, C2W, 2016. PDF): release soon, release often has served open science well. In that sense, a preprint can be like that: a form of open notebook science.

However, just as we suffer from data dumps for open source software, we see exactly the same with (open access) publishing now. Is the paper ready to be submitted for peer review? Oh, let's quickly put it on a preprint server. I very much agree with Henry that the last thing we are waiting for is a third version of a published article. This is what worries me a great deal in the "green Open Access" discussion.

But it can be different. For example, people in our BiGCaT group actually are building up a routine of posting papers just before conferences. Then the oral presentation gives a layman's outline of the work, and if people want to really understand what the work is about, they can read the full paper. Of course, with the note that a manuscript may actually not be sufficient for that, so the preprint should support open science.

But importantly, a preprint is not a replacement for a proper CC-BY-licensed version of record (VoR). If the consensus is that that is what preprints are about, then I'm no longer a fan.

Tuesday, May 21, 2019

Scholia: an open source platform around open data

Some 2.5 years ago Finn Nielsen started Scholia. I have been blogging about it a few times, and thanks to Finn, Lane Rasberry, and Daniel Mietchen, we were awarded a grant by the Alfred P. Sloan Foundation to continue working on it (grant: G-2019-11458). I'll tweet more about how it fits the infrastructure to support our core research lines, but for now just want to mention that we published the full proposal in RIO Journal.

Oh, just as a teaser and clickbait, here's one of the use cases: dissemination of knowledge of metabolites and chemicals in general (full poster):

Saturday, May 18, 2019

LIPID MAPS: mass spectra and species annotation from Wikidata

Part of the LIPID MAPS classification scheme in Wikidata (try it).
A bit over a week ago I attended the LIPID MAPS Working Group meeting in Cambridge, as I became a member of Working Group 2: Tools and Technical Committee in autumn. That followed a fruitful effort by Eoin Fahy to make several LIPID MAPS pathways available in WikiPathways (see this Lipids Portal), e.g. the Omega-3/Omega-6 FA synthesis pathway. It was a great pleasure to attend the meeting and meet everyone, and I learned a lot about the internals of the LIPID MAPS project.

I showed them how we contribute to WikiPathways, particularly in the area of lipids. Denise Slenter and I have been working on having more identifier mappings in Wikidata, among which the lipids. Some results of that work were part of this presentation. One of the nice things about Wikidata is that you can make live Venn diagrams, e.g. compounds in LIPID MAPS for which Wikidata also has a statement about which species it is found in (try it):

SELECT ?lipid ?lipidLabel ?lmid ?species ?speciesLabel
       ?source ?sourceLabel ?doi
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         p:P703 ?speciesStatement .
  ?speciesStatement prov:wasDerivedFrom/pr:P248 ?source ;
                    ps:P703 ?species .
  OPTIONAL { ?source wdt:P356 ?doi }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

A second query searches lipids for which also mass spectra are found in MassBank (try it):

SELECT
  ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) as ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
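
If you prefer to run such a query from a script instead of the web interface, something like the following Python sketch may work. It is only an illustration under a few assumptions: that the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql accepts the query as a GET parameter and returns the standard SPARQL JSON results format, that a plain "en" label language is acceptable (as far as I know the [AUTO_LANGUAGE] placeholder is filled in by the web interface), and the names build_request, rows, and run as well as the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# The MassBank query, written out in full; LIMIT added to keep the example small.
QUERY = """
SELECT ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) AS ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
LIMIT 5
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        # Placeholder; replace with something that identifies your own script.
        "User-Agent": "example-lipid-query/0.1 (you@example.org)",
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(endpoint + "?" + params, headers=headers)


def rows(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run(query):
    """Send the query (needs network access) and return the flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return rows(json.load(response))
```

Calling run(QUERY) should then give a list of dictionaries with the keys lipid, lipidLabel, lmid, and massbank, one per lipid. The first query can be passed to run() in the same way.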


Saturday, May 04, 2019

Wikidata, CompTox Chemistry Dashboard, and the DSSTOX substance identifier

The US EPA published a paper recently about the CompTox Chemistry Dashboard (doi:10.1186/s13321-017-0247-6). Some time ago I worked with Antony Williams and we proposed a matching Wikidata identifier. When it was accepted, I used an InChIKey-DSSTOX identifier mapping data set by Antony (doi:10.6084/M9.FIGSHARE.3578313.V1) to populate Wikidata with links. Over time, when more InChIKeys were found in Wikidata, I would use this script to add additional mappings. That resulted in this growth graph:
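
To give an idea of what such an incremental update involves, here is a small Python sketch. It is not the script linked above, just a hypothetical illustration: it assumes you already have the InChIKey-to-DSSTOX mapping, a lookup from InChIKey to Wikidata item, and the set of items that already carry the identifier, and it writes tab-separated QuickStatements lines for the missing links only. The property number P3117 is what I believe the DSSTox substance identifier property to be, so please verify it before use; the function name and all data below are made-up placeholders.

```python
# Assumption to verify: P3117 is the Wikidata property for the DSSTox substance ID.
DSSTOX_PROPERTY = "P3117"


def new_mapping_statements(inchikey_to_dtxsid, inchikey_to_qid, already_linked):
    """Return QuickStatements (tab-separated) lines for links not yet in Wikidata."""
    statements = []
    for inchikey in sorted(inchikey_to_dtxsid):
        qid = inchikey_to_qid.get(inchikey)
        if qid is None:
            continue  # this compound has no Wikidata item (yet)
        if qid in already_linked:
            continue  # this item already has a DSSTOX identifier
        dtxsid = inchikey_to_dtxsid[inchikey]
        # String values are wrapped in double quotes in QuickStatements.
        statements.append(f'{qid}\t{DSSTOX_PROPERTY}\t"{dtxsid}"')
    return statements


# Made-up placeholder data, for illustration only (not real identifiers).
mapping = {"KEY-A": "DTXSID_A", "KEY-B": "DTXSID_B", "KEY-C": "DTXSID_C"}
items = {"KEY-A": "Q_A", "KEY-B": "Q_B"}  # KEY-C has no Wikidata item yet
linked = {"Q_B"}  # Q_B was already linked in an earlier run

for line in new_mapping_statements(mapping, items, linked):
    print(line)
```

Rerunning this whenever new InChIKeys show up in Wikidata only emits statements for items that were not linked before, which is the kind of step-wise growth the graph shows.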

Source: Wikidata.
Now, about a week ago Antony informed me he worked with someone from Wikipedia to have the DSSTOX identifier automatically show up in the ChemBox, which I find awesome. It's cool to see your work on about 38 thousand (!) Wikipedia pages :)
Part of the ChemBox of perfluorooctanoic acid.
(I'm making the assumption that all 38 thousand Wikidata pages for chemicals have Wikipedia equivalents, which may be a false assumption.)