Saturday, July 13, 2019

Standing on the shoulders: but the shoulders are 200 years old

"Houston, we have a problem. We're standing on the shoulders of old scholars, but it feels a bit shaky."

Well, no wonder. While rocket science has clear foundations, the physical laws of nature, for many other research fields it's trickier. We rely on hundreds of years of knowledge and assume (not trust) that work to be true. And that knowledge is seemingly disappearing very fast (remember my graveyard of chemical literature observation). Published literature, generally, is too hard to reproduce to be seen as an accurate capture of research history. In other words, these shoulders are 200 years old, and our support is failing. 

Open Science attempts to overcome these issues. It defines an environment where all research output is important, where everyone has access to shoulders, and trust can be replaced by reproducibility. This is a huge transition, ongoing for some 20 years now.

With my work as one of the two Editors-in-Chief of the Journal of Cheminformatics, I try to contribute to making this happen, sooner rather than later. It's not been an easy ride, and there is so much left to do. And I do not always agree with the effort put in by Springer Nature here, as is clear from this reply.

Figure 1 from the latest editorial.
But I am happy to work with Rajarshi, Nina, Matthew, and Samuel to support the Open Science community in chemistry, for example, by allowing publications that describe a piece of open source cheminformatics software (Software article type). We're limited by what BioMedCentral can offer us, but within that context try to make a change.

The journal has now existed for 10 years, as marked by our latest editorial. In it we describe our adoption of GitHub as a free, extra service, where we fork source code published in our journal, and announce that an ORCID is now obligatory for all authors.

These things bring me back to those shoulders. The full adoption of the ORCID allows research to be more easily found (more FAIR) and the copying of the source code aims at making the shoulders on which future cheminformatics stands more solid. Minor steps. But even minor steps matter.

Let's see where our journal takes open science cheminformatics.

Oh, and since you are reading this, I would love to see the American Chemical Society be more open to Open Science too. Please join me in requesting them to join the Initiative for Open Citations.

Saturday, June 22, 2019

Bacting: Bioclipse from the command line

Source. Wikipedia. Public Domain.
Because more and more of the cheminformatics I do is with Bioclipse scripts (see doi:10.1186/1471-2105-10-397), and because Bioclipse is currently unmaintained and has become hard to install, I decided to take the plunge and rewrite some stuff so that I could run the scripts from the command line. I wrote up the first release back in April.

Today, I release Bacting 0.0.5 (doi:10.5281/zenodo.3252486) which is the first release you can download from one of the main Maven repositories. I'm still far from a Maven or Grapes expert, but at least you can use Bacting now like this without actually having to download and compile the source code locally first:


workspaceRoot = "."
cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

println cdk.fromSMILES("CCO")

If you have been using Bacting before, then please note the change in groupId. If you want to check out all functionality, have a look at the changelogs of the releases.

If you want to cite Bacting, please cite the Bioclipse 2 paper and for the version release, follow the instructions on Zenodo. Pending an article. The Journal of Open Source Software? Sounds like a good idea!

Sunday, June 16, 2019

National scholarly societies. Why?

Plan S has caused quite some discussion about what knowledge dissemination is. When it was announced, I was hesitant. But very quickly the opposition to Plan S convinced me that apparently something like Plan S is needed. I think Plan S focuses way too much on journal-channeled publishing, whereas I would rather have seen it focus on Open Science (it partly does). We argued as much with cOAlition S recently (doi:10.5281/zenodo.2560200):

The risks brought forward by Plan S opponents are real. I don't always agree with the arguments, or simply just don't understand them. With some I agree, but disagree on the alternative. This has been a difficult position to follow, as some discussions taught me. For example, some claimed that I am in favor of article processing costs. Only in a toxic, black-and-white world does not being against them equal being in favor of them.

Journal articles have proven to be an expensive exercise in knowledge dissemination. It was the right solution, certainly 200 years ago. The cost has to be paid by someone. Via subscription (the "old" model), via package deals with nationals, universities, etc (upcoming), via a friendly funder (some wealthy foundation), or via the authors. Not accepting that publishing costs money is utopian, if you ask me.

However, what is essential, and what too few people talk about, is the open license of the research output. If you cannot share research output without paying again and again (instead of once), we inhibit innovation. If I cannot share literature with students, I cannot properly train them for their job.

So, it feels kinda awkward that I am considered to be doing something wrong if I ensure my work is available under a CC-BY license. Check my fail rate at ImpactStory (e.g. a series of poster abstracts in Tox Sci).

Anyway, there are two topics I want to clarify. First, APCs should be as low as possible. That means the infrastructure should be efficient, reducing the amount of work. Open infrastructures likely have an important role here. Why do we not have open source article submission platforms? Why don't we have open standard XML formats with matching editors so that we can submit articles in that format, rather than LaTeX or Word? Etc.

Every cent I spend on APCs, I cannot spend on other research tasks. One obvious answer then, IMHO, is to return to publishing less in journals, and sharing more via other, better channels, such as open databases. I find it hard to reconcile complaining about the cost of publishing with insisting on expensive business models.

So, I wondered what the APCs are of CC-BY publishing of the journals I published in. And I started adding this data to Wikidata (#opennotebookscience), with a zero APC:

I did not always pay this. There are reductions, sometimes a co-author pays, etc. But I have no problem paying for services rendered. And when I paid, it was always part of my job, and my employer (or project) paid. Now, there are rumors that scholars sometimes have to pay on their own account, as if it is a representation cost. I'm appalled by this. I think the employers are bullying their scholars in an unacceptable way. There was a lot of discussion about academic freedom, but your employer forcing you this way into publishing in certain journals sounds like an infringement of exactly that. We can discuss who is responsible for this: the funder or the employer. I know my answer.

Scholarly societies
Two other aspects in the discussions are "what about poor countries" and "what about scholarly societies". I like to combine these. I welcome scholarly societies to pick up knowledge dissemination, in an open science way. I wish all scholarly societies would do that. But I am not sure why that necessarily has to be coupled to sponsoring society activities. That particularly feels awkward given that we tend to have national societies. Why?

Why should an African scholar have to fund educational activities held in the United States or Europe via publishing in their journals? What is wrong with me paying a scholarly society APC so that everyone in the world can read my literature? What is wrong with wanting them to have access to all literature?

What is wrong with me wanting to be able to read all literature? Despite The Netherlands not being a poor country, Maastricht University is far from a rich university, and I regularly run into paywalls myself.

Yes, asking the Global South, or anyone (like a small SME) to pay 5000 euro is a lot (hell, for me it is; I'm happy that that is rare). Most publishers are not doing that. There is price differentiation and the Global South doesn't pay the European prices (tho publishers must do better in being transparent about this), which in response, some see as patronizing or even colonial (dividing the world in economic zones is quite common; is it unethical? well, there are more aspects of our economic systems I am not happy about).

I think the bigger problem is why Western scholars (the Global North?) are not publishing in journals published in/by the Global South. Why is that?

If we want a scholarly community to be internationally inclusive, why do we still have national scholarly societies? Maybe we can stop with that, please? What if I was not a member of the Dutch chemical society, KNCV, but a member of the Chemical Society, a scholarly society independent of continent or country?

Now, I am happy to see others are thinking in this direction too. For example, the Metabolomics Society takes this approach and a growing group of universities is rebooting the idea of a university publisher, but not limited to one university or even country (e.g. University Journals, HT Jeroen and Erik).

Because if we keep insisting on publishing in Global North (or western-led) journals (e.g. journals of Global North societies), I think we have a bigger problem than APCs, with respect to the North/South divide (and there certainly is a problem!).

I'm looking forward to reading your thoughts on how we can really reform open science knowledge dissemination.

Monday, June 10, 2019

Preprint servers. Why?

Recent preprints from researchers
in the BiGCaT group.
Henry Rzepa asked us the following recently: ChemRxiv. Why? I strongly recommend reading his ponderings. I agree with a number of them, particularly the following point. To follow the narrative of the joke: "how many article versions does it take for knowledge to disseminate?", the answer sometimes seems to be: "at least three, to make enough money off the system".

Now, I tend to believe preprints are a good thing (see also my interview in Preprintservers, doen of niet?, C2W, 2016. PDF): release soon, release often has served open science well. In that sense, a preprint can be like that: a form of open notebook science.

However, just as we suffer from data dumps for open source software, we see exactly the same with (open access) publishing now. Is the paper ready to be submitted for peer review? Oh, let's quickly put it on a preprint server. I very much agree with Henry that the last thing we are waiting for is a third version of a published article. This is what worries me a great deal in the "green Open Access" discussion.

But it can be different. For example, people in our BiGCaT group actually are building up a routine of posting papers just before conferences. Then the oral presentation gives a layman's outline of the work, and if people want to really understand what the work is about, they can read the full paper. Of course, with the note that a manuscript may actually not be sufficient for that, so the preprint should support open science.

But importantly, a preprint is not a replacement for a proper CC-BY-licensed version of record (VoR). If the consensus is that that is what preprints are about, then I'm no longer a fan.

Tuesday, May 21, 2019

Scholia: an open source platform around open data

Some 2.5 years ago Finn Nielsen started Scholia. I have been blogging about it a few times, and thanks to Finn, Lane Rasberry, and Daniel Mietchen, we were awarded a grant by the Alfred P. Sloan Foundation to continue working on it (grant: G-2019-11458). I'll tweet more about how it fits the infrastructure to support our core research lines, but for now just want to mention that we published the full proposal in RIO Journal.

Oh, just as a teaser and clickbait, here's one of the use cases: dissemination of knowledge of metabolites and chemicals in general (full poster):

Saturday, May 18, 2019

LIPID MAPS: mass spectra and species annotation from Wikidata

Part of the LIPID MAPS classification
scheme in Wikidata (try it).
A bit over a week ago I attended the LIPID MAPS Working Group meeting in Cambridge, as I became a member of Working Group 2: Tools and Technical Committee in autumn. That followed a fruitful effort by Eoin Fahy to make several LIPID MAPS pathways available in WikiPathways (see this Lipids Portal), e.g. the Omega-3/Omega-6 FA synthesis pathway. It was a great pleasure to attend the meeting and meet everyone, and I learned a lot about the internals of the LIPID MAPS project.

I showed them how we contribute to WikiPathways, particularly in the area of lipids. Denise Slenter and I have been working on having more identifier mappings in Wikidata, among which those for lipids. Some results of that work were part of this presentation. One of the nice things about Wikidata is that you can make live Venn diagrams, e.g. compounds in LIPID MAPS for which Wikidata also has a statement about which species it is found in (try it):

SELECT ?lipid ?lipidLabel ?lmid ?species ?speciesLabel
            ?source ?sourceLabel ?doi
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         p:P703 ?speciesStatement .
    ?speciesStatement prov:wasDerivedFrom/pr:P248 ?source ;
                      ps:P703 ?species .
    OPTIONAL { ?source wdt:P356 ?doi }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

A second query searches for lipids for which mass spectra are also found in MassBank (try it):

SELECT
  ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) as ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
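The "try it" links run these queries in the browser, but the same query can also be run from a script. The following is a minimal Python sketch of that, not how the queries above were produced: it assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout, and the helper names (build_query_url, extract_rows) are made up for this illustration.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint (assumption: unchanged since writing).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Lipids with a LIPID MAPS ID (P2063) that also have MassBank accessions (P6689).
MASSBANK_QUERY = """
SELECT ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) AS ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
"""


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql.strip(), "format": "json"})


def extract_rows(results):
    """Flatten SPARQL JSON results into plain dicts of variable -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]
```

To actually fetch data, open build_query_url(MASSBANK_QUERY) with urllib.request (sending a descriptive User-Agent header), parse the response with json.load, and pass the result to extract_rows.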


Saturday, May 04, 2019

Wikidata, CompTox Chemistry Dashboard, and the DSSTOX substance identifier

The US EPA published a paper recently about the CompTox Chemistry Dashboard (doi:10.1186/s13321-017-0247-6). Some time ago I worked with Antony Williams and we proposed a matching Wikidata identifier. When it was accepted, I used an InChIKey-DSSTOX identifier mapping data set by Antony (doi:10.6084/M9.FIGSHARE.3578313.V1) to populate Wikidata with links. Over time, when more InChIKeys were found in Wikidata, I would use this script to add additional mappings. That resulted in this growth graph:

Source: Wikidata.
Now, about a week ago Antony informed me he worked with someone at Wikipedia to have the DSSTOX identifier automatically show up in the ChemBox, which I find awesome. It's cool to see your work on about 38 thousand (!) Wikipedia pages :)
Part of the ChemBox of perfluorooctanoic acid.
(I'm making the assumption that all 38 thousand Wikidata pages for chemicals have Wikipedia equivalents, which may be a false assumption.)
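The mapping step described in this post (take an InChIKey-to-DSSTOX table, keep the entries for which Wikidata already has an item with that InChIKey, and emit statements for them) can be sketched in a few lines of Python. This is an illustration under assumptions, not the script linked above: the function name is made up, the output is assumed to be tab-separated QuickStatements (version 1) lines, and the Wikidata property ID is passed in as an argument rather than hard-coded.

```python
def quickstatements_for_mapping(inchikey_to_dtxsid, inchikey_to_qid, property_id):
    """Emit one tab-separated QuickStatements line per compound that has both
    an external identifier in the mapping and an item in Wikidata."""
    lines = []
    for inchikey in sorted(inchikey_to_dtxsid):
        qid = inchikey_to_qid.get(inchikey)
        if qid is None:
            # No Wikidata item with this InChIKey yet; a later run may find one.
            continue
        dtxsid = inchikey_to_dtxsid[inchikey]
        # String and external-id values are double-quoted in QuickStatements.
        lines.append(f'{qid}\t{property_id}\t"{dtxsid}"')
    return lines
```

Rerunning such a function whenever new InChIKeys appear in Wikidata is what makes the number of mapped identifiers grow over time.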

Wednesday, April 24, 2019

Open Notebook Science: the version control approach

Jean-Claude Bradley pitched the idea of Open Notebook Science, or Open-notebook science as the proper spelling seems to be. I have used notebooks a lot, but ever since I went digital, the use went down. During my PhD studies I still extensively used them. But in the process, I changed my approach. Influenced by open source practices.

After all, open source has had a long history of version control, where commit messages explain the reason why some change was made. And people that ever looked at my commits, know that my commits tend to be small. And know that my messages describe the purpose of some commit.

That is my open notebook. It is essential to record why a certain change was made and what exactly that change was. Trivial with version control. Mind you, version control is not limited to source code. Using the right approaches, data and writing can easily be tracked with version control too. Just check, for example, my GitHub profile. You will find journal articles being written and data being collected, just as if they were equal research outputs (they are).

Another great example of version control for writing and data is provided by Wikipedia and Wikidata. Now, some changes I found hard to track there: when I ask the SourceMD tool (great work by Magnus Manske) to create items for books, I want to see the changes made. The tool did link to the revisions made at some point, but this service integration seems to break down now and then. Then I realized that I could use the EditGroups tool directly (HT to whoever wrote that), and found this specific page for my edits, which includes not just those via SourceMD but also all edits I made via QuickStatements (also by Magnus):

If only I could give a "commit message" with each QuickStatements job I run. Can I?

Saturday, April 13, 2019

Bioclipse on the command line

Screenshot of Bioclipse 2.
Over the past seven years there has been a lot of chemistry interoperability work I have done using Bioclipse (doi:10.1186/1471-2105-8-59, doi:10.1186/1471-2105-10-397). The code is based on Eclipse, which gives a great GUI experience, but also turned out to be hard to maintain. Possibly, that was because of a second neat feature: you could plug libraries into the Python, JavaScript, and Groovy scripting environment, which allows people to automate things in Bioclipse. Over the course of time, many libraries have been integrated, putting many scientific toolkits at the tips of your fingers. Of the three programming languages, I have used Groovy the most, it being close to the Java language, but with a lot of syntactic goodies.

In fact, I have blogged about the scripts I wrote on many occasions and in 2015 I wrote up a few blog posts on how to install new extensions:

But publishing and installing new Bioclipse 2.6.2 extensions remained complicated (installing Bioclipse itself is quite trivial). And that while the scripts are so useful, and I need others to start using them. I do not scale. Second, when I cite these scripts, they were too hard to use by reviewers and readers. To get some idea of a small subset of the functionality, read our book A lot of Bioclipse Scripting Language examples.

So, last x-mas I set out with the wish to have others run my scripts much more easily and, second, to be able to run them from the command line. To achieve that, installing and particularly publishing Bioclipse extensions had to become much easier. Maybe as easy as Groovy just Grab-bing the dependencies from the script itself. So, Bioclipse available from Maven Central, or so.

Of course, this approach would likely lose a lot of wonderful functionality, like the graphical UX, the plugin system, the language injection, and likely more. So, one important requirement was that any script using the command line must be identical to the script in Bioclipse itself. Well, with a few permissible exceptions: we are allowed to inject the Bioclipse managers manually.

Well, of course, I would not have been blogging this had I not succeeded to reach these goals in some way. Indeed, following up from a wonderful metaRbolomics meeting organized by de.NBI (~ ELIXIR Germany), and the powerful plans discussed with Emma Schymanski (and some ongoing work of persistent toxicants), and, fairly, actually not drowning in failed deadlines, just regularly way behind deadlines, and since I have a research line to run, I dived into hackmode. In some 14 hours, mostly in the evening hours of the past two days, I got a proof of principle up and running. The name is a reference to all the wonderful linguistic fun we had when I worked in Uppsala, thanks to Carl Mäsak, e.g. discussing the term Bioclipse Scripting Language and Perl 6.

It is not available yet from Maven Central, so there is a manual mvn clean install involved at this moment, but after that (the command installs it in your local Maven repository which will be recognized by Groovy), you can get started with something like (I marked in blue the extra sugar needed on the command line; the black code runs as is in Bioclipse 2.6.2):


workspaceRoot = "."
def cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

list = cdk.createMoleculeList()
println list
println cdk.fromSMILES("COC")

What now?
In the future, once it is available on Maven Central, you will be able to skip the local install command, and @Grab will just fetch things from that online repository. I will be tagging version 0.0.1 today, as I got my important script running that takes one or more SMILES strings, checks Wikidata, and makes QuickStatements to add missing chemicals. The first time you've (maybe) seen that, was three years ago, in this blog post.

You may wonder: why?? I asked myself the same thing, but there are a few things over the past 24 hours that I could answer and which may sketch where this is going:

  1. that BSL book can actually show running the code and show the output in the book, just like with my CDK book;
  2. maybe we can use Bioclipse managers in Nextflow;
  3. Bioclipse offers interoperability layers, allowing me to pass a chemical structure from one Java library to another (e.g. from the CDK to Jmol to JOELib);
  4. it allows me to update library versions without having to rebuild a full new Bioclipse stack (I'm already technically unable, let alone timewise unable);
  5. I can start sharing Bioclipse scripts with articles that people can actually run; and,
  6. all scripts are compatible, and all extensions I make can be easily copied into the main Bioclipse repository, if there ever will be a next major Bioclipse version (seems unlikely now).

Now, it's just being patient and migrating manager by manager. It may be possible to use the existing manager code, but that comes with so much language injection, that I decided to just take advantage of Open Science and just copy/paste the code. Most of the code is the same, minus progress monitors, and replacing Eclipse IFile code with regular Java code. But there are tons of managers, and reaching even 50% coverage will take, at the speed I can offer, months. Therefore, I'll focus on scripts I share with others, focus on reuse and reproducibility.

More soon!

Sunday, April 07, 2019

History of the term Open Science #1: the early days

Screenshot of the Open Science History group
on CiteULike.
Open Science has been around for some time. Before Copyright became a thing, knowledge dissemination was mostly limited by how easily you could get knowledge from one place to another. The introduction of Copyright changed this. The question was no longer how to get people to know the new knowledge, but how to get people to pay for new knowledge. One misconception, for example, is that publishing is a free market. Yes, you can argue that you can publish anywhere you like (theoretically, at least, but reality says otherwise), but the monopoly is in getting access: for every new fact (and republishing the same fact is a faux pas), there is exactly one provider of that fact.

Slowly this is changing, but only slowly. What this really needs, is open licenses, just like open source licenses. Licenses that allow fixing typos, allow resharing with your students, etc.

But contrary to what has been prevalent in the Plan S discussion, these ideas are not new. And people have been trying Open Science for more than two decades already.

I have been trying to dig up the oldest references (ongoing effort) of the term Open Science (in the current meaning), and had a CiteULike group for that. But CiteULike is shutting down, so I will blog the references I found, and add some context.

A first article to mention is this 1998 article that mentions Open Science: Common Agency Contracting and the Emergence of "Open Science" Institutions The American Economic Review, Vol. 88, No. 2. (May 1998), pp. 15-21 by Paul A. David. Worth reading, but does require reading some of the cited literature.

The following two magazine articles took the term Open Science to a wider public, in reply to a conference held at Brookhaven National Laboratory:

I would also like to note that the website by Dan Gezelter went online in the late nineties already, which I have used in various of my source code projects, and, of course, also has been used by the Chemistry Development Kit from the start.

Wednesday, April 03, 2019

BioSchemas CreativeWork annotation in Bioconductor Vignettes

Since the EU BioHackathon in Paris last year, I've picked up Bioschemas stuff more extensively, to help the ELIXIR Metabolomics and Toxicology (in development) communities get their stuff more FAIR. We could annotate training material already (see this ELIXIR NL post), but a big boon was annotation of vignettes on Bioconductor. Over the past 2-3 months I have been exploring this, and on Monday at the #metaRbolomics meeting in Wittenberg, with a room full of R users, I got the right pointers to a promising lead.

So, because the vignettes are generated differently than Markdown on GitHub, I had to find the right hooks. In the end, I found these in one vignette adding a Google Analytics tracker in the header of the file. Bingo!

Screenshot of Google's Structured Data Testing Tool.
So, here's how to do it. The R package set up allows adding custom HTML to the generated HTML, either in the header or at the start or end of the body. I went for the header. But I had to wait two days for the Bioconductor website to make a new BridgeDbR package binary (1st day), and for it to update the website (2nd day).

The HTML snippet (saved as bioschemas.html) to add is basically a <script> element with a fragment of JSON-LD:

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "CreativeWork",
  "about":"This tutorial describes how to use the BridgeDbR package for identifier mapping.",
  "name":"BridgeDbR Tutorial",
  "author": [{ "@type": "Person", "name":"Egon Willighagen" }],
  "difficultyLevel": "beginner",
  "keywords":"ELIXIR RIR, BridgeDb"
}
</script>

The other half of the story is to instruct the HTML generation pipeline to add it, which is done with this bit of YAML in your Markdown file (part of it you should already have):

    toc_float: true
    includes:
      in_header: bioschemas.html

Check the full folder here.
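If you maintain several vignettes, writing bioschemas.html by hand gets tedious. Below is a rough Python sketch that generates such a snippet. It is not part of the BridgeDbR set up: the function name is made up, and the @context, @type, and author structure are my assumptions about a minimal schema.org CreativeWork annotation, built around the fields shown above.

```python
import json


def bioschemas_snippet(name, about, author, keywords, difficulty="beginner"):
    """Return a <script> element holding a CreativeWork annotation as JSON-LD."""
    annotation = {
        "@context": "https://schema.org/",  # assumed context URL
        "@type": "CreativeWork",
        "about": about,
        "name": name,
        "author": [{"@type": "Person", "name": author}],
        "difficultyLevel": difficulty,
        "keywords": ", ".join(keywords),
    }
    body = json.dumps(annotation, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>\n'
```

Writing the returned string to bioschemas.html next to the vignette source is then enough for the in_header setting to pick it up. Using json.dumps rather than string concatenation keeps the JSON well-formed (balanced braces, no trailing commas).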

Saturday, March 30, 2019

What metabolites are found in which species? Nanopublications from Wikidata

In December I reported about Groovy code to create nanopublications. This has been running for some time now, extracting nanopubs that assert that some metabolite is found in some species. I send the resulting nanopubs to Tobias Kuhn, to populate his Growing Resource of Provenance-Centric Scientific Linked Data (doi:10.1109/eScience.2018.00024, PDF).

Each data set comes with an index pointing to the individual nanopubs, and that looks like this:

I wonder what options I have to archive the full set of nanopublications on Figshare or Zenodo, and see that DOI show up here...

Tuesday, March 05, 2019

New paper: "Beyond Pathway Analysis: Identification of Active Subnetworks in Rett Syndrome"

Figure 4 of the article.
Ryan Miller and Friederike Ehrhart worked together on this paper on furthering our understanding of Rett syndrome (doi:10.3389/fgene.2019.00059). They looked at the following: our biological pathways are social constructs that help us think and talk about the biology in our body (and other animals, plants, etc, of course). What if we ignore the boundaries of those constructs, can we learn the pathways? Turns out, sort of.

Using PathVisio, WikiPathways, and Cytoscape's jActiveModules they developed new modules that capture a particular aspect of the biology, and, as usual, color the transcriptional change on top of that. The Methods is richly annotated and all stuff is open source.

The authors conclude with mostly bioinformatics conclusions. No shocking new insights into Rett syndrome (yet, unfortunately), but they indicate that by taking advantage of our interoperability approaches (e.g. the ELIXIR Recommended Interoperability Resource BridgeDb, using mappings from Ensembl, HMDB, ChEBI, and Wikidata) pathway resources can be integrated, allowing these approaches.

Mind you, for each pathway, and regularly down to the gene, metabolite, and interaction level, the information is not just built in collaboration with research communities and curated, but also backed by literature: 22494 unique PubMed references, of which almost 4000 are unique to WikiPathways (i.e. not in Reactome).

Have fun!

Monday, March 04, 2019

New projects: RiskGONE and NanoSolveIT

This January two new Horizon 2020 projects started for me: RiskGONE and NanoSolveIT. It kept me busy in the past few weeks, with the kick-off meeting of the latter last week in Athens. Both continue on previous work of the EU NanoSafety Cluster, and I'm excited to continue with research done during the eNanoMapper project.

NanoSolveIT "aspires to introduce a ground-breaking in silico Integrated Approach to Testing and Assessment (IATA) for the environmental health and safety of Nanomaterials (NM), implemented through a decision support system packaged as both a stand-alone open software and via a Cloud platform."

I will be involved here in the knowledge infrastructure. Plenty of research there to be done around the representation of chemical composition of the nanomaterials, the structuring and consistency of ontologies to capture and integrate everything, how to capture our knowledge around the adverse outcome pathways, and how to use this all in predictive computation.

"The focus of RiskGONE will be to produce nano-specific draft guidance documents for application to ENM RA; or, alternatively, to suggest ameliorations to OECD, ECHA, and ISO/CEN SOPs or guidelines. Rather than producing assays and methods ex novo, this will be achieved through Round Robin exercises and multimodal testing of OECD TGs and ECHA methods supporting the “Malta-project”, and on methods not yet considered by OECD." (from the CORDIS website)

Here our involvement will be around similar topics.

Oh, and like in all new H2020 projects, FAIR and Open Data are central words.

Sunday, February 17, 2019

Browsing like it's 1990

Ruben Verborgh pointed me to this nice CERN side project ("my particles are colliding"): browsing like it's 1990. This is what WikiPathways would have looked like back then:

Comparing Research Journals Quality #2: FAIR metrics of journals

Henry Rzepa pointed me to this helpful CrossRef tool that shows publisher and journal level metrics for FAIRness (see also this post):
FAIR metrics for the Journal of Cheminformatics.
The Journal of Cheminformatics is doing generally well. This is what FAIR metrics are about: they show you what you can improve. They show you how you can become a (better) open scientist. And our journal has a few attention points:
J. Cheminform. does not do well with sending these bits of information to CrossRef.
It's nice to see we already score well on ORCIDs and funder identifiers. I am not sure why the abstracts are not included, and text mining URLs could point to something useful too, I guess. The license URL sounds a bit redundant, since all articles are CC-BY, but downstream aggregators should not guess this from a journal name (or ISSN), and I'd welcome this proper annotation too.

Saturday, February 09, 2019

Comparing Research Journals Quality #1: FAIRness of journal articles

What a traditional research article
looks like. Nice layout, hard to
reuse the knowledge from.
Image: CC BY-SA 4.0.
After Plan S was proposed, there finally was a community-wide discussion on the future of publishing. Not everyone is clearly speaking out if they want open access or not, but there's a start for more. Plan S aims to reform the current model. (Interestingly, the argument that not a lot of journals are currently "compliant" is sort of the point of the Plan.) One thing it does not want to reform, is the quality of the good journals (at least, I have not seen that as one of the principles). There are many aspects to the quality of a research journal. There are also many things that disguise themselves as aspects of quality but are not. This series discusses quality of a journal. We skip the trivial ones, like peer review, for now, because I honestly do not believe that the cOAlition S funders want worse peer review.

We start with FAIRness (doi:10.1038/sdata.2016.18). This falls, if you like, under the category of added value. FAIRness does not change the validity of the conclusions of an article, it just improves the rigor of the knowledge dissemination. To me, a quality journal is one that takes knowledge dissemination seriously. All journals have a heritage of being printed on paper, and most journals have been very slow in adopting innovative approaches. So, let's put down some requirements for the journal of 2020.

First, about the article itself:

About findable

  • uses identifiers (DOI) at least at article level, but possibly also for figures and supplementary information
  • provides data of an article (including citations)
  • data is actively distributed (PubMed, Scopus, OpenCitations, etc)
  • maximizes findability by supporting probably more than one open standard
About accessible
  • data can be accessed using open standards (HTTP, etc)
  • data is archived (possibly replicated by others, like libraries)
About interoperable
  • data is using open standards (RDF, XML, etc)
  • data uses open ontologies (many open standards exist, see this preprint)
  • uses linked data approaches (e.g. for citations)
About reusable
  • data is as complete as possible
  • data is available under an Open Science compliant license
  • data uses modern and actively used community standards
Pretty straightforward. For author, title, journal, name, year, etc, most journals apply this. Of course, bigger publishers that invested in these aspects many moons ago can be compliant much more easily, because they already were.

Second, what about the content of the article? There we start seeing huge differences.

About findable
  • important concepts in the article are easily identified (e.g. with markup)
  • important concepts use (compact) identifiers
Here, the important concepts are entities like cities, genes, metabolites, species, etc, etc. But also reference data sets, software, cited articles, etc. Some journals only use keywords, some journals have policies about use of identifiers for genes and proteins. Using identifiers for data and software is rare, sadly.
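To make the idea of a (compact) identifier a bit more concrete, here is a minimal sketch. It assumes the common "prefix:accession" shape and an identifiers.org-style resolver. The regular expression is a simplification of my own, not an official grammar, and the function names are hypothetical:

```python
import re

# Assumed shape of a compact identifier: "prefix:accession", e.g. "taxonomy:9606"
COMPACT_ID = re.compile(r"^(?P<prefix>[a-z][a-z0-9._]*):(?P<accession>\S+)$")


def parse_compact_id(text):
    """Split a compact identifier into its prefix and accession."""
    match = COMPACT_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a compact identifier: {text!r}")
    return match.group("prefix"), match.group("accession")


def resolver_url(text, resolver="https://identifiers.org/"):
    """Build a resolvable URL; the resolver base is an assumption and can be swapped."""
    prefix, accession = parse_compact_id(text)
    return f"{resolver}{prefix}:{accession}"


print(resolver_url("taxonomy:9606"))
print(parse_compact_id("doi:10.1038/sdata.2016.18"))
```

A journal that marks up concepts this way gives downstream tools something to resolve, instead of a free-text keyword to guess from.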

About accessible
  • articles can be retrieved by concept identifiers (via open, free standards)
  • article-concept identifier links are archived
  • table and figure data is annotated with concept identifiers
  • table and figure data can be accessed in an automated way
Here we see a clear problem. Publishers have been actively fighting this for years, even today. Text miners and projects like Europe PMC are stepping in, but are severely hampered by copyright law and publishers not wishing to make exceptions.

About interoperable
  • concepts are described using common standards (many available)
  • table and figure data is available as something like CSV, RDF
Currently, the only serious standards used by the majority of (STM?) journals are MeSH terms for keywords and perhaps CrossRef XML for citations. Tables and figures are more than just graphical representations. Some journals are experimenting with this.
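What "table data available as something like CSV" could look like is sketched below. The table content, the column names, and the "example:" identifiers are all hypothetical. The point is only that a column with concept identifiers travels along with the values and that the table can be read back by a machine:

```python
import csv
import io

# Hypothetical table content, for illustration only
fields = ["compound", "compound_id", "value"]
rows = [
    {"compound": "compound A", "compound_id": "example:C1", "value": "1.2"},
    {"compound": "compound B", "compound_id": "example:C2", "value": "3.4"},
]


def table_to_csv(rows, fieldnames):
    """Serialize a list of row dictionaries as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_table(text):
    """Read CSV text back into a list of row dictionaries (all values are strings)."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


text = table_to_csv(rows, fields)
print(text)
```

Reading the text back with `csv_to_table` gives the same rows again, which is exactly what you cannot do with a table that only exists as pixels in a PDF.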

About reusable
  • the content of the article has a clear licence, Open Science compliant
  • the content is available using the relevant standards of today
This is hard. These community standards are a moving target. For example, how we name concepts changes over time. But also identifiers themselves change over time. But a journal can be specific and accurate, which ensures that even 50 years from now, the context of the content can be determined. Of course, with proper Open Science approaches, translation to then modern community standards is simplified.

There are tons of references I can give here. If you really like these ideas, I recommend:
  1. continue reading my blog with many, many pointers
  2. read (and maybe sign) our Open Science Feedback to the Guidance on the Implementation of Plan S (doi:10.5281/zenodo.2560200), of which many of these ideas are part

Tuesday, February 05, 2019

Plan S: Less publications, but more quality, more reusable? Yes, please.

If you look at opinions published in scholarly journals (RSS feed, if you like to keep up), then Plan S is all 'bout the money (as Meja already tried to warn us):

No one wants puppies to die. Similarly, no one wants journals to die. But maybe we should. Well, the journals, not the puppies. I don't know, but it does make sense to me (at this very moment):

The past few decades have seen a significant growth of journals. And before hybrid journals were introduced, publishers tended to start new journals, rather than make journals Open Access. At the same time, the number of articles too has gone up significantly. In fact, the flood of literature is drowning researchers and this problem has been discussed for years. But if we have too much literature, should we not aim for less literature? And do it better instead?

Over the past 13 years I have blogged on many occasions about how we can make journals more reusable. And many open scientists can quote you Linus: "given enough eyeballs, all bugs are shallow". In fact, just worded differently, any researcher will tell you exactly the same, which is why we do peer review.
But the problem here is the first two words: given enough.

What if we just started publishing half of what we do now? If we have an APC-business model, we have immediately halved(!) the publishing cost. We also save ourselves from a lot of peer-review work and from reading marginal articles.

And what if we just used the time we freed up for actually making knowledge dissemination better? Make journal articles actually machine readable, put some RDF in them? What if we could reuse supplementary information? What if we could ask our smartphone to compare the claims of one article with that of another, just like we compare two smartphones. Oh, they have more data, but theirs has a smaller error margin. Oh, they tried it at that temperature, which seems to work better than in that other paper.

I have blogged about this topic for more than a decade now. I don't want to wait another 15 years for journal publications to evolve. I want some serious activity. I want Open Science in our Open Access.

This is one of my personal motives for our Open Science Feedback to cOAlition S, and I am happy that 40 people joined in the past 36 hours, from 12 countries. Please have a read, and please share it with others. Let your social network know why the current publishing system needs serious improvement and that Open Science has had the answer for years now.

Help our push and show your support to cOAlition S to trigger exactly this push for better scholarly publishing:

Sunday, February 03, 2019

Plan S and the Open Science Community

Plan S is about Open Access. But Open Science is so much more and includes other aspects, like Open Data, Open Source, Open Standards. But just as publications have hijacked knowledge dissemination (think research assessment), we risk that Open Access is hijacking the Open Science ambition. If you find Open Science more important than Open Access, then this is for you.

cOAlition S is asking for feedback, and because I think Open Science is so much more, I want the Guidance on the Implementation of Plan S to have more attention for Open Science. On Wednesday I am submitting this Open Science Feedback on the Guidance on the Implementation of Plan S, outlining 10 points on how it can be improved to support Open Science better.

Please read the feedback document and if you agree, please join Jon Tennant and co-sign it using this form:

Wednesday, January 30, 2019

Plan S and the Preprint Servers

In no way did I mean to compare Plan S to the hero Harry P....

Oh wait, but I am, and it's quite appropriate too. Harry was not a hero by himself; Harry was inevitable, he existed because of evil. Furthermore, Harry did not solve evil by himself. He needed Hermione (the scholars), he needed Ron (umm....), he needed Marcel (ummm....). Likewise, evil has Voldemort (the impact factors), Death Eaters (ummm....)... Okay, okay, let me stop pushing the parallel before it gets embarrassing. Point is, Harry was insensitive, clumsy, in many ways naive. And so is Plan S. Harry did not want to have to fight Voldemort. But evil demanded Plan S, ummm, Harry to exist.

So, with the big Plan S event tomorrow in The Netherlands I am trying to organize my thoughts. We've seen wonderful discussions over the past month, which have highlighted the full setting in a lot of detail. Just this week, there was a nice overview of how learned societies do not make a profit by making a profit but spending it on important things (doi:10.1073/pnas.1900359116 and a draft blog analysis). Neither provides all the details, partly because this publishing world is not fully transparent.

Another wonderful aspect of the effect of Plan S is that people seriously talk about Open Science. Many against the current Plan S still find Open Science important, and the details of the arguments are exciting and complex. I understand most of the concerns, tho I do not believe all are realistic. For example, I honestly do not believe that researchers would turn their (financial) back on their learned societies if they moved to a full OA model (an actual argument I have heard). But then again, I'm naive myself.
Preprint servers
And people come with suggestions. Sadly, we have not seen enough of them since we started discussing Open Access, now almost 20 years ago. Fairly, better late than never, but I wish people realized that Harry was desperate in his last year at Hogwarts and someone had to do it. All other students at Hogwarts kept quiet (in movie 7, you can hear some students suggest alternatives to the final battle...)

Now, last week Plan U suggested that preprint servers provide a better solution. I disagree. The current Plan U is too risky. I tweeted some considerations yesterday, which I'll put below. Let me put up front, I like preprint servers, see the 21st tweet.
  1. scholars discussing #Plan_S would do good to study the history of source code... that too started with free software ("shareware") but people quickly realized that did not work, and the community moved to #opensource 1/n
  2. Ignoring that free access is not enough and that you need open licenses is important: learn from history, don't make the same mistakes again. CC-BY-ND is not a proper open license, no license even worse. 2/n
  3. no, think about the role of preprints. First, the name preprint already makes clear it is not the same as a print. I don't care about the journal formatting, but I do care about the last edits. 3/n
  4. @jbrittholbrook used that argument in favor of ND clauses: yes, it *is* essential that we know that the version we read is accurate. Versioning is essential, changelogs even more. Is the latest preprint identical (except formatting)? With/-out ND clauses, this is critical. 4/n
  5. currently, without much effort and therefore high cost, I cannot reliably determine if a preprint version is identical (except formatting) to the published version. Those last changes are essential: that's the added value of the journal editorial role. 5/n
  6. but let's assume this gets solved (repeated errors by commercial publishers do not bode well). How about the #openscience rights (reuse, modify, reshare)? Many preprints do not require an open license. Without an open license it's merely shareware. 6/n
  7. Free reads (also temp free by journals) is nice, except it's only thinking about now, not tomorrow. It's thinking only about yourself, not others. 7/n
  8. With a shareware article you are not allowed to share with your students. They need to download that themselves (fair, doable). You cannot include it in your coursepack. This is what reuse is about. 8/n
  9. With a shareware article you are not allowed to change it. No change of format (to match your coursepack), no text-mining, no data extraction, etc. This is what the right to modify is about. 9/n
  10. With a shareware article you are not allowed to redistribute it. I already mentioned courseware, but libraries are affected by this. But resharing is also about changing your improved version. 10/n
  11. Resharing after removal of that glaring typo. After rewriting German into English. Old-English into modern English. After fixing this number typo in that table that caused you time to figure out what the hell the authors were thinking (true story). 11/n
  12. These three core #openscience rights (reuse, modify, reshare) are essential to science. Just think what would happen if you could not use a new theory/method published? Would you accept that? 12/n
  13. My guess is: No, but you do when it comes to articles. Why? Is money more important than the essence of doing science? Are society activities more important than these basic things? I hope not. 13/n
  14. this is not a discussion of the now. This was 1947 14/n:
  15. of course, one can argue that if you can read the paper, you have all the access you need. But we know this is false. Text mining is essential. Reformatting, data extraction, is essential. 15/n
  16. We now spend millions on extracting knowledge from articles, bc we decided a PDF was the proper way to share knowledge. Disallowing that makes it even more expensive. Money that could be spent on actual research. 16/n
  17. now back to preprints and preprints as a replacement to openaccess articles. I think you see where I am going. 17/n
    1. a preprint without a proper license is not #openscience and does not optimally help raise the level of science.
    2. a preprint not identical (except formatting) to the published version, is not a replacement of the published version 18/n
  19. unfortunately, many journal-preprint server combinations simply do not guarantee a way forward. That must be solved and currently is not an alternative to #Plan_S. Current preprint servers have a different purpose: release soon. 19/n
  20. preprints have many (better) alternatives (open notebook science, #opensource, #etc, etc), but they can be a step forward towards #openscience, but if, and only if, they follow the basics of doing (open) science 20/n
  21. I'll wrap up this thread with linking to my first preprint, of Aug 2000: … It's part of the Chemistry Preprint Server (CPS) archive, hosted by Elsevier. More about CPS in #Scholia: … 21/21
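The check from tweet 5, whether a preprint is identical (except formatting) to the published version, can be sketched if both versions are available as plain text. That availability is exactly the assumption that often fails in practice. Here "formatting" only means whitespace and line breaks, which is much simpler than real journal typesetting, and the function names and example sentences are hypothetical:

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace so that pure layout changes are ignored."""
    return re.sub(r"\s+", " ", text).strip()


def same_except_formatting(preprint, published):
    """True if the two versions only differ in whitespace and line breaks."""
    return normalize(preprint) == normalize(published)


def last_edits(preprint, published):
    """List the word-level changes between the two versions."""
    a = normalize(preprint).split(" ")
    b = normalize(published).split(" ")
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


# Made-up example sentences, for illustration only
preprint = "The yield was 45%\nat room temperature."
published = "The yield  was 54% at room temperature."

print(same_except_formatting(preprint, published))
print(last_edits(preprint, published))
```

A list of such changes is in effect the changelog the thread asks for. Without an open license you are not even allowed to compute it.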