Wednesday, April 24, 2019

Open Notebook Science: the version control approach

Jean-Claude Bradley pitched the idea of Open Notebook Science, or Open-notebook science as the proper spelling seems to be. I have used notebooks a lot, but ever since I went digital, the use went down. During my PhD studies I still used them extensively. But in the process I changed my approach, influenced by open source practices.

After all, open source has a long history of version control, where commit messages explain why some change was made. People who have ever looked at my commits know that they tend to be small, and that my messages describe the purpose of each commit.

That is my open notebook. It is essential to record why a certain change was made and what exactly that change was. Trivial with version control. Mind you, version control is not limited to source code. Using the right approaches, data and writing can easily be tracked with version control too. Just check, for example, my GitHub profile. You will find journal articles being written and data being collected, just as if they were equal research outputs (they are).

Another great example of version control for writing and data is provided by Wikipedia and Wikidata. Now, some changes I found hard to track there: when I asked the SourceMD tool (great work by Magnus Manske) to create items for books, I wanted to see the changes made. The tool did link to the revisions made at some point, but this service integration seems to break down now and then. Then I realized that I could use the EditGroups tool directly (HT to whoever wrote that), and found this specific page for my edits, which includes not just those via SourceMD but also all edits I made via QuickStatements (also by Magnus):

If only I could give a "commit message" with each QuickStatements job I run. Can I?

Saturday, April 13, 2019

Bioclipse on the command line

Screenshot of Bioclipse 2.
Over the past seven years there has been a lot of chemistry interoperability work I have done using Bioclipse (doi:10.1186/1471-2105-8-59, doi:10.1186/1471-2105-10-397). The code is based on Eclipse, which gives a great GUI experience, but also turned out hard to maintain. Possibly, that was because of a second neat feature: you could plug libraries into the Python, JavaScript, and Groovy scripting environment, which allows people to automate things in Bioclipse. Over the course of time, so many libraries have been integrated, making so many scientific toolkits available at your fingertips. Of the three programming languages, I have used Groovy the most, being close to the Java language, but with a lot of syntactic goodies.

In fact, I have blogged about the scripts I wrote on many occasions, and in 2015 I wrote up a few blog posts on how to install new extensions:

But publishing and installing new Bioclipse 2.6.2 extensions remained complicated (installing Bioclipse itself is quite trivial). And that while the scripts are so useful, and I need others to start using them. I do not scale. Second, when I cite these scripts, they are too hard to use for reviewers and readers. To get some idea of a small subset of the functionality, read our book A lot of Bioclipse Scripting Language examples.

So, last x-mas I set out with the wish to have others run my scripts much more easily and, second, to be able to run them from the command line. To achieve that, installing and particularly publishing Bioclipse extensions had to become much easier. Maybe as easy as Groovy just Grab-bing the dependencies from the script itself. So, Bioclipse available from Maven Central, or so.

Of course, this approach would likely lose a lot of wonderful functionality, like the graphical UX, the plugin system, the language injection, and likely more. So, one important requirement was that any script used on the command line must be identical to the script in Bioclipse itself. Well, with a few permissible exceptions: we are allowed to inject the Bioclipse managers manually.

Well, of course, I would not have been blogging this had I not succeeded in reaching these goals in some way. Indeed, following up on a wonderful metaRbolomics meeting organized by de.NBI (~ ELIXIR Germany), and the powerful plans discussed with Emma Schymanski (and some ongoing work on persistent toxicants), and, fairly, actually not drowning in failed deadlines, just regularly way behind deadlines, and since I have a research line to run, I dived into hackmode. In some 14 hours, mostly in the evening hours of the past two days, I got a proof of principle up and running. The name is a reference to all the wonderful linguistic fun we had when I worked in Uppsala, thanks to Carl Mäsak, e.g. discussing the term Bioclipse Scripting Language and Perl 6.

It is not available yet from Maven Central, so there is a manual mvn clean install involved at this moment, but after that (the command installs it in your local Maven repository, which will be recognized by Groovy), you can get started with something like (I marked in blue the extra sugar needed on the command line; the black code runs as is in Bioclipse 2.6.2):


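// Command-line only: inject the Bioclipse manager manually
workspaceRoot = "."
def cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

// From here on, the script runs as is in Bioclipse 2.6.2
list = cdk.createMoleculeList()
println list
println cdk.fromSMILES("COC")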

What now?
In the future, once it is available on Maven Central, you will be able to skip the local install command, and @Grab will just fetch things from that online repository. I will be tagging version 0.0.1 today, as I got my important script running that takes one or more SMILES strings, checks Wikidata, and makes QuickStatements to add missing chemicals. The first time you (maybe) saw that was three years ago, in this blog post.
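To give a rough idea of what such a script boils down to, here is a minimal sketch in plain Groovy. It is not my actual script: the two helper names are made up for this illustration, nothing is sent to Wikidata, and no Bioclipse manager is involved. It only builds the SPARQL query one could use to look up a SMILES (P233 is the Wikidata property for canonical SMILES) and the QuickStatements commands that would create a missing item.

```groovy
// Hypothetical helpers for illustration only; they just build strings.

// SPARQL query asking Wikidata which items already carry this SMILES.
// Assumes the SMILES contains no backslashes or quotes that need escaping.
String wikidataQuery(String smiles) {
    "SELECT ?item WHERE { ?item wdt:P233 \"${smiles}\" }"
}

// QuickStatements (version 1, tab-separated) commands that create a new
// item and attach the SMILES to it.
String quickStatements(String smiles) {
    ["CREATE", "LAST\tP233\t\"${smiles}\""].join("\n")
}

println wikidataQuery("COC")
println quickStatements("COC")
```

A real script would of course only emit the QuickStatements when the query returns no item, and would add more statements than just the SMILES.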

You may wonder: why?? I asked myself the same thing, but over the past 24 hours I came up with a few answers, which may sketch where this is going:

  1. that BSL book can actually show running the code and show the output in the book, just like with my CDK book;
  2. maybe we can use Bioclipse managers in Nextflow;
  3. Bioclipse offers interoperability layers, allowing me to pass a chemical structure from one Java library to another (e.g. from the CDK to Jmol to JOELib);
  4. it allows me to update library versions without having to rebuild a full new Bioclipse stack (I'm already technically unable, let alone timewise unable);
  5. I can start sharing Bioclipse scripts with articles that people can actually run; and,
  6. all scripts are compatible, and all extensions I make can be easily copied into the main Bioclipse repository, if there ever will be a next major Bioclipse version (seems unlikely now).

Now, it's just being patient and migrating manager by manager. It may be possible to use the existing manager code, but that comes with so much language injection that I decided to just take advantage of Open Science and copy/paste the code. Most of the code is the same, minus progress monitors, and replacing Eclipse IFile code with regular Java code. But there are tons of managers, and reaching even 50% coverage will take, at the speed I can offer, months. Therefore, I'll focus on scripts I share with others, and on reuse and reproducibility.

More soon!

Sunday, April 07, 2019

History of the term Open Science #1: the early days

Screenshot of the Open Science History group on CiteULike.
Open Science has been around for some time. Before Copyright became a thing, knowledge dissemination was mostly limited by how easily you could get knowledge from one place to another. The introduction of Copyright changed this. The question was no longer how to get people to know the new knowledge, but how to get people to pay for it. One misconception, for example, is that publishing is a free market. Yes, you can argue that you can publish anywhere you like (theoretically, at least, but reality says otherwise), but the monopoly is in getting access: for every new fact (and republishing the same fact is a faux pas), there is exactly one provider of that fact.

Slowly this is changing, but only slowly. What this really needs is open licenses, just like open source licenses. Licenses that allow fixing typos, allow resharing with your students, etc.

But contrary to what has been prevalent in the Plan S discussion, these ideas are not new. And people have been trying Open Science for more than two decades already.

I have been trying to dig up the oldest references (ongoing effort) of the term Open Science (in the current meaning), and had a CiteULike group for that. But CiteULike is shutting down, so I will blog the references I found, and add some context.

A first article to mention is this 1998 article that mentions Open Science: Paul A. David, Common Agency Contracting and the Emergence of "Open Science" Institutions, The American Economic Review, Vol. 88, No. 2 (May 1998), pp. 15-21. Worth reading, but it does require reading some of the cited literature.

The following two magazine articles took the term Open Science to a wider public, in reply to a conference held at Brookhaven National Laboratory:

I would also like to note that the website by Dan Gezelter went online in the late nineties already, which I have used in various of my source code projects, and, of course, also has been used by the Chemistry Development Kit from the start.

Wednesday, April 03, 2019

BioSchemas CreativeWork annotation in Bioconductor Vignettes

Since the EU BioHackathon in Paris last year, I've picked up Bioschemas stuff more extensively, to help the ELIXIR Metabolomics and Toxicology (in development) communities get their stuff more FAIR. We could annotate training material already (see this ELIXIR NL post), but the big boon was annotation of vignettes on Bioconductor. Over the past 2-3 months I have been exploring this, and on Monday at the #metaRbolomics meeting in Wittenberg, with a room full of R users, I got the right pointers to a promising lead.

So, because the vignettes are generated differently than Markdown on GitHub, I had to find the right hooks. In the end, I found these in one vignette adding a Google Analytics tracker in the header of the file. Bingo!

Screenshot of Google's Structured Data Testing Tool.
So, here's how to do it. The R package setup allows adding custom HTML to the generated HTML, either in the header or at the start or end of the body. I went for the header. But I had to wait two days for the Bioconductor website to make a new BridgeDbR package binary (1st day), and for it to update the website (2nd day).

The HTML snippet (saved as bioschemas.html) to add is basically a <script> element with a fragment of JSON-LD:

<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "CreativeWork",
  "about": "This tutorial describes how to use the BridgeDbR package for identifier mapping.",
  "name": "BridgeDbR Tutorial",
  "author": { "@type": "Person", "name": "Egon Willighagen" },
  "difficultyLevel": "beginner",
  "keywords": "ELIXIR RIR, BridgeDb"
}
</script>

The other half of the story is to instruct the HTML generation pipeline to add it, which is done with this bit of YAML in your Markdown file (part of it you should already have):

    toc_float: true
    includes:
      in_header: bioschemas.html

Check the full folder here.
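Since it takes a couple of days before a change shows up on the Bioconductor website, it may be worth checking locally that the JSON-LD at least parses. The Groovy sketch below is one way to do that, not part of the BridgeDbR setup: the html string is a trimmed stand-in for the content of bioschemas.html, and the regular expression assumes a single script element of exactly this form.

```groovy
import groovy.json.JsonSlurper

// Stand-in for the content of bioschemas.html (trimmed for brevity);
// in practice you could read it with new File("bioschemas.html").text
def html = '''<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "CreativeWork",
  "name": "BridgeDbR Tutorial",
  "keywords": "ELIXIR RIR, BridgeDb"
}
</script>'''

// Pull out whatever sits between the <script> tags
def matcher = html =~ /(?s)<script type="application\/ld\+json">(.*?)<\/script>/
assert matcher.find() : "no JSON-LD script element found"

// parseText throws a JsonException when the fragment is not valid JSON
def jsonld = new JsonSlurper().parseText(matcher.group(1))

println jsonld['@type']
println jsonld.name
```

This only tells you the fragment is well-formed JSON. Whether the properties make sense as Bioschemas markup is still something for a tool like Google's Structured Data Testing Tool.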