Sunday, April 26, 2015

"Open Data in Science"

Recently, I got invited to a meeting of Eindhoven's Social Media Club, which has interesting meetings in the knowledge city capital of The Netherlands [ref]. This month's topic was Open Data and I was asked to present Open Data in research, which I eagerly accepted. I quite liked the title too: The great wide Open Data.

I very much enjoyed the other presentation too, mostly by Allard Couwenberg, who gave an excellent introduction to Open Data, which simplified my presentation, allowing me to focus on the role of Open Data in research and possibly at universities. For example, I discussed that I think we can improve the quality of our education if we improve the access to knowledge for our students. I got great questions from the audience, which mostly consisted of people outside the scholarly community and included a few people who work with Open Data a lot. A full storify is available.

I have uploaded my slides to SpeakerDeck:

But I only sent the slides around today, because I just spent time (for the first time ever) annotating my slides with source information (the last two slides).

Also, for the first time, I really felt I could have spoken for much longer. While I was able to mention a number of Open Data initiatives, like the Open Knowledge Foundation and its Dutch Open Science working group, Wikidata and Wikidata4Research, the Blue Obelisk movement, the Open Notebook Science Challenge, Open Source Malaria, and crowdsourcing initiatives like Mark2Cure, I realized there is so much around nowadays that this can no longer be covered in a single presentation.

Congrats to the scholarly Open Data community!

Saturday, April 18, 2015

Bioclipse 2.6.2 with recent hacks #1: Wikidata & Linked Data Fragments

Bioclipse dialog to upload chemical structures to an OpenTox repository.

We chem- and bioinformaticians have it easy when it comes to Open Science. Sure, writing documentation, doing unit testing, etc., takes a lot of time, but testing some new idea is done easily. Yes, people got used to that, so trying to explain that doing it properly actually takes long (documentation, unit testing) can be rather hard.

Important for this is a platform that allows you to experiment easily. For many biologists this environment is R or Python. To me, with most of the libraries important to me written in Java, this is Groovy (e.g. see my Groovy Cheminformatics book) and Bioclipse (doi:10.1186/1471-2105-8-59). Sometimes these hacks grow to be full papers, like what started with OpenTox support (doi:10.1186/1756-0500-4-487), which even paved (for me) the way to the eNanoMapper project!

But often these hacks are just for me personally, or at least initially. However, I have no excuse to not make them available to a wider audience too. Of course, the source code is easy, and I normally have even the smallest Bioclipse hack available somewhere on GitHub (look for bioclipse.* repositories). But it is getting even better, now that Arvid Berg (Bioclipse team) gave me the pointers to ensure you can install those hacks, taking advantage of Uppsala's build system.

So, from now on, I will blog how to install Bioclipse hacks I deem useful for a wider audience, starting with this post on my Wikidata/Linked Data Fragments hack I used to get more CAS registry number mappings to other identifiers.

Install Bioclipse 2.6.2
The first thing you need is Bioclipse 2.6.2. That's the beta release of Bioclipse, and required for my hacks. From this link you can download binary nightly builds for GNU/Linux, MS-Windows, and OS/X. For the first two, 32- and 64-bit builds are available. You may need to install Java; version 1.7 should do fine. Unpack the archive, and then start the Bioclipse executable. For example, on GNU/Linux:

  $ tar zxvf Bioclipse.2.6.2-beta.linux.gtk.x86.tar.gz
  $ cd Bioclipse.2.6.2-beta/
  $ ./bioclipse

Install the Linked Data Fragments manager
The default update site already has a lot of goodies you can play with. Just go to Install → New Feature.... That will give you a nice dialog like this one (which allows you to install the aforementioned Bioclipse-OpenTox feature):

But that update site doesn't normally have my quick hacks. This is where Arvid's pointers come in, which I hope to carefully reproduce here so that my readers can install other Bioclipse extensions too.

Step 1: enable the 'Advanced Mode'
The first step is to enable the 'Advanced Mode'; that is, unless you are advanced, forget about this. Fortunately, the fact that you haven't given up on reading my blog yet is a good indication that you are advanced. Go to the Window → Preferences menu and enable the 'Advanced Mode' in the dialog, as shown here:

When done, click Apply and close the dialog with OK.

Step 2: add an update site from the Uppsala build system
The first step enables you to add arbitrary new update sites, like update sites available from the Uppsala build system, by adding a new menu option. To add new update sites, use this new menu option and select Install → Software from update site...:

By clicking the Add button, you get to this dialog where you should enter the update site information:

This dialog will become a recurrent thing in this series, though the content may change from time to time. The information you need to enter is (the name is not too important and can be something else that makes sense to you):

  1. Name: Bioclipse RDF Update Site
  2. Location:

After clicking OK in the above dialog, you will return to the Available Software dialog (shown earlier).

Step 3: installing the Linked Data Fragments Feature
The Available Software dialog will now show a list of features available from the just added update site:

You can see the Linked Data Fragments Feature is now listed, which you can select with the checkbox in front of the name (as shown above). The Next button will walk you through a few more pages in this dialog, providing information about dependencies and a page that requires you to accept the Open Source licenses involved. At the end of these steps, it may require you to restart Bioclipse.

Step 4: opening the JavaScript Console and verifying the new extension is installed
Because the Linked Data Fragments Feature extends Bioclipse with a new, so-called manager (see doi:10.1186/1471-2105-10-397), we need to use the JavaScript Console (or Groovy Console, or Python Console, if you prefer those languages). Make sure the JavaScript Console is open, or open it via the menu Windows → Show View → JavaScript Console. Then type man ldf in the console view, which should result in something like this:

You can also type man ldf.createStore to get a brief description of the method I used to get a Linked Data Fragments wrapper for Wikidata in my previous post, which is what you should reread next.

Have fun! I am looking forward to hearing how you use Linked Data Fragments with Bioclipse!

Chemistry Central and the ORCID identifier

If you are a scientist you have heard about the ORCID identifier by now. If not, you have been focusing on groundbreaking research and isolated yourself from the rest of the world, just to make it perfect and get that Nobel prize next year. If you have been working on impactful research, Nobel prize-worthy, and have been blogging and tweeting about your progress, as a good Open Scholar, you know ORCID is the DOI for "research contributors" and you already have one yourself, and probably also that T-shirt with your own identifier. Mine is 0000-0001-7542-0286, and almost 1.3M other authors got one too. The list of ORCIDs on Wikipedia (and Wikidata) is growing, thanks to Andy Mabbett, who also made it possible to add your ORCID on WikiPathways.

Anyway, what I was pleased to see today is that you can now log in to the Chemistry Central article submission system with your ORCID identifier (notice the green icon):

Many other publishers allow logging in with your ORCID too, which benefits many:

  1. authors who just enter a list of ORCID identifiers, instead of a long list of author names and affiliations
  2. publishers, which have a simpler submission system and get more accurate information about submitters
  3. funding agencies which can more easily track what is done with the research funding
  4. research institutes which can more easily track what their employees are studying

Don't have one yet? Get your very own ORCID here.

CC-BY with the ACS Author Choice: CDK and Blue Obelisk papers liberated

Screenshot of an old CDK-based JChemPaint, from the first CDK paper. CC-BY :)

Already a while ago, the American Chemical Society (ACS) decided to allow the Creative Commons Attribution license (version 4.0) to be used on their papers, via their Author Choice program. ACS members pay $1500, which is low for a traditional publisher. While I would even rather see them move to a gold Open Access journal, it is a very welcome option! For the ACS business model it means a guaranteed sale of some 40 copies of this paper (at about $35 each), because it will not immediately affect the sale of the full journal (much). Some papers may sell more than that had the paper remained closed access, but for many papers that sounds like a smart move money-wise. Of course, they also buy themselves some goodwill and green Open Access is just around the corner anyway.
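To make that back-of-the-envelope estimate explicit, here is a small Python sketch that uses only the figures quoted above (the $1500 fee, about $35 per copy, and some 40 copies). The variable names are chosen for illustration, and the numbers are rough estimates rather than ACS data.

```python
# Figures quoted in the post, in US dollars (rough estimates, not ACS data)
author_choice_fee = 1500   # what an ACS member pays for Author Choice
price_per_copy = 35        # approximate price of a single paper
estimated_copies = 40      # the "some 40 copies" mentioned above

# Revenue from selling the estimated number of single copies
revenue_from_copies = price_per_copy * estimated_copies

# Number of single-copy sales needed to match the Author Choice fee
break_even_copies = author_choice_fee / price_per_copy

print(revenue_from_copies)          # 1400
print(round(break_even_copies, 1))  # 42.9
```

So the fee is roughly equivalent to the 40-odd single-copy sales mentioned above.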

Better, perhaps, is that you can also use this option to make a past paper Open Access under a CC-BY license! And that is exactly what Christoph Steinbeck did with five of his papers, including two on which I am co-author. And these are not the least papers either. The first is the first CDK paper from 2003 (doi:10.1021/ci025584y), which featured a screenshot of JChemPaint shown above. Note that in those days, the print journal was still the target, so the screenshot is in gray scale :) BTW, given that this paper is cited 329 times (according to ImpactStory), maybe the ACS could have sold more than 40 copies. But for me, it means that finally people can read this paper about Open Science in chemistry, even after so many years. BTW, there is little chance the second CDK paper will be freed in a similar way.

The second paper that was liberated this way is the first Blue Obelisk paper (doi:10.1021/ci050400b), which was cited 276 times (see ImpactStory):

This screenshot nicely shows how readers can see the CC-BY license for this paper. Note that it also lists that the copyright is with the ACS, which is correct, because in those days you commonly gave away your copyright to the publisher (I have stopped doing this, bar some unfortunate recent exceptions).

So, head over to your email client and let them know you also want your JCICS/JCIM paper available under a CC-BY license! No excuse anymore for not making your seminal work in cheminformatics available as gold Open Access!

Of course, submitting your new work to the Journal of Cheminformatics is cheaper and has the advantage that all papers are Open Access!

Tuesday, April 14, 2015

Ambit.js 0.0.2 release: a scatterplot

Two weeks ago I made a second release of ambit.js, a small project to show the power of the eNanoMapper API (doi:10.5281/zenodo.16517). It still is a client library to use the API from JavaScript and a few weeks ago I posted a few screenshots. This post announces the 0.0.2 release, which has not changed a lot since the previous version, but now features online documentation:

It also has a number of online examples, which include the code behind the screenshots in the earlier post, but also a new scatterplot example (still using d3.js):

Now, this scatter plot shows basically that the data we are looking at does not show a particular correlation, but that was not really the hypothesis here anyway. However, it does show how hypotheses can be tested with the API and scatter plots.

Friday, April 10, 2015

Getting CAS registry numbers out of WikiData

doi 10.15200/winn.142867.72538

I have promised my Twitter followers the SPARQL query you have all been waiting for. Sadly, you had to wait for it for more than two months. I'm sorry about that. But, here it is:
    PREFIX wd: <>

    SELECT ?compound ?id WHERE {
      ?compound wd:P231s [ wd:P231v ?id ] .
    }

What this query does is ask for all things (let's call whatever is behind the identifier a "compound"; of course, it can be mixtures, ill-defined chemicals, nanomaterials, etc.) that have a CAS registry identifier. This query results in a nice table of Wikidata identifiers (e.g. Q47512 is acetic acid) and matching CAS numbers, 16298 of them.
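A table of this size is easy to sanity-check, because every CAS registry number ends in a check digit: multiply the other digits by 1, 2, 3, ... counting from the right, sum the products, and take the result modulo 10. The Python sketch below is not part of the Bioclipse script; the function name is_valid_cas is chosen for illustration, and it only tests the format and the check digit, not whether the number is actually assigned to a substance.

```python
import re


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has CAS format and a correct check digit."""
    # CAS numbers look like NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("64-19-7"))    # True (acetic acid)
print(is_valid_cas("64-19-8"))    # False (wrong check digit)
print(is_valid_cas("not-a-cas"))  # False (wrong format)
```

For acetic acid, 64-19-7, the sum is 9×1 + 1×2 + 4×3 + 6×4 = 47, and 47 modulo 10 gives the check digit 7.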

Because Wikidata is not specific to the English Wikipedia, CAS numbers from other origin will show up too. For example, the CAS number for N-benzylacrylamide (Q10334928) is provided by the Portuguese Wikipedia:

I used Peter Ertl's Wikipedia Chemical Structure Explorer (doi:10.1186/s13321-015-0061-y) to confirm this compound indeed does not have an English page, which is somewhat surprising.

The SPARQL query uses a predicate specifically for the CAS registry number (P231). Other identifiers have similar predicates, like for PubChem compound (P662) and Chemspider (P661). That means Wikidata can become a community crowdsource of identifier mappings, which is one of the things Daniel Mietchen, a few others, and I proposed in this H2020 grant application (doi:10.5281/zenodo.13906). The SPARQL query is run by the Linked Data Fragments platform, which you should really check out too, using the Bioclipse manager I wrote around that.

The full Bioclipse script looks like:
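    wikidataldf = ldf.createStore(
      // (the Linked Data Fragments endpoint URL is not preserved here)
    )

    // P231 CAS
    identifier = "P231"
    type = "cas"

    sparql = """
    PREFIX wd: <>

    SELECT ?compound ?id WHERE {
      ?compound wd:${identifier}s [ wd:${identifier}v ?id ] .
    }
    """
    mappings = rdf.sparql(wikidataldf, sparql)

    // recreate an empty output file
    outFilename = "/Wikidata/${type}2wikidata.csv"
    if (ui.fileExists(outFilename)) {
      // (the lines that remove and recreate the file are not preserved here)
    }

    // save to a file
    for (i=1; i<=mappings.rowCount; i++) {
      // drop the first three characters of the compound identifier
      wdID = mappings.get(i, "compound").substring(3)
      // assumption: the call wrapping this CSV line is not preserved here;
      // ui.append(file, text) is a guess at it
      ui.append(outFilename,
        wdID + "," + mappings.get(i, "id") + "\n"
      )
    }

BTW, of course, all this depends on work by many others including the core RDF generation with the Wikidata Toolkit. See also the paper by Erxleben et al. (PDF).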
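For readers who prefer Python, the save-to-file step of the Bioclipse script can be sketched as follows. This is an illustration only, not the code that was run: the function name, the sample row, and the assumption that compound identifiers come back with a three-character "wd:" prefix (suggested by the substring(3) call) are all assumptions.

```python
import csv
import io


def mappings_to_csv(rows, prefix="wd:"):
    """Turn (compound, CAS number) pairs into CSV text, one mapping per line.

    The "wd:" prefix is an assumption based on the substring(3) call
    in the Bioclipse script.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for compound, cas_number in rows:
        # Strip the prefix so only the Wikidata identifier (e.g. Q47512) remains
        if compound.startswith(prefix):
            compound = compound[len(prefix):]
        writer.writerow([compound, cas_number])
    return buffer.getvalue()


# Hypothetical sample row: Q47512 is acetic acid, CAS number 64-19-7
sample_rows = [("wd:Q47512", "64-19-7")]
csv_text = mappings_to_csv(sample_rows)
print(csv_text, end="")  # Q47512,64-19-7
```

Building the text in memory first keeps the sketch independent of any particular workspace layout; writing csv_text to a file is then a single call.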

Erxleben, F., Günther, M., Krötzsch, M., Mendez, J., Vrandečić, D., 2014. Introducing wikidata to the linked data web. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (Eds.), The Semantic Web – ISWC 2014. Vol. 8796 of Lecture Notes in Computer Science. Springer International Publishing, pp. 50-65.

Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research.

Ertl, P., Patiny, L., Sander, T., Rufener, C., Zasso, M., Mar. 2015. Wikipedia chemical structure explorer: substructure and similarity searching of molecules from Wikipedia. Journal of Cheminformatics 7 (1), 10+.