Wednesday, July 05, 2017

new paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc) and how biological systems react to them. Basically, testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. The bottom line was that it was not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, they selected the chemicals in that study based on chemical diversity. All the details can be found in this new paper.

It's important to note that these findings do not invalidate the QSAR concept, but just show that the compounds were very unfortunately selected, making exploration of this idea impossible, by design.
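To make "high chemical diversity" a bit more concrete: one common way to quantify it (not necessarily the analysis used in the paper) is the mean pairwise Tanimoto similarity between structural fingerprints. The sketch below assumes fingerprints are given as sets of "on" bit positions; the toy fingerprints are made up for illustration and do not come from the Cmap data.

```python
from itertools import combinations


def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(fingerprints: list[set[int]]) -> float:
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy fingerprints: a congeneric series shares most bits,
# a diversity-selected set shares almost none.
congeneric = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}]
diverse = [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 1}]

print(mean_pairwise_similarity(congeneric))
print(mean_pairwise_similarity(diverse))
```

A low mean similarity means few compounds have close structural neighbours, which is exactly the situation in which a regression model has little to interpolate between.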

However, using the transcriptomics data and a method developed by Juuso Parkkinen, we were able to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not been able to back up further findings with supporting evidence yet. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses, but a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to exposure to toxicants.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8.

Saturday, June 24, 2017

The Elsevier-SciHub story

I blogged earlier today why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93% and notes that 10% of the scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And then if the STM industry spreads FUD for its own and only its own good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly, as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub was going to be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for presidency).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occurring topics graph" provided by Scholia at the time of writing:

It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud University library certainly did not have all journals, and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big Package deals improved access, but created a vendor lock-in. And we're paying Big Time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw, for PhD students this was very common too. In fact, they regularly visited other universities, just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal cost for the visited university, of course). Nothing much has changed, despite the fact that in this electronic age the cost should have gone down significantly, instead of up.

That Elsevier sues Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important: how does our society want to continue: do we want a fact-based society, where dissemination of knowledge is essential; or, do we want a society where power and money decide who benefits from knowledge.

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain outright FUD. In fact, it's outright lies. The fact that Nature does not call out those lies in their write up is very disappointing, indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed. Whether Open Science, or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library for continuing to push Open Access in Maastricht.

Sunday, June 11, 2017

You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what the research is that I do. Many see my work as engineering, but I vigorously disagree. But some days it's just too easy to give up rather than explain things yet again. The question came up several times again over the past few months, and it has been suggested that I make a choice. That's modern academia for you: you have to excel in something tiny, and complex and hard-to-explain ambition is losing out to the system based on funding, buzz words, "impact", and such. So, again, I am trying to make up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk to in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books from Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how well we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, I am tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with experience from others? Also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)

Saturday, June 10, 2017

New paper: "The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching"

This paper was long overdue. But software papers are not easy to write, particularly not follow-up papers. That actually seems a lot easier for databases. Moreover, we already publish too much. However, the scholarly community does not track software citations (nor data citations, but there seems to be a bit more momentum there; larger user group?). So, we need these kinds of papers, and just a versioned, archived software release (e.g. on Zenodo) is not enough. But, there it is, the third CDK paper (doi:10.1186/s13321-017-0220-4). Fifth, if you include the two papers about ring finding, also describing CDK functionality.

Of course, I could detail what this paper has to offer, but let's not spoil the article. It's CC-BY so everyone can readily access it. You don't even need Sci-Hub (but you are allowed to use it for this paper; it's legal!).

A huge thanks to all co-authors, to John for his work as release manager and his great performance improvements as well as code improvement, code clean up, etc, and to all developers who are not co-authors on this paper but contributed bigger or smaller patches over time (plz check the full AUTHOR list!). That list does not include the companies that have been supporting the project in kind, tho. Also huge thanks to all the users, particularly those who have used the CDK in downstream projects, many of which are listed in the introduction of the paper.

And, make sure to follow John Mayfield's blog with tons of good stuff.

Saturday, May 20, 2017

May 29, Delft, The Netherlands: "Open Science: the National Plan and you"

In less than ten days, a first national meeting is organized in Delft, The Netherlands, where researchers can meet researchers to talk about Open Science. Mind you, researcher is very broad: it is anyone doing research, at home (e.g. citizen science, or as a hobby), at work (company or research institute), or in an educational setting (university, HBOs, ...). After all, anyone benefits from Open Science (at least from that by others! "Standing on the shoulders of Open Science, ...")

The meeting is part of the National Plan Open Science (see also Open Science is already a thing in The Netherlands), which is a direct result of the Open Science meeting in Amsterdam during the Dutch presidency which resulted in the Amsterdam Call for action on Open Science.

The program for the #npos2017 meeting is very interactive. It starts with obligatory introductions, explaining how Open Science fits into the national future research landscape, but quickly moves to practical experiences from researchers, a Knowledge Commons session where everyone can show and discuss their Open Science works (with a free lunch: yes, #OpenScience and free lunches are compatible), a number of breakout sessions where the "but how" can be discussed and answered (topics in the image below), and a wrap up panel to wrap up the break out sessions, and a free drink afterwards.

During the Knowledge Commons I will join Andra Waagmeester (Micelio) and Yaroslav Blanter (Delft University) to show Wikidata, and how I have been using this for data interoperability for the WikiPathways metabolism pathways (via BridgeDb).

The meeting is free and you can sign up here. Looking forward to meeting you there!

Sunday, April 16, 2017

GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard), I am not a toxicology expert nor a regulator. I have the deepest respect for both, as these studies are among the most complex ones I am aware of. It makes rocket science look dull. However, I have quite some experience in relating chemical structure to properties and with knowledge integration, which is a prerequisite for understanding that relation. Nothing I do says what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported spilling of the compound with the name GenX in the environment and reaching drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:

Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are then two things: first, any spilling should not happen (I know this is controversial, as people are more than happy to repeatedly pollute the environment, just because of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and simply out of chemical interest, I started out searching for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this, with source code repositories, and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (to be fair, chemically they are different, and so may be their hazard and risk profiles). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
    A side topic... if you have not looked at yet, please do. It allows you to annotate (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:

I had a look around on the web for information, and there is not a lot. A Wikidata page with further identifiers then helps with tracking your steps. Antony Williams, previously of ChemSpider fame, now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for another bit of time. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links) that makes (public) data from ECHA available from REACH dossiers in a machine readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting, see doi:10.1038/nature.2016.19365). After creation of a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):

I reported this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note, this is only acute toxicity). The LD50 is the median lethal dose, but is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from making any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, which wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.

Friday, April 14, 2017

The ACS Spring disclosures of 2017 #2: some history

Bethany Halford adds some history about the sessions (see part #1):
    I believe Stu Borman was the first to cover the Division of Medicinal Chemistry’s First Time Disclosures symposium for C&EN, but it was Carmen Drahl who began the practice of hand-drawing and tweeting the clinical candidates as they were disclosed in real time. This seems like an oddball practice to folks who aren’t at the meeting. Why not just take a picture of the relevant slide? Well, that’s against the rules: There are signs all over the ACS National Meeting stating that photos, video, and audio recording of presentations are strictly prohibited. In San Francisco, symposium organizer Jacob Schwarz repeatedly reminded attendees that this was the case. Carmen’s brilliant idea to get around this rule was to simply draw the structures as they were presented, snap a photo, and then tweet it out.

    I’ve inherited the task since Carmen left the magazine a couple of years ago. I find it incredibly stressful. For an event that’s billed as a disclosure, the actual disclosing is fairly fleeting. The structures are often not on the screen for very long, and I’m never confident that I’ve got it 100% right. Last year in San Diego I tweeted out one structure and I heard the following day from Anthony Melvin Crasto, a chemist in India, that based on the patent literature he thought I had an atom wrong. I was certain that I had written this structure correctly, so I contacted the presenting scientist. He had disclosed the wrong structure!

    I agree that there should be some sort of database established afterwards, and I think you all have done great work on that front. I think you’ll find the pharmaceutical companies reluctant to help you out in any way. They guard these compounds so fiercely that it often makes me wonder why we have this symposium to begin with.

The ACS Spring disclosures of 2017 #1

At the American Chemical Society meetings drug companies disclose recent new drugs to the world. Normally, the chemical structures are already out in the open, often as part of patents. But because these patents commonly discuss many compounds, the disclosures are a big thing.

Now, these disclosure meetings are weird. You will not get InChIKeys (see doi:10.1186/s13321-015-0068-4) or something similar. No, people sit down with paper and manually redraw the structure. Like Carmen Drahl has done in the past. And Bethany Halford has taken over that role at some point. Great work from both! The Chemical & Engineering News has aggregated the tweets into this overview.

Of course, a drug structure disclosure is not complete if it does not lead to deposition in databases. The first thing is to convert the drawings into something machine readable. And thanks to the great work from John May on the Chemistry Development Kit and the OpenSMILES team, I'm happy with this being SMILES. So, we (Chris Southan and me) started a Google Spreadsheet with CCZero data:

I drew the structures in Bioclipse 2.6.2 (which has CDK 1.5.13) and copy-pasted the SMILES and InChIKey into the spreadsheet. Of course, it is essential to get the stereochemistry right. The stereochemistry of the compounds was discussed on Twitter, and we think we got it right. But we cannot be 100% sure. For that, it would have been hugely helpful if the disclosures included the InChIKeys!
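When copy-pasting identifiers into a spreadsheet by hand, a cheap sanity check is to verify that each cell at least has the shape of a standard InChIKey: 27 characters, in blocks of 14, 10, and 1 uppercase letters separated by hyphens. The sketch below is only a format check written for this post, not part of Bioclipse or the CDK; it cannot tell whether the key belongs to the intended structure or whether the stereochemistry is right. The example key is that of aspirin, used as a stand-in rather than one of the disclosed compounds.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(value: str) -> bool:
    """Return True if the string has the shape of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(value.strip()) is not None


# Aspirin, as a stand-in example.
print(looks_like_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
# A truncated paste fails the check.
print(looks_like_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA"))
```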

As I wrote before, I see Wikidata as a central resource in a web of linked chemical data. So, using the same code I used previously to add disclosures to Wikidata, I created Wikidata items for these compounds, except for one that was already in the database (see the right image). The code also fetches PubChem compound IDs, which are also listed in this spreadsheet.
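For readers who want to try something similar without Bioclipse: the sketch below generates QuickStatements (version 1, tab-separated) commands for one new compound item. It is an independent illustration, not the script used for this post. It assumes the Wikidata properties P31 (instance of), P233 (canonical SMILES), and P235 (InChIKey) and the item Q11173 (chemical compound); the function name is hypothetical, and aspirin is used as a stand-in compound.

```python
def quickstatements_for_compound(label: str, smiles: str, inchikey: str) -> str:
    """Return QuickStatements v1 commands that create one chemical compound item."""
    lines = [
        "CREATE",                        # start a new item
        f'LAST\tLen\t"{label}"',         # English label
        "LAST\tP31\tQ11173",             # instance of: chemical compound
        f'LAST\tP233\t"{smiles}"',       # canonical SMILES
        f'LAST\tP235\t"{inchikey}"',     # InChIKey
    ]
    return "\n".join(lines)


# Stand-in example (aspirin), not one of the disclosed compounds.
commands = quickstatements_for_compound(
    "example compound",
    "CC(=O)OC1=CC=CC=C1C(=O)O",
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
)
print(commands)
```

The output can be pasted into the QuickStatements tool; checking first that the InChIKey is not already in Wikidata avoids creating duplicates.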

The Wikidata IDs link to the SQID interface, giving a friendly GUI, one that I actually brought up before too. That said, until people add more information, it may be a bit sparsely populated:

But others are working on this series of disclosures too, and keep an eye on this blog post, as others may follow up with further information!

Saturday, April 01, 2017

Closed access book chapters, Bookmetrix, and job creations

Enjoying my Saturday morning (you can actually track down that I write more blog posts then than at any other time of the week) with a coffee (no, not beer, Christoph). Wanted to complete my Scholia profile (great work by Finn, arxiv:1703.04222, happy to have contributed ideas and small patches) a bit more (or perhaps that of the Journal of Cheminformatics), as that relaxes me, and nicely complements rerunning some Bioclipse scripts to add metabolite/compound data to Wikidata (e.g. this post). Because this afternoon I want to do some serious work, like write up outlines for a few cool grant applications. And if lucky, I may be able to do a bit of work on this below-the-radar project.

So, I started updating a "full work available at" for a peer-reviewed IEEE paper (doi:10.1109/BIBM.2014.6999367), as it is not gold Open Access, and I have to rely on green Open Access. Then I headed over to my ImpactStory profile and ran into a closed access book chapter with Tony, Sean, and Ola (doi:10.1007/978-1-62703-050-2_10). But I have no idea if I can put online a green Open Access version of this book chapter.

Now, why I am blogging this (and meanwhile, adding four new DTXSIDs to Wikidata) is two observations. First, I had not blogged about Bookmetrix yet, a cool project that reports the impact of book chapters. The ROI on writing book chapters I always considered as not so high, but then I saw the #altmetrics for this chapter:

Five citations is not a lot, but then I do not cite book chapters much either. But look at that number of downloads, 2.39 thousand! Wow!

But there is another angle to that. We regularly report our societal impact, nowadays. It's part of the Dutch Standard Evaluation Protocol, or at least selected by our research institute as something to assess researchers on. Hang on, no, citations are not part of that category. But this is: the chapter is sold for about 50 euro. Seriously? Yes, seriously. And apparently 2.39K people bought this chapter. I am not sure if I need to assume that this is mostly people buying the full book, which means the chapter is a lot cheaper. But the full book reports download numbers of above 50 thousand, so it seems not. Now, let's assume that a good part of the bought copies is via package deals and the average payment is half. That may sound high, but we ignore the 50k downloads for the full book to compensate for that.

Doing that math means that our joint book chapter contributed 60k euro to the European market. That's a full job the four of us created with this single book chapter. I'm impressed.
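For those who want to check the back-of-the-envelope math, here it is spelled out. The numbers are the ones from the text: 2.39 thousand downloads, a price of about 50 euro, and the assumption that the average payment is half of that.

```python
downloads = 2390             # "2.39 thousand" chapter downloads on Bookmetrix
list_price_eur = 50          # approximate price of the single chapter
average_paid_fraction = 0.5  # assumption from the text: average payment is half

revenue_eur = downloads * list_price_eur * average_paid_fraction
print(f"{revenue_eur:,.0f} euro")  # prints "59,750 euro", i.e. roughly 60k
```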

Thursday, March 30, 2017

March for Science #marchforscience

You cannot have missed it, and if you did, you know about it now. We're marching for science. Originating in the USA, the marches are spreading around the world, also in Europe. More than 400 was the count a week ago. One by one, European cities joined with initiatives. Science March Stockholm was the first to get my interest, but Science March Amsterdam followed soon after. So, no reason to return to Stockholm this April. Here's a map with all planned marches around the world:

Zooming in on Europe (well, part of it), we get this map:

Quite a bit of choice. We see several countries with multiple marches. The Netherlands shows the Amsterdam march, but ideas have been posed to organize a Science March in Maastricht too.

Well, I will be marching. For what? For the importance of apolitical, nonreligious facts about the world. Facts that can be proven true, but also for a world where people value facts, fulfilling the human rights for everyone, as facts don't care about race, gender, color, left, right, or nerdiness.

Our world is precious; humans and nature are precious. If we choose to destroy the world or if we choose to prosper mankind and nature, let it be because of neutral facts. Not wishful thinking, money, or politics.

Let's show that science (of any domain, not just life sciences, but also humanities, etc) is by everyone and for everyone. Access to knowledge is a human right, and is there to benefit everyone. The march is for everyone too: you do not have to be working in scientific research to join the march to express your wish to have a fact-based country.

April 22, Amsterdam and Maastricht! Join!

Tuesday, March 21, 2017

OpenAPI to the Ensembl example

Already many months ago I joined a (doi:10.1093/nar/gkv1116) workshop in Amsterdam, organized by Gert Vriend et al (see this coverage). I learned then how to register services, search, and that underneath JSON is used in the API to exchange information about the services. One neat feature is that it allows you to specify a lot of detail of the service calls.

Now, at the time we had already used OpenAPI (then still called Swagger) for Open PHACTS for some time, which we later picked up for other projects, like eNanoMapper (API), WikiPathways (API), and BridgeDb (API). OpenAPI configuration files also describe how web services work. So, the idea arose that it should be possible to convert the first to the second. Simple. I started a GitHub repository, but, of course, did not really have time to implement it.

Then, half a year ago, at the ELIXIR track meeting at the ECCB in The Hague (where I presented this BridgeDb poster), I spoke with people from ELIXIR-DK who were just starting a studentship scheme. This led to a project idea, then a proposal, and then a small, approved project, allowing me to fund Jonathan Mélius to work on this part-time, for about a man month of work, spread over several months.

Jonathan has been doing great work, and because we liked to demo the OpenAPI 2 bridge with a major European resource, Ensembl was suggested (which just published a paper on their core software). An OpenAPI for Ensembl was set up, which is going to be the primary input for the new tool:

The next step was to take the JSON defining the content of this page (you can find the URL to the JSON file at the top of that page, hosted on GitHub too), and convert that to fragments. That the approach works is shown by this test entry in

The observant eye will see that various bits of details of the descriptions of the API calls are annotated with EDAM ontology (doi:10.1093/bioinformatics/btt113) terms, a key feature of This information is currently not available in the OpenAPI JSON (we will be exploring how that specification could/should be extended to do this). Moreover, the webservice API methods need ontological annotation in the first place, and we will not be able to totally remove human involvement there.

The EDAM IRIs are still hard-coded in the conversion tool at this moment, but are being factored out into a secondary JSON file for now. So, the conversion tool will take two input JSON files, OpenAPI + EDAM annotation, and create JSON output. The latter can then be inserted into the JSON. We will work on something based on the API to automate that step too.
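To illustrate the two-input idea, here is a minimal sketch; it is not Jonathan's actual tool. It assumes an OpenAPI 2 (Swagger) document, where operations sit under `paths`, and a hypothetical annotation file that maps each `operationId` to a list of EDAM IRIs. The output layout, the function name, and the example path and IDs are all made up for illustration; the IRI shown is the EDAM root "Operation" term, used as a placeholder.

```python
import json

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch"}


def openapi_to_entries(openapi: dict, edam: dict) -> list[dict]:
    """Flatten an OpenAPI 2 document into one entry per operation and
    attach the EDAM IRIs listed for that operationId (if any)."""
    entries = []
    for path, path_item in openapi.get("paths", {}).items():
        for method, operation in path_item.items():
            if method not in HTTP_METHODS:
                continue  # skip shared "parameters" and similar keys
            op_id = operation.get("operationId", f"{method}:{path}")
            entries.append({
                "name": op_id,
                "description": operation.get("summary", ""),
                "method": method.upper(),
                "path": path,
                "edam": edam.get(op_id, []),
            })
    return entries


# Hypothetical inputs, normally read with json.load() from two files.
openapi_doc = {
    "swagger": "2.0",
    "paths": {
        "/lookup/id/{id}": {
            "parameters": [],
            "get": {"operationId": "lookupById", "summary": "Look up by ID"},
        }
    },
}
edam_annotations = {"lookupById": ["http://edamontology.org/operation_0004"]}

entries = openapi_to_entries(openapi_doc, edam_annotations)
print(json.dumps(entries, indent=2))
```

Keeping the annotations in a separate file means the OpenAPI JSON can be regenerated without losing the manual ontology work.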

So, we still have some work to do, but I'm happy with the current progress. We're well on track to complete this project before summer and actually get a long way with the ontology annotation, which was a secondary goal in the original plan.

Feedback welcome!

Saturday, March 11, 2017

What an Open Science project does: eNanoMapper deliverables archived on ZENODO

eNanoMapper has ended. It was my first EC-funded project as PI. It was great to run a three-year Open Science project at this scale. I loved the collaboration with the other partners, and would like to thank Lucian and Markus for their weekly coordination of the project! Lucian also reflected on the project in this blog post. He describes the successful completion of the project, and we partly owe that to the uptake of ideas, solutions, and approaches by the NanoSafety Cluster (NSC) community. Many thanks to all NSC projects, including for example NANoREG, who were very early adopters!

Our legacy is substantial, I think. I have blogged about some aspects in the past. The project's output includes RRegrs for scanning the regression model space, extensions of AMBIT for substances, tools on top of the APIs, visualizations with JavaScript, etc. Things have been done Open Source and you can find many repositories on GitHub, and we used Jenkins to autobuild various components, and not just source code, but also the eNanoMapper ontology. Several software releases are archived on ZENODO, the ontology is available from BioPortal, the Ontology Lookup Service, and AberOWL (and thanks to the operators for their support to get it properly online!).

Several publications have been published, along with many tutorials. On the website you could already access many of the deliverables of the project. And as of last week, all public deliverables are archived on ZENODO (HT to Lucian):

Next time, I want to see if we can get the deliverables published in, for example, Research Intentions and Outcomes journal.

Finally, I would like to thank everyone else in the Maastricht University team that worked on eNanoMapper: Cristian Munteanu, who was my first post-doc, Bart Smeets, Linda Rieswijk, Freddie Ehrhart, and part-time Nuno Nunes and Lars Eijssen. Without them I could not have completed our deliverables.

Sunday, March 05, 2017

Upcoming meeting: "Open science and the chemistry lab of the future"

Following the example by Henry Rzepa, here is an announcement of a meeting with a great program organized by the Beilstein Institut in Germany. The meeting does also mean I cannot attend another really important meeting, WikiCite, which has a partial overlap :(

At the Open science and the chemistry lab of the future meeting I will represent ELIXIR, which is quite a challenge as they are doing so much, and I only have so much time to cover that. Worse, I am only part-time working on specific ELIXIR tasks, but fortunately getting great help from Rob Hooft of the Dutch Techcenter for Life Sciences (DTL, practically the Dutch ELIXIR node).

I am very much looking forward to meeting friends and seeing people I have only yet met online, like Stuart Chalk (who recently published the CCZero Open Spectral Database) and Open Source Malaria's Matthew Todd. Oh, and if you cannot attend the meeting in person, the hashtag to follow is #BeilsteinOS. If you can join, you can register for the meeting here.

Sunday, February 19, 2017

Talk: "Making open science a reality, from a researcher perspective"

Slide from the presentation with a screenshot of the Woordenboek Organische Chemie.
Last week I was in Paris (wonderful, but like London, a city that makes you understand Ankh-Morpork) for the AgreenSkills+ annual meeting. AgreenSkills+ is a program for postdoc funding in France and the postdocs presented their works. Wednesday (#agreenskills) was a day to learn about Open Science, with other talks from Nancy Potinka and Ivo Grigorov from Foster Open Science, Martin Donnelly from the Edinburgh Digital Curation Centre about data management and the DMPonline tool, and Michael Witt of Purdue University about digital repositories and DataCite (which I should really make time to blog about too).

I was asked to talk about my experiences from a researcher perspective (which started with the Woordenboek Organische Chemie). Here are my slides:

Saturday, February 18, 2017

Open Science is already a thing in The Netherlands

It has been hard to miss: the Dutch National Plan Open Science (doi:10.4233/uuid:9e9fa82e-06c1-4d0d-9e20-5620259a6c65). It sets out an important step forward: it goes beyond Open Access publishing, which has become a tainted topic. After all, green Open Access does not provide enough rights. For example, teachers still cannot easily share green Open Access publications with their students.

I am happy I have been able to give feedback on a draft version, and hope it helped. During the weeks before the release I also looked at how the Open Science working group of the Open Knowledge International foundation(?) is doing, and I am happy that at least the Dutch mailing list is still in action. Things are a bit in flux, as the OKI is undergoing a migration to a new platform. Maybe more about that later.

But one of my main comments was that there already is a lot of Open Science ongoing in The Netherlands. And then I am not talking about all those scientists that already publish part of their work as (gold) Open Access, but the many researchers that already share Open Data, Open Source, or other Open research outputs. In fact, I started a public (CCZero) spreadsheet with GitHub repositories of Dutch research groups, which now also covers many educational groups, at our universities and "hogescholen". This now includes some forty(!) git repositories, mostly on GitHub but also on GitLab. Wageningen even has its own public git website!

Mind you, I had to educate myself a bit in the exact history of the term Open Science. It actually seems to go back to the USA Open Source community (see these references and particularly this article). And that's actually where I also knew it from, in particular from Dan Gezelter, founding author of the well-known Jmol viewer for small molecules and protein structures, and host of the domain.

Tuesday, January 03, 2017

Wikidata-powered citation lists with citation.js

I don't get as much time with the kids as I would like, but if your son is doing interesting coding projects it makes that a lot easier. One project he is working on is citation.js, a JavaScript library to edit bibliographies. It has become really powerful and totally awesome! We all hate formatting bibliographies and that every journal has its own format. LaTeX and Citation Style Language have done wonders here, but it all should be even simpler. As an author I want to be able to just give a DOI and that should be enough.

Or a Wikidata entity identifier.

And citation.js makes that last thing possible, and I spent some time with Lars to implement this for my homepage:

This is more or less what I had before too, but with everything hard-coded. The citation.js way allows me to give just a list of two entity IDs (Q27062312 and Q27062639) and citation.js outputs the above. I just have this snippet in the HTML:

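      <ul class="cite" id="cite1"></ul>
      <script class="code" type="text/javascript">
        // assumption: opt (the citation.js output options) is defined elsewhere on the page
        var wikidata = new Cite()
        wikidata.set( [ "Q27062312", "Q27062639" ] )
        var htmlOutput = wikidata.get( opt )
        // the output has its angle brackets escaped; turn them back into HTML
        // assumption: the result is meant to end up in the #cite1 list above
        document.getElementById( "cite1" ).innerHTML =
          htmlOutput.replace( /&(lt|#60);/g, '<' )
                    .replace( /&(gt|#62);/g, '>' )
      </script>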

The formatting is mostly done with a CSL template (which needs a hack to get it to output HTML), adapted to also output the DOI hyperlink and Altmetric icon (you can find the customized CSL in the HTML source code as CC-BY-SA 3.0). The citation.js library fetches the data from Wikidata and actually has to deal with the structure there, which includes a mixture of 'author' and 'author name string' fields for author information. Well done!
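The two replace calls in the snippet above look like the hack in question: the formatted output arrives with its angle brackets escaped, and they are turned back into real markup. A minimal standalone sketch of that step follows; the function name and the sample string are made up for illustration and are not part of citation.js.

```javascript
// Hypothetical helper; the two regular expressions are the ones from the snippet above.
function unescapeAngleBrackets(escapedHtml) {
  return escapedHtml
    .replace(/&(lt|#60);/g, "<")  // &lt; or &#60; becomes <
    .replace(/&(gt|#62);/g, ">"); // &gt; or &#62; becomes >
}

// Made-up sample input, only to show the effect.
const escaped = "&lt;li&gt;Some title. &#60;i&#62;Some Journal&#60;/i&#62;&lt;/li&gt;";
console.log(unescapeAngleBrackets(escaped));
// prints: <li>Some title. <i>Some Journal</i></li>
```

Note that this only restores angle brackets; other escaped characters, such as ampersands, are left untouched.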

If you like this, make sure to check out WikiCite, OpenCitations, and Scholia, projects that enabled and triggered some of the ideas behind the above citation.js use!

"10 everyday things on the web the EU Commission wants to make illegal" #04

The fourth example is harder than the third and I hope I translated Julia Reda's example well. The starting point is simple enough: bookmarking things where an image is used. However, I am less sure to what extent we use this in online science.

04. Pinning a photo to an online shopping list

Well, you can see how much trouble I had finding a good equivalent here. So, what is a science shopping list? The above example shows a Google+ post by Björn Brembs. Now, G+ is not really a shopping list, but then again, literature is what researchers buy. Literally. We pay millions and millions for it. Second, we do have dedicated shopping lists for these products, but they do not always support images. Of course, these shopping lists are our CiteULike, Mendeley, ResearchGate, etc. accounts.

A second limitation of this example is that we would not consider most of our literature to be of a journalistic nature. Hence the above example. Blogs are typically a mixture of science writing and a kind of journalism. It's a grey area. Now, under the new laws, Björn would have to ask my permission and, worse, G+ would need to install a monitoring system to see whether Björn got a proper license, so as not to break my copyright.

So, back to the likes of ResearchGate and ScienceOpen. With the current proposal, any system of this kind with some commercial model in mind (both are set up by SMEs) will have to install this monitoring system (after all, we also happily bookmark Nature News articles). The cost of that investment will have to come from somewhere, so this has an enormous impact on their sustainability.

Even worse, in the wording of the proposal I have seen so far, and to the extent I understand Julia's worries, no limitations are set on this; there are few or no words on allowed behavior. So, what about dissemination systems in general? I think later examples (we still have six to go!) will shed more light on that.

(And make sure to read the original article by Julia Reda!)

Monday, January 02, 2017

EPA CompTox Dashboard IDs in Wikidata

After Antony Williams left the ChemSpider team, he moved on to the EPA. Since then, he has set up the EPA CompTox Dashboard (see also doi:10.1007/s00216-016-0139-z [€]). And in August he was kind enough to upload mappings between InChIKeys (doi:10.1186/s13321-015-0068-4) and their identifiers on Figshare (doi:10.6084/m9.figshare.3578313.v1) as a tab-separated values (TSV) file. Because this database is of interest to our pathway and systems biology work, I realized I wanted ID-ID mappings in our BridgeDb identifier mapping files (doi:10.1186/1471-2105-11-5). As I wrote earlier, I have adopted Wikidata (doi:10.3897/rio.1.e7573) as a data source. So, entering these new identifiers in Wikidata is helpful.

Somewhere in the past few months I proposed the needed Wikidata property, P3117 ("DSSTOX substance identifier"), which was approved some time later. For entering the mappings, I opted to write a Bioclipse script (doi:10.1186/1471-2105-10-397) that uses the Wikidata SPARQL endpoint to get about 150 thousand Wikidata item identifiers (Q-codes) and their InChIKeys. It then parses the lines in the TSV file from Figshare and creates input for Wikidata for each match, based on exact InChIKey string equivalence.

This output is formatted as QuickStatements instructions, a great tool set up by Magnus Manske. Each line looks like (here for N6-methyl-deoxy-adenosine-5'-monophosphate, aka Q27456455):

Q27456455 P3117 "DTXSID30678817" S248 Q28061352
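The matching step and the line format above can be sketched in a few lines of JavaScript. This is not the Bioclipse script itself, just an illustration under assumptions: the SPARQL results are taken to be already reduced to a Map from InChIKey to Q-code, the TSV is assumed to have a header line and two columns (identifier, then InChIKey), and the InChIKeys in the sample are placeholders. The columns are joined with tabs, which is what I understand QuickStatements to expect.

```javascript
// Assumption: qcodeByInchikey is a Map built from the SPARQL results.
// Assumption: the TSV has a header line and two columns, DSSTox ID then InChIKey;
// the real Figshare file may be laid out differently.
function toQuickStatements(tsvText, qcodeByInchikey, sourceItem) {
  const lines = tsvText.split(/\r?\n/).slice(1); // skip the header line
  const statements = [];
  for (const line of lines) {
    if (line.trim() === "") continue; // ignore empty lines
    const [dsstoxId, inchikey] = line.split("\t").map((s) => s.trim());
    const qcode = qcodeByInchikey.get(inchikey); // exact string equivalence only
    if (qcode !== undefined) {
      statements.push(`${qcode}\tP3117\t"${dsstoxId}"\tS248\t${sourceItem}`);
    }
  }
  return statements;
}

// Placeholder InChIKeys; the Q-codes and the first identifier are the ones from the text.
const knownItems = new Map([["PLACEHOLDERKEY-ONE", "Q27456455"]]);
const sampleTsv = [
  "dsstox_id\tinchikey",
  "DTXSID30678817\tPLACEHOLDERKEY-ONE",
  "DTXSID00000000\tPLACEHOLDERKEY-TWO", // no matching item, so no statement
].join("\n");

console.log(toQuickStatements(sampleTsv, knownItems, "Q28061352").join("\n"));
```

Only the first sample row yields a statement, because the second InChIKey has no Wikidata item in the map.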

The P248 ("stated in") property is used to attach the source as a reference (hence: S248); it points to the Q28061352 item, which is the Figshare entry for Tony's mapping data. The result in this Wikidata item looks like:

I entered about 36 thousand such statements into Wikidata. Thus, the yield is about 5%, taking the CompTox Dashboard with its roughly 720 thousand identifiers as the starting point. From a Wikidata perspective, the yield is higher: there are about 150 thousand items with an InChIKey, so 24% could be mapped.
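For those who want to check the two yields, here is the arithmetic as a tiny sketch, using only the rounded counts given above (the variable and function names are mine):

```javascript
// Rounded counts as given in the text.
const statementsEntered = 36000;
const dashboardIdentifiers = 720000;
const wikidataItemsWithInchikey = 150000;

// Share of `whole` covered by `part`, as a percentage.
function percentage(part, whole) {
  return (100 * part) / whole;
}

console.log(percentage(statementsEntered, dashboardIdentifiers));      // 5
console.log(percentage(statementsEntered, wikidataItemsWithInchikey)); // 24
```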

Based on properties of the property, Wikidata does some automatic validation. For example, it is specified that any Wikidata item can only have one DSSTOX substance identifier, because it can only have one InChIKey too. Similarly, there cannot be two Wikidata items with the same DSSTOX identifier. Still, because of how Wikidata works, isolated exceptions can occur. With fewer than 25 constraint violations, the quality of the process turned out pretty high (>99.9%).
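The two rules described here correspond to what Wikidata calls a single value constraint and a distinct values constraint. Wikidata runs these checks itself; the sketch below only illustrates the logic on a handful of made-up pairs, with hypothetical item names and identifiers.

```javascript
// Each pair is [Wikidata item, DSSTox identifier]; all values below are made up.
function findConstraintViolations(pairs) {
  const idsByItem = new Map();
  const itemsById = new Map();
  for (const [item, id] of pairs) {
    if (!idsByItem.has(item)) idsByItem.set(item, new Set());
    idsByItem.get(item).add(id);
    if (!itemsById.has(id)) itemsById.set(id, new Set());
    itemsById.get(id).add(item);
  }
  // single value: an item should carry at most one identifier
  const singleValue = [...idsByItem]
    .filter(([, ids]) => ids.size > 1)
    .map(([item]) => item);
  // distinct values: an identifier should appear on at most one item
  const distinctValues = [...itemsById]
    .filter(([, items]) => items.size > 1)
    .map(([id]) => id);
  return { singleValue, distinctValues };
}

const report = findConstraintViolations([
  ["item-1", "DTXSID-A"],
  ["item-1", "DTXSID-B"], // item-1 has two identifiers
  ["item-2", "DTXSID-C"],
  ["item-3", "DTXSID-C"], // DTXSID-C is used on two items
]);
console.log(report);
```

As for the quality figure: 25 violations on roughly 36 thousand statements is about 0.07%, which is where the >99.9% comes from.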

Some of the issues have been manually inspected. Causes vary. One issue was that the Wikidata item in fact had more than one InChIKey. A possible reason for that is that the item does not distinguish between various forms of a compound. Two Wikidata items have been split up accordingly. Other problems are due to features of the CompTox Dashboard, and some issues have been tweeted to the Dashboard team.

This mashup of these two resources, as anticipated in our H2020 proposal (doi:10.3897/rio.1.e7573), makes it easy to make slices of data. For example, we can query for experimental data for compounds in the EPA CompTox Dashboard with a SPARQL query, like this one for the dipole moment:

Importantly, this query shows the source where this data comes from, one of the advantages of Wikidata.
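A query of the kind described could look roughly like the sketch below. It is not the query from the original post. P3117 and P248 are the properties discussed above; that P2201 is the Wikidata property for the electric dipole moment is my assumption, so please check it before use. The code only builds the request URL for the public Wikidata query service and does not send it.

```javascript
// Assumption: P2201 is the Wikidata property for the electric dipole moment.
// P3117 (DSSTOX substance identifier) and P248 (stated in) come from the text.
const dipoleQuery = `
SELECT ?compound ?dsstox ?dipole ?source WHERE {
  ?compound wdt:P3117 ?dsstox ;
            p:P2201 ?statement .
  ?statement ps:P2201 ?dipole .
  OPTIONAL { ?statement prov:wasDerivedFrom/pr:P248 ?source . }
}`;

// The query service takes the query as a URL parameter.
const endpoint = "https://query.wikidata.org/sparql";
const url = `${endpoint}?format=json&query=${encodeURIComponent(dipoleQuery)}`;
console.log(url);
```

The OPTIONAL part walks from the statement to its reference and picks up the "stated in" source, which is how the provenance ends up in the results.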