Saturday, November 17, 2018

Join me in encouraging the ACS to join the Initiative for Open Citations

My research is into abstract representation of chemical information, important for other research to be performed. Indeed, my work is generally reused, but knowing which research fields my work is used in, or which societal problems it is helping solve, is not easily retrieved or determined. Efforts like WikiCite and Scholia do allow me to navigate the citation network, so that I can determine which research fields my output influences and which diseases are studied with methods I proposed. Here's a network of topics of articles citing my work:

Graphs like this show how people are using my work, which in turn allows me to further support them. But this relies on open citations.

In my opinion, citations are an essential part of our research process. They give us access to important prior work on which a study is based, and reflect how a work influences other research or even is essential to that other work. For example, they allow us to not repeat earlier published work, while preserving the ability to reproduce the full work. The Initiative for Open Citations encourages these citations to be publicly available to benefit research, by removing barriers to access this critical part of scholarly communication. While many societies and publishers have joined this initiative, the American Chemical Society (ACS) has not yet. By not joining, they limit the sharing of knowledge for unclear reasons.

And I would really like to see the ACS join this initiative, and have proposed this a few times already. Because they still have not joined the initiative, I have started this petition. If you agree, please sign and share it with others.

New paper: "Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform"

Figure from the article showing the interactive
Open PHACTS documentation to access
Ryan, PhD candidate in our group, is studying how to represent and use interaction information in pathway databases, and WikiPathways specifically. His paper Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform (doi:10.12688/f1000research.13197.2) was recently accepted in F1000Research, which extends on work started by, among others, Andra (see doi:10.1371/journal.pcbi.1004989).

The paper describes the application programming interface (API) methods of the Open PHACTS REST API for accessing interaction information, e.g. to learn which genes are upstream or downstream in the pathway. This information can be used in pharmacological research. The paper discusses example queries and demonstrates how the API methods can be called from HTML+JavaScript and Python.

Sunday, November 04, 2018

Programming in the Life Sciences #23: research output for the future

A random public domain
picture with 10 in it.
Ensuring that you and others can understand your research output five years from now requires effort. This is why scholars tend to keep lab notebooks. The computational age has perhaps made us a bit lazy here, but we still make an effort. A series of Ten Simple Rules articles outlines some of the things to think about:
  1. Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, Crosas M, et al. Ten Simple Rules for the Care and Feeding of Scientific Data. Bourne PE, editor. PLoS Computational Biology. 2014 Apr 24;10(4):e1003542.
  2. List M, Ebert P, Albrecht F. Ten Simple Rules for Developing Usable Software in Computational Biology. Markel S, editor. PLOS Computational Biology. 2017 Jan 5;13(1):e1005265.
  3. Perez-Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost F da V, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. Markel S, editor. PLOS Computational Biology. 2016 Jul 14;12(7):e1004947.
  4. Prlić A, Procter JB. Ten Simple Rules for the Open Development of Scientific Software. PLoS Computational Biology. 2012 Dec 6;8(12):e1002802.
  5. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for Reproducible Computational Research. Bourne PE, editor. PLoS Computational Biology. 2013 Oct 24;9(10):e1003285.
Regarding licensing, I can highly recommend reading this book:
  1. Rosen L. Open Source Licensing [Internet]. 2004. Available from:
Regarding Git, I recommend these two resources:
  1. Wiegley J. Git From the Bottom Up [Internet]. 2017. Available from:
  2. Task 1: How to set up a repository on GitHub [Internet]. 2018. Available from:

Saturday, November 03, 2018

Fwd: "We challenge you to reuse Additional Files (a.k.a. Supplementary Information)"

Download statistics of J. Cheminform.
Additional Files show a clear growth.
Posted on the BMC (formerly BioMedCentral) Research in progress blog our challenge to you to reuse additional files:
    Since our open-access portfolio in BMC and SpringerOpen started collaborating with Figshare, Additional Files and Supplementary Information have been deposited in journal-specific Figshare repositories, and files available for the Journal of Cheminformatics alone have been viewed more than ten thousand times. Yet what is the best way to make the most of this data and reuse the files? Journal of Cheminformatics challenges you to think about just that with their new upcoming special issue.
We already know you are downloading the data frequently and more every year, so let us know what you're doing with that data!

For example, I would love to see more data from these additional files end up in databases, such as Wikidata, but any reuse in RDF form would interest me.

Tuesday, October 30, 2018

Some steps needed in knowledge dissemination

Last week I had holiday from my BiGCaT position and visited Samuel Winthrop at the SpringerNature offices to discuss our Journal of Cheminformatics. It was a great meeting (1.5 days) and we discussed a lot of things we could do (or are in the process of doing) to improve using the journal format for knowledge dissemination. We have some interesting things lined up ... </suspense>

For now, check out my personal views in these slides I presented last week:

Thursday, October 11, 2018

Two presentations at WikiPathways 2018 Summit #WP18Summit

Found my way back to my room a few kilometers from the San Francisco city center, after a third day at the WikiPathways 2018 Summit at the Gladstone Institutes in Mission Bay, celebrating 10 years of the project, which I only joined some six and a half years ago.

The Summit was awesome and the whole trip was awesome. The flight was long, with a stop in Seattle. I always get a bit nervous of lay-overs (having missed my plane twice before...), but a stop in Seattle is interesting, with a great view of Mt. Rainier, which is also from an airplane quite a sight. Alex picked us up from the airport and the Airbnb is great (HT to Annie for being a great host), from which we can even see the Golden Gate Bridge.

The Sunday was surreal. With some 27 degrees Celsius the choice to visit the beach and stand, for the first time, in the Pacific was great. I had the great pleasure to meet Dario and his family and played volleyball at a beach for the first time in some 28 years. Apparently, there was an airshow nearby and several shows were visible from our site, including a very long show by the Blue Angels.
Thanks for a great afternoon!

Sunday evening Adam hosted us for a WikiPathways team dinner. His place gave a great view on San Francisco, the Bay Bridge, etc. Because Chris was paying attention, we actually got to see the SpaceX rocket launch (no, my photo is not so impressive :). Well, I cannot express in words how cool that is, to see a rocket escape the Earth's gravity with your own eyes.

And the Summit had not even started yet.

I will have quite a lot to write up about the meeting itself. It was a great line up of speakers, great workshops, awesome discussions, and a high density of very knowledgeable people. I think we need 5M to implement just the ideas that came up in the past three days. And it would be well invested. Anyway, more about that later. Make sure to keep an eye on the GitHub repo for WikiPathways.

That leaves me only, right now, to return to the title of this post. And below they are, my two contributions to this summit:

Saturday, September 29, 2018

Two presentations of last week: NanoTox 2018 and the BeNeLuX Metabolomics Days

Slide from the BeNeLux Metabolomics Days
presentation (see below).
The other week I gave two presentations, one at the BeNeLux Metabolomics Days in Rotterdam and the next day one at NanoTox 2018 in Neuss, Germany. During the first I spoke about research ongoing in our research group and in Neuss about the eNanoMapper project and some of the ongoing eNanoMapper projects I am involved in.

Here are the slides of both talks.

Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just a bit of cleaning I scripted today for a number of toxicology end points in a database published some time ago in the zero-APC Open Access (CC-BY) journal Beilstein Journal of Nanotechnology, NanoE-Tox (doi:10.3762/bjnano.6.183).

The curation I am doing is to redistribute the data in the eNanoMapper database (see doi:10.3762/bjnano.6.165) and thus with ontology annotation (see doi:10.1186/s13326-015-0005-5):

  recognizedToxicities = [
    "EC10": "",
    "EC20": "",
    "EC25": "",
    "EC30": "",
    "EC50": "",
    "EC80": "",
    "EC90": "",
    "IC50": "",
    "LC50": "",
    "MIC":  "",
    "NOEC": "",
    "NOEL": ""
  ]

With 402(!) variants left. Many do not have an ontology term yet, and I filed a feature request.
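
The bookkeeping behind such a count can be sketched in a few lines of Python. This is a minimal sketch, not the script used for the curation: the function name unrecognized_variants and the sample labels in the usage note are hypothetical, and only the set of recognized end points is copied from the map above.

```python
from collections import Counter

# Keys copied from the recognizedToxicities map above.
RECOGNIZED_TOXICITIES = {
    "EC10", "EC20", "EC25", "EC30", "EC50", "EC80", "EC90",
    "IC50", "LC50", "MIC", "NOEC", "NOEL",
}


def unrecognized_variants(labels):
    """Count the labels that are not recognized, most frequent first."""
    leftovers = Counter(
        label.strip()
        for label in labels
        if label.strip() not in RECOGNIZED_TOXICITIES
    )
    return leftovers.most_common()
```

Calling unrecognized_variants(["EC50", "EC50 (48 h)", "LC50", "EC50 (48 h)"]) with these made-up labels reports "EC50 (48 h)" twice, which is the kind of list that helps decide which variants to request an ontology term for first.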


  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg",
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g",
    "µg/g": "µg/g"
  ]

Oh, and don't get me started on actual values, with endpoint values, as ranges, errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole really hard to process. I now have something like:

  prop = prop.replace(",", ".")  // decimal comma to decimal point
  if (prop.substring(1).contains("-")) {
      // a dash after the first character: a range, not a negative number
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
      store, endpointIRI, "${ssoNS}has-unit", units
  } else if (prop.contains("±")) {
      // a value with an error, kept as a string for now
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
      store, endpointIRI, "${ssoNS}has-unit", units
  } else if (prop.contains("<")) {
      // upper bounds are not stored yet
  } else {
      store, endpointIRI, "${ssoNS}has-value", prop,
      store, endpointIRI, "${ssoNS}has-unit", units
  }
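
The branching itself is easy to test in isolation. Here is a minimal Python sketch of the same four cases, under the assumption that splitting a range or an error off into separate strings is wanted; the function name classify_endpoint_value and the returned labels are hypothetical, and nothing is written to an RDF store here.

```python
def classify_endpoint_value(prop):
    """Classify a raw end point value string into one of four cases."""
    prop = prop.replace(",", ".")  # decimal comma to decimal point
    if "-" in prop[1:]:
        # A dash after the first character: treat it as a range.
        split_at = prop.index("-", 1)
        return ("range", (prop[:split_at].strip(), prop[split_at + 1:].strip()))
    if "±" in prop:
        value, error = prop.split("±", 1)
        return ("value_with_error", (value.strip(), error.strip()))
    if "<" in prop:
        return ("upper_bound", prop.replace("<", "").strip())
    return ("value", prop.strip())
```

Note that this simple rule would also call a value in scientific notation with a negative exponent a range, so real data needs more care than this sketch takes.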

But let me make clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change format), and redistribute it. Thanks to the authors. Data curation is expensive, whether I do it, or if the authors of the database did. They already did a lot of data curation. But only because of Open licenses, we only have to do this once.

Saturday, September 15, 2018

Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers from Wikidata. Let's say, I want to list all Greek letters, with in one column the lower case and in the other the upper case letter. This is what our data looks like:

So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}
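
For those who prefer a script over the web interface, sending this query to the Wikidata Query Service from Python could look roughly like the sketch below. The helper names build_request_url and run_query and the User-Agent string are assumptions for illustration; only the public endpoint URL and the query are taken as given, and run_query needs network access.

```python
import json
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}
"""


def build_request_url(query, endpoint=WDQS_ENDPOINT):
    """Encode the query as a GET request that asks for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"


def run_query(query):
    """Send the query and return the list of result bindings."""
    request = urllib.request.Request(
        build_request_url(query),
        headers={"User-Agent": "greek-letters-example/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Each binding returned by run_query(QUERY) is a dictionary keyed by the variable names, here letter and unicode.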

Ah, that gets us somewhere:

But you see that the upper and lower case are still in separate rows, rather than columns. To fix that, we need access to those qualifiers. It's all in there in the Wikidata RDF, but the model is giving people a headache (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to that than just keeping notebooks, tho).

So, a lot of important information is put in qualifiers, and not just the statements. Let's first get all statements for a Greek letter. We would do that with:

?letter ?propp ?statement .

One thing we want to know about the property we're looking at, is the entity linked to that. We do that by adding this bit:

?property wikibase:claim ?propp .

Of course, the property we are interested in is the Unicode character, so we can put that directly in:

wd:P487 wikibase:claim ?propp .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want any qualifier but the 'applies to part' one, we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in lower case and upper case, and we can put that in as well (for upper case):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, putting this together, here for lower case (wd:Q8185162), we now get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?propp ?statement .
  wd:P487 wikibase:claim ?propp .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see in the above example that we get the unicode character differently from the statement. This needs to be integrated, and we need the wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?propp ?statement .
  wd:P487 wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q98912: upper case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the variables of ?letter and ?unicode. But I left the wdt:P361 wd:Q8216 (='part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?propp ?statement .
  ?statementProperty wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .
  ?qualifierProperty wikibase:qualifier ?qualifier .
}

So, there is our recipe, for everyone to copy/paste.
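
If you would rather fill in the three parameters from a script, a minimal Python sketch could look like this. The function name fill_template, its argument names, and the identifier checks are assumptions for illustration; the query text follows the template above, with one variable shared between the statement pattern and the wikibase:claim pattern.

```python
import re

TEMPLATE = """SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {{
  {restriction}
  VALUES ?qualifierObject {{ wd:{qualifier_object} }}
  VALUES ?qualifierProperty {{ wd:{qualifier_property} }}
  VALUES ?statementProperty {{ wd:{statement_property} }}

  ?entityOfInterest ?propp ?statement .
  ?statementProperty wikibase:claim ?propp ;
                     wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .
  ?qualifierProperty wikibase:qualifier ?qualifier .
}}"""

GREEK_ALPHABET = "?entityOfInterest wdt:P361 wd:Q8216 ."


def fill_template(statement_property, qualifier_property, qualifier_object,
                  restriction=GREEK_ALPHABET):
    """Return the qualifier query for the given property and item identifiers."""
    for prop in (statement_property, qualifier_property):
        if not re.fullmatch(r"P\d+", prop):
            raise ValueError(f"not a property identifier: {prop!r}")
    if not re.fullmatch(r"Q\d+", qualifier_object):
        raise ValueError(f"not an item identifier: {qualifier_object!r}")
    return TEMPLATE.format(
        restriction=restriction,
        statement_property=statement_property,
        qualifier_property=qualifier_property,
        qualifier_object=qualifier_object,
    )
```

For example, fill_template("P487", "P518", "Q8185162") gives the lower case query for the Greek alphabet. The restriction argument is kept because, as noted above, leaving it out makes the query likely to time out.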

Completing the Greek alphabet example
OK, now since I actually started with the upper and lower case Unicode character for Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?propp ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?propp2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix. Some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires a GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?propp ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?propp2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest
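
What GROUP BY and GROUP_CONCAT(DISTINCT ...) do to the rows can be mimicked in Python. This is only an illustration of the grouping step: the function name group_concat is hypothetical, the rows in the usage note are hand-written and not output of the query, and unlike SPARQL, which does not promise an order inside the concatenation, this sketch keeps the first-seen order.

```python
from collections import defaultdict


def group_concat(rows, separator=", "):
    """Merge (entity, lowerCase, upperCase) rows into one row per entity."""
    lower = defaultdict(list)
    upper = defaultdict(list)
    for entity, lower_case, upper_case in rows:
        # DISTINCT: only keep a value the first time it is seen.
        if lower_case not in lower[entity]:
            lower[entity].append(lower_case)
        if upper_case not in upper[entity]:
            upper[entity].append(upper_case)
    return {
        entity: (separator.join(lower[entity]), separator.join(upper[entity]))
        for entity in lower
    }
```

With the hand-written rows ("theta", "θ", "Θ") and ("theta", "θ", "ϴ"), the two rows for theta collapse into a single row with one lower case value and both upper case values joined by the separator.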

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, we only have a human readable name. The frequently used SERVICE wikibase:label does a pretty decent job and we end up with this table:

Sunday, September 09, 2018

cOAlition S with their Plan S

Screenshot of the Scholia page for Plan S.
Last Monday the bomb dropped: eleven European funders (with likely more to follow) indicate that they are not going to support journals that are not fully Open Access, i.e. fully or partially paywalled journals: cOAlition S announced Plan S.

There is a lot of news about this and a lot of discussion: many agree that it is at least an interesting step. Some have argued that the plan advocates a commercial publishing model and that it accepts glossy journals. Indeed, it does not specifically address those points, but other news also suggests that Plan S is not the only step funders are undertaking: for example, the Dutch NWO is also putting serious effort in fighting the use of the flawed impact factor.

One thing I particularly like about this Plan S is that it counters joint efforts from our universities (via the VSNU) with their big package deals that currently favor hybrid journals over full Open Access journals. That is, I can publish my cheminformatics under a Creative Commons license for free in the hybrid JCIM, whereas I do not get similar funding for the full Open Access JCheminform.

Another aspect I like is that it insists on the three core values of Open Science, the rights to:

  1. reuse,
  2. modify, and
  3. share.
I cannot stress this enough. Only with these core values, we can build on earlier knowledge and earlier research. It is worth reading all ten principles.

Keeping up to date
We will see a lot of analyses of what will happen now. Things will have to further unfold. We will see that other funders will join, and we have seen that some funders did not join yet, because they were unsure if they could make the time line (like the Swedish VR). There are a few ways to keep updated. First, you can use the RSS feed of Scholia for both the Plan S and the cOAlition S (see the above screenshot). But this is mostly material with a DOI and not general news. Second, you could follow the oa.plan_s and oa.coalitions tags of the Open Access Tracking Project.

Bianca Kramer has used public data to make an initial assessment of the impact of the plan (full data):
Screenshot of the graph showing contributions of license types to literature for ten of the eleven
research funders. Plan S ensures that these bars become 100% pure gold (yellow).
It was noted (cannot find the tweet, right now...) that the amount of literature based on funding from cOAlition S is only 1-2% of all European output. That's not a lot, but keep in mind: 1. more funders will join, 2. a 1-2% extra pressure will make shareholders think, 3. the Plan S stops favoring hybrid journals over Open Access journals, and 4. the percentage extra submissions to full Open Access journals will be significantly higher compared to their current count.

Saturday, September 08, 2018

Also new this week: "Google Dataset Search"

There was a lot of Open Science news this week. The announcement of the Google Dataset Search was one of them:

 Of course, I first tried searching for "RDF chemistry" which shows some of my data sets (and a lot more):

It picks up data from many sources, such as Figshare in this image. That means it also works (well, sort of, as Noel O'Boyle noticed) for supplementary information from the Journal of Cheminformatics.

It picks up metadata in several ways. So, next week we'll see if we can get eNanoMapper extended to spit out compatible JSON-LD for its data sets, called "bundles".

Integrated with Google Scholar?
While the URL for the search engine does not suggest the service is more than a 20% project, we can hope it will stay around like Google Scholar has. But I do hope they will further integrate it with Scholar. For example, in the above figure, while it did pick up that I am the author of that data set (well, repurposed from an effort of Rich Apodaca), it did not figure out that I am also on Scholar.

So, these data sets do not show up in your Google Scholar profile yet, but they must. Time will tell where this data search engine is going. There are many interesting features, and given the amount of online attention, they won't stop development just yet, and I expect to discover more and better features in the next months. Give it a spin!

Mastodon: somewhere between Twitter and FriendFeed

Now I forgot who told me about it (sorry!), but I started looking at Mastodon last week. Mastodon is like Twitter or Whatsapp, but then distributed and federated. And that has advantages: no vendor lock-in, servers for specific communities. In that sense, it's more like email, with not one point of failure, but potentially many points of failure.

But Mastodon is also very well done. In the past week I set up two accounts, one on a general server and one on a community server (more about that in a second). I am still learning, but want to share some observations.

First, the platform is not unique and there are other (maybe better) distributed and federated software solutions, but Mastodon is slick. This is what my profile page looks like, but you can change the theme if you like. So far, pretty standard:

My profile page.
Multiple accounts
While I am still exploring this bit, you can have multiple accounts. I am not entirely sure yet how to link them up, but currently they follow each other. My second account is on a server aimed at scholars and stuff that scholars talk about. This distributed feature is advertised as follows: sometimes you want to talk science, sometimes you want to talk movies. The last I would do on my account and the science on my account.

However, essential is that you can follow anyone on any server: you do not have to be on to follow my toots there. (Of course, you can also simply check the profile page, and you can read my public toots without any Mastodon account.)

This topic server is a really exciting idea. This provides an alternative to mailing lists or slack rooms. And each server decides on their own community norms, and any account can be blocked for violating those community norms. No dependency on Twitter or Facebook to decide what is right or wrong, the community can do that themselves.

BioMedCentral could host one server for each journal... now that's an idea :)

Controlling what you see
OK, let's look at a single toot, here about a recent JACS paper (doi:10.1021/jacs.8b03913):

Each toot (like tweet) has replies, boosts (like retweets), and favorites (like likes). Now, I currently follow this anonymous account. You can do the normal things, like follow, send direct messages, mute, and block accounts:

You can follow toots from just the people you follow, but also follow all toots on that particular server (which makes sense if you have a server about a very specific topic), or toots on all federated servers (unwise).

The intention of Mastodon is to give the users a lot of control. You should not expect non-linear timelines or promoted toots. If that is your thing, better stay on Twitter. An example of the level of control is what options it offers me for my "Notifications" timeline:

Other stuff I like
Some random things that I noticed: there is more room for detail and you have 500 chars, URLs are not shortened (each URL counts as 20 chars), animated GIFs are animated when I hover over them. Cool, no need to pause them! You cannot edit toots, but at least I found a "Delete and redraft" option. Mastodon has solutions for hiding sensitive material. That causes part of the toot to be hidden by default. This can be used to hide content that may upset people, like medical images of intestines :) The CW is short for Content Warning and is used for that.

There is a lot more, but I'm running out of time for writing this blog post. Check out this useful An Increasingly Less-Brief Guide to Mastodon.

So, who to follow?
Well, I found two options. One is, use Wikidata, where you can search for authors with one or more Mastodon accounts. For example, try this query to find accounts for Journal of Cheminformatics authors:

Yes, that list is not very long yet:

But given the design and implementation of Mastodon, this could change quickly.

Well, some of you are old enough to remember FriendFeed. The default interface is different, but if I open a single toot in a separate page, it does remind me a lot of FriendFeed, and I am wondering if Mastodon can be that FriendFeed replacement we have long waited for! What do you think?

Saturday, September 01, 2018

Biological stories: give me a reason to make time for interactive biological pathways

I am overbooked. There are a lot of things I want to do that (I believe) will make science better. But I don't have the time, nor enough funding to hire enough people to help me out (funders: hint, hint). Some things are not linked to project deliverables, and are things I have to do in my free time. And I don't have much of that left, really, but hacking on scientific knowledge relaxes me and gives me energy.

Interactive biological stories
Inspired by Proteopedia (something you must check out, if you do not already know it), Jacob Windsor did an internship to bring this idea to WikiPathways: check his Google Summer of Code project. I want to actually put this to use, but, as outlined above, not being part of paid deliverables, it's hard to find time for it.

Thus, I need a reason. And one reason could be: impact. After all, if my work has impact on the scientific community, that helps me keep doing my research. So, when eLife tweeted an interesting study on digoxin, I realized that if enough people would be interested in such a high profile story, an interactive pathway may have enough impact for me to free up time. Hence, I asked on Twitter:

But as you can see, there are just not enough RTs yet. So, because I really want something fun to do, please do give me no excuse to not do this (but write grant proposals instead), and retweet this tweet. Thanks!

Saturday, August 18, 2018

Compound (class) identifiers in Wikidata

Bar chart showing the number of compounds
with a particular chemical identifier.
I think Wikidata is a groundbreaking project, which will have a major impact on science. One of the reasons is the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase including a cool SPARQL endpoint, is easily done with Docker.

Wikidata has many sub projects, such as WikiCite, which captures the collective of primary literature. Another one is the WikiProject Chemistry. The two nicely match up, I think, making a public database linking chemicals to literature (tho, very much needs to be done here), see my recent ICCS 2018 poster (doi:10.6084/m9.figshare.6356027.v1, paper pending).

But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for our metabolism pathway research. The mappings, as you may know, are used in the latter via BridgeDb and we have been using Wikidata as one of three sources for some time now (the others being HMDB and ChEBI). WikiProject Chemistry has a related ChemID effort, and while the wiki page does not show much recent activity, there is actually a lot of ongoing effort (see plot). And I've been adding my bits.

Limitations of the links
But not each identifier in Wikidata has the same meaning. While they are all classified as 'external-id', the actual link may have different meaning. This, of course, is the essence of scientific lenses, see this post and the papers cited therein. One reason here is the difference in what entries in the various databases mean.

Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes. And these are differently modeled. Furthermore, it has a model that formalizes that things with a different InChI are different, but even allows things with the same InChI to be different, if need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such, it is a powerful system to handle identifier mappings, because databases are not clear, and chemistry and biology in data are even less so: we measure experimentally a characterization of chemicals, but what we put in databases and give names are specific models (often chemical graphs).

That model differs from what other (chemical) databases use, or seem to use, because not always do databases indicate what they actually have in a record. But I think this is a fair guess.

ChEBI (and the matching ChEBI ID) has entries for chemical classes (e.g. fatty acid) and specific compounds (e.g. acetate).

PubChem, ChemSpider, UniChem
These three resources use the InChI as central asset. While they do not really have the concept of compound classes so much (though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each one has their own way to link to other databases themselves, which normally includes tons of structure normalization (see e.g. doi:10.1186/s13321-018-0293-8 and doi:10.1186/s13321-015-0072-8).

HMDB (and the matching P2057) has a biological perspective; the entries reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral forms of compounds. This makes linking to/from other databases where the compound is not neutral chemically less precise.

CAS registry numbers
CAS (and the matching P231) is pretty unique itself, and has identifiers for substances (see Q79529), much more than chemical compounds, and comes with its own set of unique features. For example, solutions of some compound, by design, have the same identifier. Previously, formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.
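The formaldehyde/formalin case can be made concrete with a small check for identifiers that are shared by more than one item. This is only a minimal sketch: the records are hand-written toy data, not pulled from Wikidata, and the function name `shared_identifiers` is an assumption made for illustration.

```python
from collections import defaultdict

# Toy records (item label, CAS registry number), written by hand for illustration.
RECORDS = [
    ("formaldehyde", "50-00-0"),
    ("formalin", "50-00-0"),
    ("acetic acid", "64-19-7"),
]


def shared_identifiers(records):
    """Return {identifier: sorted item labels} for identifiers used by more than one item."""
    items_by_id = defaultdict(set)
    for item, identifier in records:
        items_by_id[identifier].add(item)
    # Keep only the identifiers that point at two or more distinct items.
    return {
        identifier: sorted(items)
        for identifier, items in items_by_id.items()
        if len(items) > 1
    }


if __name__ == "__main__":
    print(shared_identifiers(RECORDS))
```

Identifiers that show up here are exactly the ones where a naive one-to-one mapping would silently merge two different things.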

Limitations of the links #2
Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise as possible. Of course, that may mean we need more steps. We can always simplify at will, but we can never have a computer make the links more complex (well, not without making assumptions, etc).

And that is why Wikidata is so suitable to link all these chemical databases: it can distinguish differences when needed, and make that explicit. It makes mappings between the databases more FAIR.
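As a rough illustration of how such mappings could be requested, the sketch below builds a SPARQL query that pairs two external-id properties on the same Wikidata item, for example P231 (CAS) and P2057 (HMDB) as mentioned above. The helper name `mapping_query` and its parameters are assumptions for this example. The query relies on the `wdt:` prefix that the Wikidata Query Service predefines, and it does nothing to account for the differences in meaning discussed in this post.

```python
import re


def mapping_query(source_property: str, target_property: str, limit: int = 100) -> str:
    """Build a SPARQL query listing items that carry both external identifiers."""
    for pid in (source_property, target_property):
        # Only accept plain Wikidata property IDs such as "P231".
        if not re.fullmatch(r"P[1-9][0-9]*", pid):
            raise ValueError(f"not a Wikidata property ID: {pid!r}")
    return (
        "SELECT ?item ?source ?target WHERE {\n"
        f"  ?item wdt:{source_property} ?source ;\n"
        f"        wdt:{target_property} ?target .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # CAS registry number (P231) to HMDB ID (P2057)
    print(mapping_query("P231", "P2057", limit=10))
```

The printed query can be pasted into the Wikidata Query Service. Each row is then one candidate mapping that still needs to be interpreted with the caveats above in mind.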

Thursday, August 09, 2018

Alternative OpenAPIs around WikiPathways

I blogged in July about something I learned at a great Wikidata/ERC meeting in June: grlc. It's comparable to but different from the Open PHACTS API: it's a lot more general (and works with any SPARQL end point), but also does not have the identifier mapping service (based on BridgeDb) which we need to link the various RDF data sets in Open PHACTS.

Of course, WikiPathways already has an OpenAPI and it's more powerful than what we can do based on just the WikiPathways RDF (for various reasons), but the advantage of grlc is that you can expose any SPARQL query (see the examples on the WikiPathways end point). As explained in July, you only have to set up a magic GitHub repository, and Chris suggested showing how this could be used to mimic some of the existing API methods.

The magic
The magic is defined in this GitHub repository, which currently exposes a single method:

#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ endpoint:
#+ tags:
#+   - Organism list

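PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wp:      <http://vocabularies.wikipathways.org/wp#>

# List the distinct organism names used in WikiPathways
SELECT DISTINCT (str(?label) as ?organism)
WHERE {
    ?concept wp:organism ?taxon ;
      wp:organismName ?label .
}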

The result
I run grlc in the normal way and point it to egonw/wp-rdf-api and the result looks like:

And executing the method in this GUI (click the light blue bar of the method) results in a nice CSV reply:

Of course, because there is SPARQL behind each method, you can make any query you like, creating any OpenAPI methods that fit your data analysis workflow.
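To make the structure of such a query file a bit more tangible, here is a minimal sketch that separates the `#+` decorator lines (which grlc reads as metadata) from the SPARQL body. The function name `split_grlc_query` and the placeholder query in `EXAMPLE` are assumptions for this illustration. This is not how grlc itself parses files, just a way to see the two parts side by side.

```python
# A shortened, hypothetical query file in the same style as the one shown above.
EXAMPLE = """#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ tags:
#+   - Organism list

SELECT DISTINCT ?label WHERE { ?s ?p ?label }
"""


def split_grlc_query(text):
    """Split a query file into its '#+' metadata lines and the SPARQL body."""
    metadata, query = [], []
    for line in text.splitlines():
        if line.startswith("#+"):
            rest = line[2:]
            # Drop the single space after '#+' but keep further indentation,
            # because the metadata is indentation-sensitive.
            metadata.append(rest[1:] if rest.startswith(" ") else rest)
        else:
            query.append(line)
    return metadata, "\n".join(query).strip()


if __name__ == "__main__":
    meta, sparql = split_grlc_query(EXAMPLE)
    print("\n".join(meta))
    print("---")
    print(sparql)
```

Everything in the first part describes the API method (summary, tags, end point), and everything in the second part is the plain SPARQL that gets executed.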

Wednesday, August 08, 2018

Green Open Access: increase your Open Access rate; and why stick with the PDF?

Icon of Unpaywall, a must have
browser extension for the modern
Researchers of my generation (and earlier generations) have articles from the pre-Open Access era. Actually, I have even been tricked into closed access later; with a lot of pressure to publish as much as you can (which some see as a measure of your quality), it's impossible to not make an occasional misstep. But then there is Green Open Access (aka self-archiving), a concept I don't like, but which is useful in those situations. One reason why I do not like it is that there are many shades of green, and, yes, they all hurt: every journal has special rules. Fortunately, the brilliant SHERPA/RoMEO captures this.

Now, the second event that triggered this effort was my recent experience with Markdown (e.g. the eNanoMapper tutorials) and how platforms like GitHub/GitLab have built systems around it to publish this easily.

Why does this matter to me? If I want my work to have impact, I need people to be able to read my work. Open Access is one route. Of course, they can also email me for a copy of the article, but I tend to be busy with getting new grants, supervision, etc. BTW, you can easily calculate your Open Access rate with ImpactStory, something you should try at least once in your life...

Step 1: identify which articles need a green Open Access version
Here, Unpaywall is the right tool, which does a brilliant job at identifying free versions. After all, one of your co-authors may already have self-archived it somewhere. So, yes, I do have a short list, and one of the papers was the second CDK paper (doi:10.2174/138161206777585274). The first CDK article was made CC-BY three years ago, with the ACS AuthorChoice program, but Current Pharmaceutical Design (CPD) does not have that option, as far as I know.
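Besides the browser extension, Unpaywall also offers a REST API that is queried per DOI. The sketch below only builds the request URL and interprets a response that has already been fetched. The helper names and the example email address are assumptions, and the only response field used is `is_oa`.

```python
from urllib.parse import quote

API_ROOT = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the Unpaywall lookup URL for one DOI (the API asks for an email address)."""
    return f"{API_ROOT}{quote(doi, safe='/')}?email={quote(email)}"


def needs_green_version(record):
    """True when the Unpaywall record reports no open access copy."""
    return not record.get("is_oa", False)


if __name__ == "__main__":
    # The second CDK paper; replace the email address with your own.
    print(unpaywall_url("10.2174/138161206777585274", "me@example.org"))
```

Fetching that URL (for example with `urllib.request` or `requests`) returns a JSON document. Passing the decoded dictionary to `needs_green_version` then tells you whether the article belongs on your short list.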

Step 2: check your author rights for green Open Access
The next step is to check SHERPA/RoMEO for your self-archiving rights. This is essential, as this is different for every journal; this is basically business model by obscurity, and without any standardization this is not FAIR in any way. For CPD it reports that I have quite a few rights (more than some bigger journals that still rely on Green to call themselves a "leading open access publisher", but also less than some others):

SHERPA/RoMEO report for CPD.
Many journals do not allow you to self-archive the post-print version. And that sucks, because a preprint is often quite similar, but just not the same deal (which is exactly what closed access publishers want). But being able to post the post-print version is brilliant, because few people actually even kept the last submitted version (again, exactly what closed access publishers want). This report also tells you where you can archive it, and that is not always the same either: it's not uncommon that self-archiving on something like Mendeley or Zotero is not allowed.

Step 3: a post-print version that is not the publisher PDF??
Ah, so you know what version of the article you can archive, and where. But we cannot archive the publisher PDF. So, no downloading of the PDF from the publisher website and putting that online.

Step 4: a custom PDF
Because in this case we are allowed to archive the post-print version, I am allowed to copy/paste the content from the publisher PDF. I can just create a new Word/LibreOffice document with that content, removing the publisher layout and publisher content, and make a new PDF of that. A decent PDF reader allows you to copy/paste large amounts of content in one go, and Linux/Win10 users can use pdfimages to extract the images from the PDF for reuse.
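For those who prefer to script the image extraction, the following is a minimal sketch that calls Poppler's `pdfimages` from Python. The function names and the file names in the usage example are assumptions. The `-png` option asks `pdfimages` to write PNG files instead of its default formats.

```python
import subprocess


def pdfimages_command(pdf_path, prefix):
    """Return the pdfimages call that writes the images as <prefix>-NNN.png."""
    return ["pdfimages", "-png", pdf_path, prefix]


def extract_images(pdf_path, prefix):
    """Run pdfimages; raises CalledProcessError if the tool reports a failure."""
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)


if __name__ == "__main__":
    # Hypothetical file names; pdfimages must be installed and on the PATH.
    print(" ".join(pdfimages_command("article.pdf", "figure")))
```

Calling `extract_images("article.pdf", "figure")` then leaves the extracted images next to the script, ready to be placed in the new document.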

Step 5: why stick with the PDF?
But why would we stick with a PDF? Why not use something more machine readable? Something that supports syntax highlighting, downloading of table content as CSV, etc? And that made me think of my recent experiments with Markdown.

So, I started off with making a Markdown version of the second CDK paper.

In this process, I:

  1. removed hyphenation used to fit words/sentences nicely in PDF columns;
  2. wrapped the code sections for syntax highlighting;
  3. recovered the images with pdfimages;
  4. converted the table content to CSV (and used Markdown Tables Generator to create Markdown content) and added "Download as CSV" links to the table captions;
  5. made the URLs clickable; and,
  6. added ORCID icons for the authors (where known).
Preview of the self-archived post-print of the second CDK article.
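Two of the steps listed above lend themselves to a bit of scripting: removing the hyphenation and turning CSV into a Markdown table. The sketch below is one possible way to do that in Python and is not the workflow actually used for the CDK paper (which relied on the Markdown Tables Generator). The function names are assumptions, and the dehyphenation is deliberately naive: it also joins genuinely hyphenated words that happen to break at a line end, so the result needs proofreading.

```python
import csv
import io
import re


def dehyphenate(text):
    """Join words that were split with a hyphen at the end of a line."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def csv_to_markdown(csv_text):
    """Convert CSV content (first row is the header) into a Markdown table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(dehyphenate("chemo-\ninformatics"))
    print(csv_to_markdown("name,value\nCDK,1\n"))
```

Keeping the CSV file as the source and generating the Markdown table from it means the "Download as CSV" link and the rendered table cannot drift apart.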
Step 6: tweet the free Green Open Access link
Of course, if no one knows about your effort, they cannot find your self-archived version. In due time, Google Scholar may pick it up, but I am not sure yet. Maybe (Bio) will help, but that is something I have yet to explore.

It's important to include the DOI URL in that link, so that the self-archived version will be linked to from services like

Next steps: get Unpaywall to know about your self-archived version
This is something I am actively exploring. When I know the steps to achieve this, I will report on that in this blog.