Saturday, December 30, 2017

Adding SMILES, InChI, etc to Wikidata alkane pages

Ten alkanes in Wikidata. The ones without CAS registry
number previously did not have InChIKey or
PubChem CID. But no more; I added those.
While working on the 'chemical class' aspect for Scholia yesterday I noted that the page for alkanes was quite large, with a list of more than 50 long chain alkanes with pages in the Japanese Wikipedia with no SMILES, InChI, InChIKey, etc.

So, I dug up my Bioclipse scripts to add chemicals to Wikidata starting with a SMILES (btw, the script has significantly evolved since) and extended the query of that Scholia aspect to list just the Wikidata Q-code and name. This script starts with one or more SMILES strings and generates QuickStatements (a must-learner).
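To give an idea of what such generated QuickStatements look like, here is a minimal Python sketch (not the Bioclipse script itself). It assumes the tab-separated QuickStatements format with string values in double quotes, and uses the Wikidata properties P233 (canonical SMILES), P235 (InChIKey), and P662 (PubChem CID). The function name `quickstatements` and the Q-code in the example are hypothetical placeholders.

```python
def quickstatements(qid, smiles=None, inchikey=None, pubchem_cid=None):
    """Return tab-separated QuickStatements lines for one Wikidata item.

    Only the identifiers that are actually given end up as statements.
    """
    # P233 = canonical SMILES, P235 = InChIKey, P662 = PubChem CID
    values = [("P233", smiles), ("P235", inchikey), ("P662", pubchem_cid)]
    return [
        f'{qid}\t{prop}\t"{value}"'
        for prop, value in values
        if value is not None
    ]


# Q00000000 is a placeholder Q-code, not a real item; the SMILES is a C20 alkane
for line in quickstatements("Q00000000", smiles="C" * 20):
    print(line)
```

The resulting lines can be pasted into QuickStatements, which then makes the edits on your behalf.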

Because the Wikidata entries also had the English IUPAC name, I can use that to autogenerate SMILES. Enter the OPSIN (doi:10.1021/ci100384d) plugin for Bioclipse which in combination with the CDK allowed me to create the matching SMILES, InChI, InChIKey, and use the latter to look up the PubChem compound identifier (CID). This is the script I ended up with:

inputFile = "/Wikidata/Alkanes/alkanes.tsv"
// each line holds tab-separated query results: Q-code, name, and (optionally) InChIKey
new File(bioclipse.fullPath(inputFile)).eachLine { line ->
  fields = line.split("\t")
  if (fields[0].startsWith("")) {
    wdid = fields[0].substring("".length())
    name = fields[1]
    if (fields.length > 2) { // skip entities that already have an InChIKey
      inchikey = fields[2]
      // println "Skipping: $wdid $inchikey"
    } else { // ok, consider adding it
      // println "Considering $wdid $name"
      try {
        mol = opsin.parseIUPACName(name) // IUPAC name -> structure
        smiles = cdk.calculateSMILES(mol)
        //println "  SMILES: $smiles"
        println "${smiles}\t${wdid}"
      } catch (Exception error) {
        //println "Could not parse $name with OPSIN: ${error.message}"
      }
    }
  }
}

That way, I ended up with changes like this:

Friday, December 29, 2017

Using Scholia as Open Notebook Science tool to support literature searching

Source: Compound Interest, Andy Brunning.
I have blogged about Scholia and the underlying Wikidata before. Following the example of this WikiProject Zika Corpus I am using Scholia (doi:10.1007/978-3-319-70407-4_36, or in Scholia, of course :) as a tool to support a literature study, to collect articles about a certain topic. Previously I used it to track the publication trail around the Elsevier-SciHub interactions. But when I was linking the Compound Interest infographics for the Advent 2017 series to Wikidata items (aiming to archive them on Zenodo), I ran into the poisonous mistletoe graphic of day 9. This graphic mentions the phoratoxins. Sadly, not too much was recorded about them in Wikidata.

So, I did a quick scan of the literature (about half an hour, using Google Scholar). I ended up with a few articles about the chemistry of these compounds, and as a good open scientist I used Wikidata and Scholia as a notebook:

From these papers I found references to six specific phoratoxins, A-F, for which I subsequently created Wikidata items:

I have a lot to discover about these cyclic peptides and they cannot be found in PubChem or ChemSpider (yet):

The SPARQL I use is as follows and you can run it yourself (note the "edit" link in the left corner of this link):

SELECT ?mol ?molLabel ?InChIKey ?CAS ?ChemSpider ?PubChem_CID WITH {
    SELECT ?mol WHERE {
      ?mol wdt:P31/wdt:P279* wd:Q46995757 .
    } LIMIT 500
  } AS %result
  WHERE {
    INCLUDE %result
    OPTIONAL { ?mol wdt:P235 ?InChIKey }
    OPTIONAL { ?mol wdt:P231 ?CAS }
    OPTIONAL { ?mol wdt:P661 ?ChemSpider }
    OPTIONAL { ?mol wdt:P662 ?PubChem_CID }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  }
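
If you would rather run such a query from a script than from the web interface, the following Python sketch shows one way to build the request URL. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql with its `query` and `format` parameters. The helper name `wdqs_url` is hypothetical, and the query is a shortened variant of the one above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public SPARQL endpoint of the Wikidata Query Service
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def wdqs_url(sparql, fmt="json"):
    """Build a GET URL that asks the endpoint to run the given SPARQL query."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": fmt})


# Shortened variant of the query above: class members and their InChIKeys
query = """SELECT ?mol ?InChIKey WHERE {
  ?mol wdt:P31/wdt:P279* wd:Q46995757 .
  OPTIONAL { ?mol wdt:P235 ?InChIKey }
} LIMIT 500"""

url = wdqs_url(query)
print(url)
```

Fetching that URL (for example with `urllib.request.urlopen`) should return the results as JSON; the sketch stops at building the URL so that it runs without network access.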

And since I had a few other compound classes there, and in our metabolomics research too, of course, I finally hacked up an extension of Scholia for chemical classes (pull request pending). This is what it looks like for fatty acid:

That makes browsing information about chemicals in Wikidata a lot easier and supports our effort to link WikiPathways to Wikidata considerably.

I also used this approach for other topics:

Looking at these pages again, it's great to see the community nature of Wikidata in action. The pages grow in richness over time :)

Wednesday, December 27, 2017

Two Papers about Adverse Outcome Pathways

Slice of Fig. 5 of Penny's paper (see main text).
Over the past year our group got involved in two projects where Adverse Outcome Pathways (AOPs) are used. These AOPs are risk assessment tools, but when linked to biological pathways they get a lot more interesting. The key events (KEs) in the AOPs can be linked to bioassays and thus biological processes. Earlier this year Marvin Martens (follow him on Twitter) started as a PhD candidate to work on EU-ToxRisk and OpenRiskNet on exactly this integration. The first of these two projects was recently described in "Adverse outcome pathways: opportunities, limitations and open questions" (doi:10.1007/s00204-017-2045-3).

But at the end of eNanoMapper we collaborated with Penny Nymark and Roland Grafström on bioassay measurements of biological responses to exposure of nanomaterials. "We" is particularly Freddie and Linda who worked with Penny to develop an approach to link AOPs with biological pathways, and they worked on this lung fibrosis pathway:

Penny's paper that describes this approach and this pathway was also recently published: "A Data Fusion Pipeline for Generating and Enriching Adverse Outcome Pathway Descriptions" (doi:10.1093/toxsci/kfx252).

BTW, this <iframe> embeds this pathway in the page using the JavaScript library pvjs by Anders Riutta from the Gladstone Institutes. You can click the genes to get identifiers for various databases. You can learn on this page how to use this on your webpage or blog.

Sunday, December 17, 2017

Winter solstice challenge #2: first submission

Citation data for articles colored by
availability of a full text, as indicated
in Wikidata. Mind the artifacts.
Reproduce with this query and the
new GraphBuilder functionality.
Bianca Kramer (of 101 Innovations fame) is the first to submit results for the Winter solstice challenge and it's impressive! She has an overall score based on her own publications and the first level citations of 54%!

So, the current rankings in the challenge are as follows.

Highest Score

  1. Bianca Kramer

Best Tool

  1. Bianca Kramer

I'm sure she is more than happy if you use her tool to calculate your score. If you're patient, you may even wish to take it one level deeper.

What are you talking about??
Well, the original post sheds some light on this, but basically scientific writing has become so dense that a single paper does not provide enough information. But if you cannot read the cited papers, you may not be able to precisely reproduce what they did. Now that many countries are steadily heading to 100% #OpenAccess it is time to start thinking about the next step. So, is the knowledge you built on also readily available, or is that still locked away?

For example, take the figure on the right-hand side: it shows when the articles that I cited in my work were published (a subset, because it is based on data in Wikidata, using the increasing amount of I4OC data). We immediately see some indication of the availability of the cited papers. The more yellow, the more available. However, keep in mind that this is based on "full text availability" information in Wikidata, which is very sparse. That makes Bianca's approach so powerful: it uses (the equally wonderful)

You also note the immediate quality issues. Apparently, this data tells me I am citing articles from the future :) You also see that I am citing some really old articles.

Saturday, December 16, 2017

The National Plan Open Science Estafette: my own first Open Science steps

Note: if you prefer to read Dutch, read the original.
Earlier this year Delft hosted a meeting for Dutch scholars aimed at hearing and learning about, and giving feedback on, the National Plan Open Science (doi:10.2777/061652). I'm very happy I have been able to contribute to this effort, because more and better access to knowledge is very dear to me. During lunch time everyone could demonstrate their own Open Science. From this the idea evolved to have a relay race ("estafette"). In each part of the relay someone will tell their Open Science story. This post is the start: every next runner tells their story on what role Open Science has in their research. And it does not matter if the focus is on Open Data, Open Access, or Open Source, because the diversity in the Dutch Open Science community is just very high.

My Open Science story goes back to the time that I was studying chemistry at what is now called the Radboud Universiteit. Chemistry students could get access to the internet in 1994 and this opened a world of Open knowledge to me! Our library was well stocked, but I still had to visit research departments to read certain journals. Always uncomfortable as a young student to walk into a coffee room with senior researchers.

I learned HTML and later Java. Java, with its applets, brought the internet to life. It could visualize 3D models of chemical structures. A paper journal cannot do that. Twenty years later journals still don't have this functionality, but that's not the point. In those three, four years I got introduced to three projects, each Open Source: Jmol (now JSmol), JChemPaint, and the file format "Chemical Markup Language" (CML). The first was to visualize 3D structures on the internet and the second was to visualize 2D chemical diagrams. CML was a format that could store 2D and 3D coordinates for me. But the problem was that neither Jmol nor JChemPaint could read CML.

But that's where Open Science comes in. After all, I could download the Jmol and JChemPaint source code, change it, and share that with others. That was brilliant! And I dived in. Of course, I could have just used my changes myself, but because I realized it could benefit others too, I sent my changes ("patches") to the authors of Jmol and JChemPaint. Extremely happy and proud I was when the two researchers from Germany and the U.S.A. included those patches in their version!

And in the end it was not in vain. In the final year of my chemistry study I submitted an abstract to an international conference. It got accepted! But now I had to go to Washington (Georgetown, to be precise), to talk about my work. On top of that, we agreed to meet the authors of Jmol and JChemPaint in South Bend, where we laid the foundation of a new Open Science project, the Chemistry Development Kit (CDK). Expensive trip, but fortunately I got a bursary from a Dutch company. A peculiar trip it was. We used an Amtrak sleeping train and had dinner with a soldier who served during D-Day. In New York I stepped off the sidewalk onto the street to evade a group of scary tough guys (who turned out to be a popular boy band), and we stood in the WTC (a year before 9/11) to hear two tourists ask at the musical ticket sale desk "What is Broadway?".

I am proud that I have been able to contribute to these Open Science projects and that I co-founded the CDK. The Open nature of these projects has had a significant impact and, after twenty years, still does. Sure, it's not the same as discovering a new protein or metabolite, but these projects definitely not only benefited my research. Of course, also with a huge thanks to Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, and Peter Murray-Rust.

BTW, thinking about this relay race, Open Science itself is also a relay race: you take the token of the people before you, adopt the token, and pass on the token to the next scientist. And every day the token gets brighter!

This Nationaal Plan Open Science Estafette also continues. I am delighted to pass my token to Rosanne Hertzberger. Read her story here (in Dutch).

Friday, December 15, 2017

Suggestions for ScholarlyHub

Mock Dashboard of ScholarlyHub.
(I'll assume CC-BY for this image.) 
ScholarlyHub is an open scholar profile project. I have no idea yet where this platform is going, but their planned open source nature makes me want to explore it nevertheless. The project is currently running a crowdfunding campaign and developing their plans. They asked for feedback, so here goes:

Feature requests:
  • researchers care about research, more than profiles: make things from their research ("topics") part of their profile; let them tell everyone what they are interested in
  • the website should have an API (good looks is not enough). Have you done a persona analysis? User friendly is only defined if you have defined your users.
  • make the resource FAIR: use or RDFa
  • show innovation into new scholarly activities: provide peer review functionality, etc (similar to Publons, PubPeer, PubMed Commons, etc)
  • support data and software citations
  • use identifiers (DOI, ORCID, project IDs (CORDIS, etc), etc)
  • integration of I4OC
  • freely provide #altmetrics
  • release soon, release often
  • use RSS for any bit of information on the site (one form of API, in fact)
  • integrate my social feeds into my profile (Twitter, blog, LinkedIn, etc)
You can browse my blog for other features I have recommended to websites in the past. You can also check Scholia for ideas.

New paper: "Integration among databases and data sets to support productive nanotechnology: Challenges and recommendations"

Figure 1 from the NanoImpact article. CC-BY.
The U.S.A. and European nanosafety communities have a longstanding history of collaboration. On both sides there are working groups, NanoWG and WG-F (previously called WG4) of the NanoSafety Cluster. I have been chair of WG4 for about three years and am still active in the group, though in the past half year, without dedicated funding, less so. That is already changing again with the imminent start of the NanoCommons project.

One of these collaborations resulted in a series of papers around data curation (see doi:10.1039/C5NR08944A and doi:10.3762/bjnano.6.189). Part of this effort was also a survey about the state of databases. A good number of databases responded to the call. It turned out to be non-trivial to analyse the results and write up a report around it with recommendations. The first version was submitted and rejected, and with fresh leadership, the paper underwent a significant restructuring by John Rumble, was resubmitted to Elsevier's NanoImpact, and is now online (doi:10.1016/j.impact.2017.11.002).

The paper gives an overview of challenges and a recommendation to the community on how to proceed. That is, basically, how should projects like eNanoMapper, caNanoLab, and Nanomaterial Registry evolve, and what might the European Union Observatory for Nanomaterials (EUON) look like? BTW, a similar paper by Tropsha et al. was published the other week with a focus on the USA database ecosystem (doi:10.1038/nnano.2017.233).

Have fun reading it, and if you are working in a related field, please join either of the two aforementioned working groups! And a huge thanks to everyone involved, particularly Sandra, John, and Christine.

Saturday, December 09, 2017

every house a library

Two weeks ago someone made me aware of This new social profile page allows you to digitize a list of your books. That in itself is already very useful. In the past I have made some efforts to create overviews of the books I own. If only to get some count, with respect to insurance, etc. But a very neat feature is the ability to find other people in your neighborhood with whom you may exchange books: for each book you can indicate who can see you own that book (no one, by default; if not mistaken, inheriting the ugw approach of UNIX), and if others can borrow or buy this book.

This shows users with public profiles in The Netherlands, showing the uptake has not been massive yet, but I'm hoping this post changes that :)

But the killer feature I am waiting for is a map like this, but then for books. Worldcat has a feature where it lists you the nearest library that has a copy of the book you are looking for:

And such a feature (map or list) would be brilliant, for every house would potentially be a library. That idea appeals to me.

Oh, and it's Wikidata-based but that should hardly be a surprise.

Sunday, December 03, 2017

The National Plan Open Science Estafette: my first Open Science steps

Note: For my English-only followers, I translated this to English.
Earlier this year a meeting for Dutch researchers was organized in Delft to hear about and give feedback on the National Plan Open Science (doi:10.2777/061652). I count myself lucky that I was able and allowed to think along, because more access to scientific knowledge is dear to my heart. During lunch people could show their Open Science. From this came the idea to set up a relay race ("estafette") to show as much as possible how much Open Science is actually already being done in The Netherlands. This is the start: every next relay runner tells something about their Open Science research. And whether the focus is on Open Data, Open Access, or Open Source does not matter much, because the diversity in the Dutch Open Science community is simply large.

My Open Science story goes back to the time I was a chemistry student at what is now called the Radboud Universiteit. We got access to the internet there in 1994, and this opened a world of Open knowledge to me! Our library was well stocked, but sometimes I had to go to departments to look at certain journals there. Always uncomfortable to walk into a coffee room with senior researchers as a student.

I learned HTML and later Java. Java, with its applets, brought the internet to life. It could show 3D models of chemical structures. A journal cannot do that. Twenty years later journals still cannot, but that aside. In the three, four years after that I got to know three projects, each Open Source: Jmol (now JSmol), JChemPaint, and the file format "Chemical Markup Language" (CML). The first was for showing 3D structures on the internet and the second visualized two-dimensional (2D) chemical structures. CML was a format in which I could store both 2D and 3D coordinates. But the thing was that Jmol and JChemPaint could not read CML at all.

But that is where Open Science came in. After all, I could download the source code of Jmol and JChemPaint, modify it, and share it with others. I was convinced! And got to work. Of course, I could have just used my modifications myself, but because I thought they might be useful to others too, I sent my modifications ("patches") to the authors of Jmol and JChemPaint. Overjoyed and proud I was when the scientists in Germany and the United States included them in their versions!

And it has all paid off for me. In the final year of my studies I submitted an abstract to an international conference. It was accepted, and so I had to go to Washington (Georgetown, to be precise) to talk about my work. On top of that, I had arranged to meet the authors of Jmol and JChemPaint in South Bend, where we laid the foundation for a new Open Science project, the Chemistry Development Kit (CDK). Expensive, but fortunately I got a bursary from a Dutch company. A wondrous trip it was. Having dinner in the sleeper train with a soldier who saw action during D-Day, stepping off the sidewalk in New York because tough guys were approaching (who turned out to be a well-known boy band), and standing in the WTC (a year before 9/11) and hearing tourists ask at the musical ticket desk "What is broadway?".

I am proud that I have been able to contribute to these Open Science projects and that I am a co-designer of the CDK. Because of their Open nature these projects have had considerable impact and, after twenty years, still do. Of course, it is not the function of a protein or metabolite, but these projects have certainly not only helped my research. With thanks to others, of course: Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, and Peter Murray-Rust.

By the way, speaking of relay races, Open Science is in itself a relay race too: you take over the baton from the people before you, give it your own twist, and then pass the baton on to the next person. And the baton gets more beautiful every day!

This National Plan Open Science Estafette continues too. I get to pass my baton to Rosanne Hertzberger. I am very curious about her Open Science story! And of course about those of all the runners who take part in the relay after that!

Sunday, November 26, 2017

Winter solstice challenge: what is your Open Knowledge score?

Source: Wikimedia, CC-BY 2.0.
Hi all, welcome to this winter solstice challenge! Umm, to not put our southern hemisphere colleagues at a disadvantage, as their winter solstice has already passed, you're up for a summer solstice challenge!

So, you know ImpactStory and (if not, browse my blog); these are wonderful tools to see what people are doing with your work. I hope you already know about OpenCitations, a collaboration of publishers, CrossRef, and many others, to make all citation data available. They just passed the 50% milestone, congratulations on that amazing achievement! For the younger scientists it may be worth realizing that for the past 20 years, at least, this data was copyrighted and not to be used unless you paid. Elsevier is, BTW, the major culprit still claiming IP on this, but RT this if you are surprised.

So, the reason I introduce both ImpactStory and OpenCitations is the following. Scientific articles are data- and knowledge-dense documents, so we redirect the reader to other literature that may give a more complete sketch of the context, describe a measurement protocol, describe how certain knowledge was derived, etc. Therefore, just having your article Open Access is not enough: the articles you cite should be Open Access too. That's the next phase if we are really making an effort to have all of humanity benefit from the fruits of science.

I know it is hard already to calculate an "Open Access" score, though ImpactStory does a great job at that! So, calculating this for your paper and the papers those papers cite is even harder. You may need to brush up your algorithm and programming skills.

Anyone is allowed to participate. Submission of your entry is done online, e.g. in your blog, in a public write-up, or even an open notebook! However, you need at least one citable research object. That is, it needs a DOI. Otherwise, I cannot give you the prize (see below). The score should be based on all your products. Bonus points for those who include software and data citations. Excluding citable objects to boost your score (for example, I would have to exclude my book chapters) is seen as cheating the system.

Your article B may cite three articles (C, D, J) but
article D also cited articles (F, I). So, your
Open Knowledge score is recursive.
Source: Wikipedia, CC-BY-SA 4.0
Calculating your Open Knowledge score can be done at multiple levels. After all, your article depends on (cites) articles, and your software depends on libraries, but those cited articles and software dependencies recursively also cite articles and/or software. The complexity is non-trivial, making it a perfect solstice challenge indeed!
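
To make the multi-level idea a bit more concrete, here is a small Python sketch of one possible way to compute such a score. It is only an illustration and not an official definition: it assumes the score is the fraction of openly available works among your article plus everything it cites, followed down to a chosen depth. The function names and the open/closed flags in the example are made up; the citation graph is the one from the figure caption (B cites C, D, J; D cites F, I).

```python
def reachable(start, cites, depth):
    """Collect `start` plus all works it cites, following citations `depth` levels."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        # works cited by the current level that we have not visited yet
        frontier = {c for work in frontier for c in cites.get(work, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def open_knowledge_score(start, cites, is_open, depth=1):
    """Fraction of openly available works among `start` and its citations."""
    works = reachable(start, cites, depth)
    return sum(1 for work in works if is_open.get(work, False)) / len(works)


# Citation graph from the figure; the open/closed flags are invented for this example
cites = {"B": ["C", "D", "J"], "D": ["F", "I"]}
is_open = {"B": True, "C": True, "D": False, "J": True, "F": False, "I": True}

print(open_knowledge_score("B", cites, is_open, depth=1))
print(open_knowledge_score("B", cites, is_open, depth=2))
```

Keeping track of the works already seen means that citation loops (yes, they exist in the data) do not send the calculation in circles. Other definitions, for example weighting the levels differently, are just as valid.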

The prize I have to offer is my continued commitment to Open Science, but that you already get for free and may not be enough boon. So, instead, soon after the winter/summer solstice at the end of this year, I will blog about your research boosting your #altmetrics scores. Yes, I will actually read and try to understand it!

And because there are the results and the method, neither of which exist yet, there are two categories! I just doubled your chance of winning! That's because humanity is worth it! One prize for the best tool to calculate your Open Knowledge score, and one prize for the researcher with the highest score.

Audience Prize
If someone feels a need to organize an audience prize, this is very much encouraged! (Assuming Open approaches, of course :)

Wednesday, November 22, 2017

Monitoring changes to Wikidata pages of your interest

Source: User:Cmglee, Wikipedia, CC-BY-SA 3.0
Wikidata is awesome! In just 5 years they have bootstrapped one of the most promising platforms for the future of science. Whether you like the tools more, or the CCZero, there is coolness for everyone. I'm proud to have been able to contribute my small 1x1 LEGO bricks to this masterpiece and hope to continue this for many years to come. There are many people doing awesome stuff, and many have way more time, have better skills, etc. Yes, I'm thinking here of Finn, Magnus, Andra, the whole Su team, and many, many more.

The point of this post is to highlight something that matters, that comes up over and over again, and for which solutions simply exist, like those implemented by Wikidata: provenance. We're talking a lot about FAIR data. Most of FAIR data is not technological, it's social. And most of the technical work going on now is basically to overcome those social barriers.

We teach our students to cite primary literature and only that. There is a clear reason for that: the primary literature has the arguments (experiments, reasoning, etc) that back a conclusion. Not just any citation is good enough: it has to be the exact right shape (think about that Lego brick). This track record of our experiments is a wonderful and essential idea. It removes the need for faith and even trust. Faith is for the religious, trust is for the lazy. Now, without being lazy, it is hard to make progress. But as I have said before (Trust has no place in science #2), every scholar should realize that "trust" is just a social equivalent of saying you are lazy. There is nothing wrong with being lazy: a side effect of it is innovation.

Ideally, we do not have to trust any data source. If we must, we just check where that source got its data from. That works for scholarly literature, and works for other sources too. Sadly, scholarly literature has a horrible track record here: we only cite stuff we find more trustworthy. For example, we prefer to cite articles from journals with high impact factors. Second, we don't cite data. Nor software. As a scholarly community, we don't care much about that (this is where lazy is evil, btw!).

Wikidata made the effort to make a rich provenance model. It has a rich system of referring to information sources. It has version control. And it keeps track of who made the changes.

Of all the awesomeness of Wikidata, Magnus is one of the people that know how to use that awesomeness. He developed many tools that make doing the right thing a lot easier. I'm a big fan of his SourceMD, QuickStatements, and two newer tools, ORCIDator and SPARQL-RC. This latter tool leverages SPARQL (and thus Wikidata RDF) and the version control system. By passing a query, it will list all changes in a given time period. I am still looking for a tool that can show me all changes for items I originally created, but this already is a great tool to monitor the quality of crowdsourcing for data in Wikidata I care about. No trust, but the ability to verify.

Here's a screenshot of the changes to (some of) the scientific output I am author of:

Sunday, November 12, 2017

New paper: "WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research"

Focus on metabolic pathways increases the
number of annotated metabolites, further improving
the usability in metabolomics. Image: CC-BY.
TL;DR: the WikiPathways project (many developers in the USA and Europe, contributors from around the world, and many people curating content, etc) has published a new paper (doi:10.1093/nar/gkx1064/4612963), with a slight focus on metabolism. 

Full story
Almost six years ago my family and I moved back to The Netherlands for personal reasons. Workwise, I had a great time in Stockholm and Uppsala (two wonderful universities; thanks to Ola Spjuth, Bengt Fadeel, and Roland Grafström), but being an immigrant in another country is not easy, not even for a western immigrant in a western country. ("There is evil among us.")

We had decided to return to our home country, The Netherlands. By sheer coincidence, I spoke with Chris Evelo in the week directly following that weekend. I had visited his group in March that year, while attending a COST-action about NanoQSAR in Maastricht. I had never been to Maastricht University yet, and this group, with their Open Source and Open Data projects, particularly WikiPathways, would give us enough to talk about. Chris had a position on the Open PHACTS project open. I was interested, applied, and ended up in the European WikiPathways group led by Martina Kutmon (the USA node is the group of Alex Pico).

Fast forward to now. It was clear to me that biological text book knowledge was unusable for any kind of computation or machine learning. It was hidden, wrongly represented, and horribly badly annotated. In fact, it still is a total mess. WikiPathways offered machine readable text book knowledge. Just what I needed to link the chemical and biological worlds. The more accurate biological annotation we put in these pathways, or semantically link to these pathways, the more precise our knowledge becomes and the better computational approaches can find and learn patterns not obvious to the human eye (it goes both ways, of course! Just read my PhD thesis.)

Over the past 5-6 years I got more and more involved in the project. Our Open PHACTS tasks did involve WikiPathways RDF (doi:10.1371/journal.pcbi.1004989), but Andra Waagmeester (now Micelio) was the lead on that. I focused on the Identifier Mapping Service, based on BridgeDb (together with great work from Carole Goble's lab, e.g. Alasdair and Christian). I focused on metabolomics.

Indeed, there was plenty to be done in terms of metabolic pathways in WikiPathways. The current database had a strong focus on the genetics and proteins aspects of the pathways. In fact, many metabolites were not datanodes and therefore did not have identifiers. And without identifiers, we cannot map metabolomics data to these pathways. I started working on improving these pathways, and we did some projects using it for metabolomics data (e.g. a DTL Hotel Call project led by Lars Eijssen).

The point of this long introduction is that I am standing on the shoulders of giants. The top right figure shows, besides WikiPathways itself, and the people I just mentioned, more giants. This includes Wikidata, which we previously envisioned as hub of metabolite information (see our Enabling Open Science: Wikidata for Research (Wiki4R) proposal). Wikidata allows me to solve the problem that CAS registry numbers are hard to link to chemical structures (SMILES): it has some 70 thousand CAS numbers.

SPARQL query that lists all CAS registry numbers in Wikidata, along with the matching
SMILES (canonical and isomeric), database entry, and name of the compound. Try it.
A lot more about CAS registry numbers is found in my blog.
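
As a rough illustration of how the output of such a query can be used, the Python sketch below turns a result in the standard SPARQL JSON results format into a lookup table from CAS registry number to SMILES. The variable names `cas` and `smiles` and the function name `cas_to_smiles` are assumptions, not necessarily those of the actual query, and the sample result is hand-written (methane and ethanol) rather than downloaded from Wikidata.

```python
import json


def cas_to_smiles(results_json):
    """Map CAS registry numbers to SMILES from a SPARQL JSON result string."""
    data = json.loads(results_json)
    mapping = {}
    for row in data["results"]["bindings"]:
        # OPTIONAL variables may be missing from a row, so check for both
        if "cas" in row and "smiles" in row:
            mapping[row["cas"]["value"]] = row["smiles"]["value"]
    return mapping


# Hand-written sample in the SPARQL 1.1 JSON results layout
sample = json.dumps({
    "head": {"vars": ["cas", "smiles"]},
    "results": {"bindings": [
        {"cas": {"type": "literal", "value": "74-82-8"},
         "smiles": {"type": "literal", "value": "C"}},
        {"cas": {"type": "literal", "value": "64-17-5"},
         "smiles": {"type": "literal", "value": "CCO"}},
        {"cas": {"type": "literal", "value": "0000-00-0"}},  # made-up row without SMILES
    ]},
})

print(cas_to_smiles(sample))
```

Rows without a SMILES are skipped here; whether that is the right choice depends on what you want to do with the table.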
Finally, but certainly not least, is Denise Slenter, who started this spring in our group. She picked up things I and others were doing very quickly (for example this great work from Maastricht Science Programme students), gave those her own twist, and is now leading the practical work in taking this to the next level. This new WikiPathways paper shows the fruits of her work.

Of course, there are plenty of other pathway databases. KEGG is still the gold standard for many. And there is the great work of Reactome, RECON, and many others (see references in the NAR article). Not to mention the important resources that integrate pathway resources. To me, unique strengths of WikiPathways include the community approach, very liberal licence (CCZero), many collaborations (do we have a slide on that?), and, importantly, its expressiveness. The latter allows our group to do the systems biology work that we do, analyzing microRNA/RNASeq data, studying diseases at a molecular interaction level, seeing the effects of personal genetics (SNPs, GWAS), and visually integrating and summarizing the combination of experimental data and text book knowledge.

OK, this post is now already long enough. And judging from the length, you can see how impressed I am with WikiPathways and where it is going. Clearly, there is still a lot left to do. And I am just another person contributing to the project, honored that we could give this WikiPathways paper a metabolomics spin. HT to Alex, Tina, and Chris for that!

Slenter, D. N., Kutmon, M., Hanspers, K., Riutta, A., Windsor, J., Nunes, N., Mélius, J., Cirillo, E., Coort, S. L., Digles, D., Ehrhart, F., Giesbertz, P., Kalafati, M., Martens, M., Miller, R., Nishida, K., Rieswijk, L., Waagmeester, A., Eijssen, L. M. T., Evelo, C. T., Pico, A. R., Willighagen, E. L., Nov. 2017. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Research.

Sunday, October 29, 2017

Happy Birthday, Wikidata!

Wikidata celebrates its 5th birthday with a great WikidataCon in Berlin. Sadly, I could not join in person, so I am assuming it is a great meeting, following the #WikidataCon hashtag and occasionally the live stream.

My first encounter was soon after they started, and I was particularly impressed by the presentation by Lydia Pintscher at the Dutch Wikimedia Conferentie 2012. I had played with DBpedia occasionally but was always disappointed by the number of issues with extracting chemistry from the ChemBox infobox. That is of course the general problem with data that has been mangled into something that looks nice. We know that problem from text mining PDFs too. Of course, if you start with something machine readable in the first place, your odds of success are much higher.

Yesterday, Lydia presented the State of Wikidata and I think they delivered on their promise.

I did not create my Wikidata account until a year later, and did not use the account much in the first two years. But the Wikidata team did a lot of great work in their first three years, and somewhere in 2015 I wrote my first blog post about Wikidata. That year Daniel Mietchen also asked me to join the writing of a project proposal (later published in RIO Journal). The reasons for more active adoption of Wikidata and joining Daniel's writing team were the CCZero license and that chemical identifiers had really picked up. Indeed, free CAS numbers were an important boon. Since then, I have been using Wikidata as data source for our BridgeDb project and for WikiPathways (together with Denise Slenter). I also have to mention that the work by Andra Waagmeester and the rest of the Andrew Su team gave me extra support to push Wikidata in our local research agenda around FAIR data.

The Wikidata RDF export and SPARQL end point were an important tipping point. They make reuse of Wikidata so much easier. Fetching slices of data with curl is trivial and easy to integrate into other projects, as I do for BridgeDb. Someone in the education breakout session mentioned that you can use the interactive SPARQL end point even with people with zero programming experience. I wholeheartedly agree. That is exactly what I did last Thursday at the Surf Verder bouwen aan Open Science seminar. The learning curve with all the example queries is so shallow, it is generally applicable.
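To give an idea of how little is needed to reuse such a slice of data, here is a minimal sketch that turns a SPARQL JSON response into plain rows. The `rows` and `qid` helpers are hypothetical names, and the sample response is hand-written in the standard SPARQL 1.1 JSON results layout rather than fetched from Wikidata; this is not the BridgeDb code.

```python
import json

# Hand-written sample in the SPARQL 1.1 JSON results format (illustrative only).
SAMPLE = """
{
  "head": {"vars": ["compound", "cas"]},
  "results": {"bindings": [
    {"compound": {"type": "uri", "value": "http://www.wikidata.org/entity/Q153"},
     "cas": {"type": "literal", "value": "64-17-5"}}
  ]}
}
"""


def rows(result_json: str):
    """Yield one {variable: value} dict per result binding."""
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    for binding in doc["results"]["bindings"]:
        # Unbound (OPTIONAL) variables are simply absent from a binding.
        yield {v: binding[v]["value"] for v in variables if v in binding}


def qid(entity_uri: str) -> str:
    """Strip the entity URI prefix, leaving the Wikidata Q-code."""
    return entity_uri.rsplit("/", 1)[-1]


for row in rows(SAMPLE):
    print(qid(row["compound"]), row["cas"])
```

The same function works on the output of a curl call to the end point, as long as JSON results were requested.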

And then there is Scholia. What do I need to say? Impressive project by Finn Nielsen to which I am happy to contribute. Check out his WikidataCon talk. Here I am contributing to the biology corner and working on RSS feeds. It makes a marvelous tool to systematically analyze literature, e.g. for the Rett Syndrome as disease or as topic.

Wikidata has evolved into a tremendously useful resource in my biology research and I cannot imagine where we will be next year, at the sixth Wikidata birthday. But it will be huge!

Sunday, October 15, 2017

Two conference proceedings: nanopublications and Scholia

The nanopublication conference article in
It takes effort to move scholarly publishing forward. And the traditional publishers have not all proven to be good at that: we're still basically stuck with machine-broken channels like PDFs and ReadCubes. They seem to all love text mining, but only if they can do it themselves.

Fortunately, there are plenty of people who do like to make a difference and like to innovate. I find this important, because if we do not do it, who will? Two people who make an effort are two researchers who recently published their work as conference proceedings: Tobias Kuhn and Finn Nielsen. And I am happy to have been able to contribute to both efforts.

Tobias works on nanopublications, which innovate how we make knowledge machine readable. And I have stressed how important this is in my blog for years. Nanopublications describe how knowledge is captured, make it FAIR, and, importantly, link the knowledge to the research that led to it. His recent conference proceedings detail how nanopublications can be used to establish incremental knowledge. That is, given two sets of nanopublications, the approach determines which have been removed, added, and changed. The paper continues by outlining how that can be used to reduce, for example, download sizes and how it can help establish an efficient change history.
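The basic idea of comparing two sets can be illustrated with a small sketch. This is not the method from the paper: it simply assumes that every nanopublication has a stable identifier and a content hash, and the identifiers, hashes, and the `diff_snapshots` function are all made up for illustration.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {identifier: content_hash} snapshots of a dataset."""
    old_ids, new_ids = set(old), set(new)
    return {
        # Present before, gone now.
        "removed": sorted(old_ids - new_ids),
        # Not present before, present now.
        "added": sorted(new_ids - old_ids),
        # Present in both, but with different content.
        "changed": sorted(k for k in old_ids & new_ids if old[k] != new[k]),
    }


# Hypothetical snapshots of the same dataset at two points in time.
release_1 = {"np1": "hashA", "np2": "hashB", "np3": "hashC"}
release_2 = {"np2": "hashB", "np3": "hashX", "np4": "hashD"}

print(diff_snapshots(release_1, release_2))
```

With such a diff, a client that already has the first release only needs to download the added and changed entries.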

And Finn developed Scholia, an interface not unlike Web-of-Science, but based on Wikidata and therefore fully on CCZero data. And with a community actively adding the full history of scholarly literature and the citations between papers, courtesy of the Initiative for Open Citations. This is opening up a lot of possibilities: from keeping track of articles citing your work, to getting alerts of articles publishing new data on your favorite gene or metabolite.

Kuhn T, Willighagen E, Evelo C, Queralt-Rosinach N, Centeno E, Furlong L. Reliable Granular References to Changing Linked Data. In: d'Amato C, Fernandez M, Tamma V, Lecue F, Cudré-Mauroux P, Sequeda J, et al., editors. The Semantic Web – ISWC 2017. vol. 10587 of Lecture Notes in Computer Science. Springer International Publishing; 2017. p. 436-451. doi:10.1007/978-3-319-68288-4_26

Nielsen FÅ, Mietchen D, Willighagen E. Scholia and scientometrics with Wikidata.; 2017. Available from:

Sunday, October 08, 2017

CDK used in SIRIUS 3: metabolomics tools from Germany

Screenshot from the SIRIUS 3 Documentation.
License: unknown.
It has been ages since I blogged about work I heard about and think should receive more attention. So, I'll try to pick up that habit again.

After my PhD research (about machine learning (chemometrics, mostly), crystallography, QSAR) I first went into the field of metabolomics, because it combines core chemistry with the complexity of biology. My first position was with Chris Steinbeck, in Cologne, within the bioinformatics institute led by Prof. Schomburg (of the BRENDA database). During that year, I worked in a group that worked on NMR data (NMRShiftDb, dr. Stefan Kuhn), Bioclipse (collaboration with Ola Spjuth), and, of course, the Chemistry Development Kit (see our new paper).

This new paper, actually, introduces functionality that was developed in that year, for example, work started by Miquel Rojas-Cheró. This includes the work on atom types, which we needed to handle radicals, lone pairs, etc, for delocalisation. It also includes work around handling molecular formulas and calculating molecular formulas from (accurate) molecular masses. For the latter, more recent work improved even further on earlier work.

So, whenever metabolomics work is published and they use the CDK, I realize that what the CDK does has impact. This week Google Scholar alerted me about a user guidance document for SIRIUS 3 (see the screenshot). Seems really nice (great) work from Sebastian Böcker et al.!

It also makes me happy, as our Faculty of Health, Medicine, and Life Sciences (FHML) is now part of the Netherlands Metabolomics Center, and as we recently published an article on our vision of a stronger, more FAIR European metabolomics community.

Wednesday, October 04, 2017

new paper: "The future of metabolomics in ELIXIR"

CC-BY from F1000 article.
This spring I attended a meeting organized by researchers from the European metabolomics community, including from PhenoMeNal, to talk about proposing a use case to ELIXIR. Doing research in metabolomics and being part of ELIXIR, I was happy that this meeting happened. During the meeting I presented the work from our BiGCaT group (e.g. WikiPathways, see doi:10.1093/nar/gkv1024).

During the meeting various metabolomics topics were discussed, and I pushed for interoperability of chemical (metabolic) structures, which requires structure normalization, equivalence testing, etc. You know, the kind of work that partners in Open PHACTS did, and that we're now trying to bootstrap with ChemStructMaps. It did not make it, but ideas are included in the selected topic.

All this you can read in this meeting write up, peer-reviewed in F1000Research (doi:10.12688/f1000research.12342.1). I am happy to have been given the opportunity to contribute to this work. The work in our group (e.g. from our PhD student Denise) can surely contribute to this community effort.

Van Rijswijk M, Beirnaert C, Caron C, Cascante M, Dominguez V, Dunn WB, et al. The future of metabolomics in ELIXIR. F1000Research. 2017 Sep;6:1649+. doi:10.12688/f1000research.12342.1

Saturday, September 09, 2017

New paper: "RDFIO: extending Semantic MediaWiki for interoperable biomedical data management"

Figure 10 from the article showing what the DrugMet wiki
with the pKa data looked like. CC-BY.
When I was still doing research at Uppsala University, I had an internship student, Samuel Lampa, who did wonderful work on knowledge representation and logic (check his thesis). In that same period he started RDFIO, a Semantic MediaWiki extension to provide a SPARQL end point and some clever features to import and export RDF. As I was already using RDF in my research, and wikis are a great way to explore how to model domain data, particularly when extracted from diverse literature, I was quite interested. Together we worked on capturing pKa data, and Samuel put DrugMet online. Extracting pKa values from primary literature is a lot of laborious work and crowdsourcing did not pick up. This data was migrated to Wikidata about a year ago.

I also used the RDFIO extension when I started capturing nanosafety data from literature when I worked at Karolinska Institutet. I will soon write up this work, as the NanoWiki (check out these FigShare data releases) was a seminal data set in eNanoMapper, during which I continued adding data to test new AMBIT features.

Earlier this week Samuel's write up of his RDFIO project was published, to which I contributed the pKa use case (doi:10.1186/s13326-017-0136-y). There are various ways to install the software, as described on the RDFIO project site. The DrugMet data, as well as the OrphaNet data from the other example use case, can also be downloaded from that site.

Lampa, S., Willighagen, E., Kohonen, P., King, A., Vrandečić, D., Grafström, R., & Spjuth, O. (2017). RDFIO: extending semantic MediaWiki for interoperable biomedical data management. Journal of Biomedical Semantics, 8 (1).

Sunday, August 27, 2017

DataCite: the PubMed for data and software

We have services like PubMed, Europe PMC, and Google Scholar to make a list of literature. Scholia/Wikidata and ORCID are upcoming services, but for data and software there are fewer options. One notable exception is DataCite (two past blogs where I mentioned it). Plenty of caution is needed when interpreting the results, for example because of versioning and the fact that preprints, posters, etc are also hosted by the supported repositories (e.g. Figshare, Zenodo), but it seems the faceted browsing based on metadata is really improving.

This is what my recent "DataCite" history looks like:

And it gets even more exciting when you realize that DataCite integrates with ORCID so that you can have it all listed on your ORCID profile.

Saturday, August 26, 2017

Updated HMDB identifier scheme

I have not found further details about it yet, but noticed half an hour ago that the Human Metabolome Database (doi:10.1093/nar/gks1065) seems to have changed all their identifiers: they added extra zeros. The screenshot for D-fructose on the right shows how the old identifiers are now secondary identifiers. We will face a period of a few years in which some resources still use the old identifiers (archives, supplementary information, other databases, etc).

This change has huge implications, including that mere string matching of identifiers becomes really difficult: we need to know if it uses the old scheme or the new scheme. Of course, we can see this simply from the identifier length, but we likely need a bit of software ("artificial intelligence") in our software.

I ran into the change just now, because I was working on the next BridgeDb metabolite identifier mapping database. The release of this weekend will not have the new identifiers for sure: I first need more info, more detail.

For now, if you use HMDB identifiers in your database, get prepared! Using old identifiers to link to the HMDB website seems to work fine, as they have a redirect working at the server level. Starting to think about internally updating your identifiers (by adding two zeros) is likely something to put on the agenda.
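Such an internal update could look like the sketch below. It assumes that old-style identifiers are HMDB followed by five digits and new-style identifiers are HMDB followed by seven digits, which matches adding two zeros; since I do not have further details on the new scheme yet, treat that as an assumption. The `to_new_hmdb` function is a made-up helper, not part of BridgeDb.

```python
import re

# Assumption: old scheme = "HMDB" + 5 digits, new scheme = "HMDB" + 7 digits.
_HMDB_PATTERN = re.compile(r"^HMDB(\d{5}|\d{7})$")


def to_new_hmdb(identifier: str) -> str:
    """Return the seven-digit form of an old or new style HMDB identifier."""
    match = _HMDB_PATTERN.match(identifier.strip())
    if match is None:
        raise ValueError(f"Not a recognized HMDB identifier: {identifier!r}")
    # Left-pad the numeric part with zeros up to seven digits.
    return "HMDB" + match.group(1).zfill(7)


print(to_new_hmdb("HMDB00660"))    # old style, gets two extra zeros
print(to_new_hmdb("HMDB0000660"))  # already new style, returned unchanged
```

Normalizing both sides to the long form before comparing avoids having to know which scheme a given resource uses.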

What about postprint servers?

Various article version types, including pre and post.
Now that preprint servers are picking up speed, let's talk about postprint servers. Sure, we have plenty of places to place and find discussions about the content of articles (e.g. PubPeer, PubMed Commons, ...), and sure we have retractions and corrections.

But what if we could just make revisions of articles?

And I'm not only talking about typo-fixes, but also clarifications that show up during post-publication peer-review. Not about full revisions; if a paper is wrong, then this is not the method of choice. They should not happen frequently either, but sometimes it is just convenient. Maybe to fix broken website URLs?

One point is, ResearchGate, Academia, Mendeley, and the likes allow you to host versions, but we need to track the fixes and versioned DOIs. That metadata is essential: it is the FAIRness of the post-publication life time of a publication.

Thursday, August 17, 2017

Text mining literature that mention JRC representative nanomaterials

The week before a short holiday in France (nature, cycling, hiking, touristic CERN visit; thanks to Philippe for the ViaRhone tip!), I did some further work on contentmining literature that mentions the JRC representative nanomaterials. One important reason was that I could play with the tools developed by Lars in his fellowship with The ContentMine.

I had about one day, as there always is work left over to finish in your first week of holiday, and had several OS upgrades to do too (happily running the latest 64bit Debian!). But, as good practice, I kept an Open Notebook Science record, and the initial run of the workflow turned out quite satisfactory:

What we see here is content mined from literature searched with "titanium dioxide" with the getpapers tool. AMI then extracted the nanomaterials and species information. Tools developed by Lars aggregated all information into a single JSON, which I converted into input for cytoscape.js with a simple Groovy script. Yeah, click on the image, and you get the live network.
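The conversion step is roughly as follows, here sketched in Python rather than the Groovy I actually used. The layout of the aggregated JSON is an assumption made up for this example (a list of records with `paper`, `materials`, and `species` keys), as are the sample values; only the cytoscape.js element structure (`data` with `id`, `source`, `target`) follows that library's format.

```python
import json


def to_cytoscape(records):
    """Build cytoscape.js elements linking materials to co-mentioned species."""
    nodes, edges = {}, {}
    for record in records:
        materials = record.get("materials", [])
        species_list = record.get("species", [])
        for material in materials:
            nodes[material] = {"data": {"id": material, "type": "material"}}
        for species in species_list:
            nodes[species] = {"data": {"id": species, "type": "species"}}
        # One edge per material/species pair mentioned in the same paper;
        # using a dict keyed on the edge id removes duplicates across papers.
        for material in materials:
            for species in species_list:
                edge_id = f"{material}--{species}"
                edges[edge_id] = {
                    "data": {"id": edge_id, "source": material, "target": species}
                }
    return list(nodes.values()) + list(edges.values())


# Hypothetical aggregated output for two papers.
sample = [
    {"paper": "A", "materials": ["NM-100"], "species": ["Homo sapiens", "Mus musculus"]},
    {"paper": "B", "materials": ["NM-100"], "species": ["Homo sapiens"]},
]

elements = to_cytoscape(sample)
print(json.dumps(elements, indent=2))
```

The printed JSON array can be used as the `elements` option when creating the cytoscape.js network.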

So, if I find a bit of time before I get back to work, I'll convert this output also to eNanoMapper RDF for loading into Of course, then I will run this on other EuropePMC searches too, for other nanomaterials.

Sunday, July 30, 2017

Wikidata visualizes SMILES strings with John Mayfield's CDK Depict

SVG depiction of D-ribulose.
Wikidata is building up a curated collection of information about chemicals. A lot of data originates from Wikipedia, but active users are augmenting this information. Of particular interest, in this respect, is Sebastian's PubChem ID curation work (he can use a few helping hands!). Followers of my blog know that I am using Wikidata as source of compound ID mapping data for BridgeDb.

Each chemical can have one or two associated SMILES strings: a canonical SMILES, which excludes any chirality, and an isomeric SMILES, which does include chirality. Because statement values can be linked to a formatter URL, Wikidata often has values associated with a link. For example, for the EPA CompTox Dashboard identifiers it links to that database. Kopiersperre used this approach to link to John Mayfield's CDK Depict.

Until two weeks ago, the formatter URL for both the canonical and isomeric SMILES was the same. I changed that, so that when an isomeric SMILES is depicted, it shows the perceived R,S (CIP) annotation as well. That should help further curation of Wikidata and Wikipedia content.
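How a formatter URL turns a statement value into a link can be sketched in a few lines. Wikidata formatter URLs use `$1` as the placeholder for the value; the two templates below are hypothetical stand-ins on example.org, not the actual CDK Depict URLs, and the `expand_formatter_url` helper is made up for this illustration. The value is percent-encoded here because SMILES strings contain characters such as `#` and `=` that have a special meaning in URLs.

```python
from urllib.parse import quote

# Hypothetical templates; "$1" is the placeholder Wikidata uses in formatter URLs.
CANONICAL_TEMPLATE = "https://depict.example.org/svg?smi=$1"
ISOMERIC_TEMPLATE = "https://depict.example.org/svg?smi=$1&annotate=cip"


def expand_formatter_url(template: str, value: str) -> str:
    """Fill the $1 placeholder with the percent-encoded statement value."""
    return template.replace("$1", quote(value, safe=""))


print(expand_formatter_url(CANONICAL_TEMPLATE, "C#N"))
print(expand_formatter_url(ISOMERIC_TEMPLATE, "C[C@H](N)C(=O)O"))
```

Having separate templates for the two SMILES properties is what makes it possible to request the extra stereochemistry annotation only for the isomeric SMILES.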

Wednesday, July 05, 2017

new paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc) and how biological systems react to them. Basically, testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. Bottom line was, it was not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, they selected the chemicals in that study based on chemical diversity. All the details can be found in this new paper.

It's important to note that these findings do not invalidate the QSAR concept, but just show that the compounds were very unfortunately selected, making exploration of this idea impossible, by design.

However, using the transcriptomics data and a method developed by Juuso Parkkinen we were able to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not been able to back up further findings with supporting evidence yet. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses. But a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to exposure to toxicants.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8.

Saturday, June 24, 2017

The Elsevier-SciHub story

I blogged earlier today why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93% and notes that 10% of the scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And when the STM industry spreads FUD for their and only their good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly, as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub was going to be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for presidency).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occuring topics graph" provided by Scholia at the time of writing:

It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud's University Library certainly did not have all journals and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big Package deals improved access, but created a vendor lock-in. And we're paying Big Time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw: for PhD students this was very common too. In fact, they regularly visited other universities, just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal cost for the visited university, of course). Nothing much has changed, despite the fact that in this electronic age the cost should have gone down significantly, instead of up.

That Elsevier sues Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important: how does our society want to continue? Do we want a fact-based society, where dissemination of knowledge is essential, or do we want a society where power and money decide who benefits from knowledge?

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain outright FUD. In fact, it's outright lies. The fact that Nature does not call out those lies in their write up is very disappointing, indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed. Whether Open Science, or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library for continuing to push Open Access in Maastricht.