Saturday, December 30, 2017

Adding SMILES, InChI, etc to Wikidata alkane pages

Ten alkanes in Wikidata. The ones without a CAS registry number previously did not have an InChIKey or PubChem CID. But no more; I added those.
While working on the 'chemical class' aspect for Scholia yesterday, I noted that the page for alkanes was quite large, with a list of more than 50 long-chain alkanes that have pages in the Japanese Wikipedia but no SMILES, InChI, InChIKey, etc.

So, I dug up my Bioclipse scripts to add chemicals to Wikidata starting with a SMILES (btw, the script has evolved significantly since) and extended the query of that Scholia aspect to list just the Wikidata Q-code and name. This script starts with one or more SMILES strings and generates QuickStatements (a must-learn).
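To give an idea of what such QuickStatements look like, here is a minimal Python sketch (not the Bioclipse script itself). It assumes the input is a list of "SMILES, tab, Q-code" lines and that the SMILES goes into Wikidata property P233 (canonical SMILES). The function name `to_quickstatements` is made up for this illustration, and the Q-code in the usage note is the Wikidata Sandbox item, used purely as a placeholder.

```python
def to_quickstatements(tsv_lines, prop="P233"):
    """Turn 'SMILES<TAB>Q-code' lines into tab-separated QuickStatements commands.

    Assumption: string values are written between double quotes, as in the
    QuickStatements version 1 text syntax.
    """
    statements = []
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        smiles, qid = line.split("\t")
        statements.append(f'{qid}\t{prop}\t"{smiles}"')
    return statements
```

Calling `to_quickstatements(["CCCCCCCC\tQ4115189"])` gives one command that can be pasted into the QuickStatements tool. Other identifiers (InChI, InChIKey, PubChem CID) would follow the same pattern with their own property numbers.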

Because the Wikidata entries also had the English IUPAC name, I can use that to autogenerate SMILES. Enter the OPSIN (doi:10.1021/ci100384d) plugin for Bioclipse which in combination with the CDK allowed me to create the matching SMILES, InChI, InChIKey, and use the latter to look up the PubChem compound identifier (CID). This is the script I ended up with:

inputFile = "/Wikidata/Alkanes/alkanes.tsv"
new File(bioclipse.fullPath(inputFile)).eachLine { line ->
  fields = line.split("\t")
  if (fields[0].startsWith("")) {
    wdid = fields[0].substring("".length())
    name = fields[1]
    if (fields.length > 2) { // skip entities that already have an InChIKey
      inchikey = fields[2]
      // println "Skipping: $wdid $inchikey"
    } else { // ok, consider adding it
      // println "Considering $wdid $name"
      try {
        // let OPSIN turn the IUPAC name into a molecule, then have the CDK write the SMILES
        mol = opsin.parseIUPACName(name)
        smiles = cdk.calculateSMILES(mol)
        //println "  SMILES: $smiles"
        println "${smiles}\t${wdid}"
      } catch (Exception error) {
        //println "Could not parse $name with OPSIN: ${error.message}"
      }
    }
  }
}
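
The InChIKey-to-CID lookup mentioned above was done from within Bioclipse. As a rough, swapped-in alternative, the same lookup can be sketched in Python against the PubChem PUG REST service. The function names are made up for this illustration, and the example InChIKey in the usage note is that of methane.

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(inchikey):
    """Build the PUG REST URL that lists the CIDs matching an InChIKey as plain text."""
    return f"{PUG_REST}/compound/inchikey/{quote(inchikey)}/cids/TXT"


def parse_cids(text):
    """Parse the plain-text response: one CID per line."""
    return [int(token) for token in text.split()]


def lookup_cids(inchikey, timeout=30):
    """Fetch the CIDs for an InChIKey (needs network access; errors are not handled here)."""
    with urlopen(cid_lookup_url(inchikey), timeout=timeout) as response:
        return parse_cids(response.read().decode("utf-8"))
```

For example, `lookup_cids("VNWKTOKETHGBQD-UHFFFAOYSA-N")` asks PubChem for the compound(s) with that InChIKey. An InChIKey unknown to PubChem makes the request fail with an HTTP error, which a real script would want to catch.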

That way, I ended up with changes like this:

Friday, December 29, 2017

Using Scholia as Open Notebook Science tool to support literature searching

Source: Compound Interest, Andy Brunning.
I have blogged about Scholia and the underlying Wikidata before. Following the example of this WikiProject Zika Corpus I am using Scholia (doi:10.1007/978-3-319-70407-4_36, or in Scholia, of course :) as a tool to support a literature study, to collect articles about a certain topic. Previously I used it to track the publication trail around the Elsevier-SciHub interactions. But when I was linking the Compound Interest infographics for the Advent 2017 series to Wikidata items (aiming to archive them on Zenodo), I ran into the poisonous mistletoe graphic of day 9. This graphic mentions the phoratoxins. Sadly, not much was recorded about them in Wikidata.

So, I did a quick scan of the literature (about half an hour, using Google Scholar). I ended up with a few articles about the chemistry of these compounds, and as a good open scientist I used Wikidata and Scholia as a notebook:

In these papers I found references to six specific phoratoxins, A-F, for which I subsequently created Wikidata items:

I have a lot to discover about these cyclic peptides and they cannot be found in PubChem or ChemSpider (yet):

The SPARQL query I use is as follows, and you can run it yourself (note the "edit" link in the left corner of this link):

SELECT ?mol ?molLabel ?InChIKey ?CAS ?ChemSpider ?PubChem_CID WITH {
  SELECT ?mol WHERE {
    ?mol wdt:P31/wdt:P279* wd:Q46995757 .
  } LIMIT 500
} AS %result
WHERE {
  INCLUDE %result
  OPTIONAL { ?mol wdt:P235 ?InChIKey }
  OPTIONAL { ?mol wdt:P231 ?CAS }
  OPTIONAL { ?mol wdt:P661 ?ChemSpider }
  OPTIONAL { ?mol wdt:P662 ?PubChem_CID }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
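
Besides the web interface, a query like this can also be run from a script. The following is a minimal Python sketch, assuming the public Wikidata Query Service endpoint and its JSON results format. The function names and the User-Agent string are placeholders made up for this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query):
    """Build a GET request that asks the endpoint for JSON results."""
    url = WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    # Placeholder User-Agent: replace it with something that identifies your own script.
    return Request(url, headers={"User-Agent": "chemical-class-example/0.1"})


def rows_from_results(results):
    """Flatten SPARQL JSON results into one dict per row (variable name -> value)."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in results["results"]["bindings"]
    ]


def run_query(query, timeout=60):
    """Send the query and return the flattened rows (needs network access)."""
    with urlopen(build_request(query), timeout=timeout) as response:
        return rows_from_results(json.load(response))
```

Variables bound in an OPTIONAL clause are simply missing from a row when Wikidata has no value, so `row.get("InChIKey")` is a safe way to read them.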

And since I had a few other compound classes there, and in our metabolomics research too, of course, I finally hacked up an extension of Scholia for chemical classes (pull request pending). This is what it looks like for fatty acids:

That makes browsing information about chemicals in Wikidata a lot easier and supports our effort to link WikiPathways to Wikidata considerably.

I also used this approach for other topics:

Looking at these pages again, it's great to see the community nature of Wikidata in action. The pages grow in richness over time :)

Wednesday, December 27, 2017

Two Papers about Adverse Outcome Pathways

Slice of Fig. 5 of Penny's paper (see main text).
Over the past year our group got involved in two projects where Adverse Outcome Pathways (AOPs) are used. These AOPs are risk assessment tools, but when linked to biological pathways they get a lot more interesting. The key events (KEs) in the AOPs can be linked to bioassays and thus to biological processes. Earlier this year Marvin Martens (follow him on Twitter) started as a PhD candidate to work on EU-ToxRisk and OpenRiskNet on exactly this integration. The first of these two projects was recently described in "Adverse outcome pathways: opportunities, limitations and open questions" (doi:10.1007/s00204-017-2045-3).

But at the end of eNanoMapper we collaborated with Penny Nymark and Roland Grafström on bioassay measurements of biological responses to exposure to nanomaterials. "We" is particularly Freddie and Linda, who worked with Penny to develop an approach to link AOPs with biological pathways, and they worked on this lung fibrosis pathway:

Penny's paper that describes this approach and this pathway was also recently published: "A Data Fusion Pipeline for Generating and Enriching Adverse Outcome Pathway Descriptions" (doi:10.1093/toxsci/kfx252).

BTW, this <iframe> embeds this pathway in the page using the JavaScript library pvjs by Anders Riutta from the Gladstone Institutes. You can click the genes to get identifiers for various databases. On this page you can learn how to use this on your own webpage or blog.

Sunday, December 17, 2017

Winter solstice challenge #2: first submission

Citation data for articles colored by availability of a full text, as indicated in Wikidata. Mind the artifacts. Reproduce with this query and the new GraphBuilder functionality.
Bianca Kramer (of 101 Innovations fame) is the first to submit results for the Winter solstice challenge and it's impressive! She has an overall score based on her own publications and the first level citations of 54%!

So, the current rankings in the challenge are as follows.

Highest Score

  1. Bianca Kramer

Best Tool

  1. Bianca Kramer

I'm sure she is more than happy if you use her tool to calculate your score. If you're patient, you may even wish to take it one level deeper.

What are you talking about??
Well, the original post sheds some light on this, but basically scientific writing has become so dense that a single paper does not provide enough information. And if you cannot read the cited papers, you may not be able to precisely reproduce what the authors did. Now that many countries are steadily heading to 100% #OpenAccess, it is time to start thinking about the next step. So, is the knowledge you built on also readily available, or is that still locked away?

For example, take the figure on the right-hand side: it shows when the articles that I cited in my work were published (a subset, because it is based on data in Wikidata, using the increasing amount of I4OC data). We immediately see some indication of the availability of the cited papers. The more yellow, the more available. However, keep in mind that this is based on "full text availability" information in Wikidata, which is very sparse. That makes Bianca's approach so powerful: it uses (the equally wonderful)

You also note the immediate quality issues. Apparently, this data tells me I am citing articles from the future :) You also see that I am citing some really old articles.

Saturday, December 16, 2017

The National Plan Open Science Estafette: my own first Open Science steps

Note: if you prefer to read Dutch, read the original.
Earlier this year Delft hosted a meeting for Dutch scholars aimed at hearing and learning about, and giving feedback on, the National Plan Open Science (doi:10.2777/061652). I'm very happy I have been able to contribute to this effort, because more and better access to knowledge is very dear to me. During lunch time everyone could demonstrate their own Open Science. From this the idea evolved to have a relay race ("estafette"). In each part of the relay someone will tell their Open Science story. This post is the start: every next runner tells their story about the role Open Science has in their research. And it does not matter if the focus is on Open Data, Open Access, or Open Source, because the diversity in the Dutch Open Science community is just very high.

My Open Science story goes back to the time that I was studying chemistry at what is now called the Radboud Universiteit. Chemistry students could get access to the internet in 1994 and this opened a world of Open knowledge to me! Our library was well stocked, but I still had to visit research departments to read certain journals. It is always uncomfortable as a young student to walk into a coffee room with senior researchers.

I learned HTML and later Java. Java, with its applets, brought the internet to life. It could visualize 3D models of chemical structures. A paper journal cannot do that. Twenty years later journals still don't have this functionality, but that's not the point. In the following three, four years I got introduced to three projects, each Open Source: Jmol (now JSmol), JChemPaint, and the file format "Chemical Markup Language" (CML). The first was to visualize 3D structures on the internet and the second was to visualize 2D chemical diagrams. CML was a format that could store both 2D and 3D coordinates for me. But the problem was that neither Jmol nor JChemPaint could read CML.

But that's where Open Science comes in. After all, I could download the Jmol and JChemPaint source code, change it, and share that with others. That was brilliant! And I dived in. Of course, I could have just used my changes myself, but because I realized it could benefit others too, I sent my changes ("patches") to the authors of Jmol and JChemPaint. Extremely happy and proud I was when the two researchers from Germany and the U.S.A. included those patches in their version!

And in the end it was not in vain. In the final year of my chemistry study I submitted an abstract to an international conference. It got accepted! But now I had to go to Washington (Georgetown, to be precise) to talk about my work. On top of that, we agreed to meet the authors of Jmol and JChemPaint in South Bend, where we laid the foundation of a new Open Science project, the Chemistry Development Kit (CDK). An expensive trip, but fortunately I got a bursary from a Dutch company. A peculiar trip it was. We took an Amtrak sleeper train and had dinner with a soldier who served during D-Day. In New York I stepped off the sidewalk onto the street to evade a group of scary tough guys (who turned out to be a popular boy band), and we stood in the WTC (a year before 9/11) and heard two tourists at the musical ticket desk ask what "Broadway" was.

I am proud that I have been able to contribute to these Open Science projects and that I co-founded the CDK. The Open nature of these projects has had a significant impact and, after twenty years, still does. Sure, it's not the same as discovering a new protein or metabolite, but these projects definitely benefited more than just my own research. Of course, also with a huge thanks to Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, and Peter Murray-Rust.

BTW, thinking about this relay race, Open Science itself is also a relay race: you take the baton from the people before you, make it your own, and pass it on to the next scientist. And every day the baton gets brighter!

This Nationaal Plan Open Science Estafette also continues. I am delighted to pass my baton to Rosanne Hertzberger. Read her story here (in Dutch).

Friday, December 15, 2017

Suggestions for ScholarlyHub

Mock Dashboard of ScholarlyHub.
(I'll assume CC-BY for this image.) 
ScholarlyHub is an open scholar profile project. I have no idea yet where this platform is going, but their planned open source nature makes me want to explore it nevertheless. The project is currently running a crowdfunding campaign and developing its plans. They asked for feedback, so here goes:

Feature requests:
  • researchers care about research, more than profiles: make things from their research ("topics") part of their profile; let them tell everyone what they are interested in
  • the website should have an API (good looks are not enough). Have you done a persona analysis? User friendly is only defined if you have defined your users.
  • make the resource FAIR: use or RDFa
  • show innovation into new scholarly activities: provide peer review functionality, etc (similar to Publons, PubPeer, PubMed Commons, etc)
  • support data and software citations
  • use identifiers (DOI, ORCID, project IDs (CORDIS, etc), etc)
  • integration of I4OC
  • freely provide #altmetrics
  • release soon, release often
  • use RSS for any bit of information on the site (one form of API, in fact)
  • integrate my social feeds into my profile (Twitter, blog, LinkedIn, etc)
You can browse my blog for other features I have recommended to websites in the past. You can also check Scholia for ideas.

New paper: "Integration among databases and data sets to support productive nanotechnology: Challenges and recommendations"

Figure 1 from the NanoImpact article. CC-BY.
The U.S.A. and European nanosafety communities have a longstanding history of collaboration. On both sides there are working groups: NanoWG and WG-F (previously called WG4) of the NanoSafety Cluster. I have been chair of WG4 for about three years and am still active in the group, though in the past half year, without dedicated funding, less so. That is already changing again with the imminent start of the NanoCommons project.

One of these collaborations resulted in a series of papers around data curation (see doi:10.1039/C5NR08944A and doi:10.3762/bjnano.6.189). Part of this effort was also a survey about the state of databases. A good number of databases responded to the call. It turned out to be non-trivial to analyse the results and write up a report around them with recommendations. The first version was submitted and rejected. With fresh leadership, the paper underwent a significant restructuring by John Rumble and was resubmitted to Elsevier's NanoImpact; it is now online (doi:10.1016/j.impact.2017.11.002).

The paper gives an overview of challenges and a recommendation to the community on how to proceed. That is, basically, how should projects like eNanoMapper, caNanoLab, and Nanomaterial Registry evolve, and what might the European Union Observatory for Nanomaterials (EUON) look like? BTW, a similar paper by Tropsha et al. was published the other week with a focus on the USA database ecosystem (doi:10.1038/nnano.2017.233).

Have fun reading it, and if you are working in a related field, please join either of the two aforementioned working groups! And a huge thanks to everyone involved, particularly Sandra, John, and Christine.

Saturday, December 09, 2017

every house a library

Two weeks ago someone made me aware of This new social profile page allows you to digitize a list of your books. That in itself is already very useful. In the past I have made some efforts to create overviews of the books I own. If only to get some count, with respect to insurance, etc. But a very neat feature is the ability to find other people in your neighborhood with whom you may exchange books: for each book you can indicate who can see you own that book (no one, by default; if not mistaken, inheriting the ugw approach of UNIX), and if others can borrow or buy this book.

This shows users with public profiles in The Netherlands, showing the uptake has not been massive yet, but I'm hoping this post changes that :)

But the killer feature I am waiting for is a map like this, but then for books. Worldcat has a feature where it lists the nearest library that has a copy of the book you are looking for:

And such a feature (map or list) would be brilliant, because then every house could potentially be a library. That idea appeals to me.

Oh, and it's Wikidata-based but that should hardly be a surprise.

Sunday, December 03, 2017

The National Plan Open Science Estafette: my first Open Science steps

Note: For my English-only followers, I translated this to English.
Earlier this year a meeting for Dutch researchers was organized in Delft to hear about and give feedback on the Nationaal Plan Open Science (doi:10.2777/061652). I count myself lucky that I was able and allowed to contribute to this, because more access to scientific knowledge is close to my heart. During lunch people could show their Open Science. From this flowed the idea to set up a relay race ("estafette") to show as much as possible how much Open Science is actually already being done in the Netherlands. This is the start: every next relay runner tells something about their Open Science research. And whether the focus is on Open Data, Open Access, or Open Source does not matter that much. After all, the diversity in the Dutch Open Science community is simply large.

My Open Science story goes back to the time when I was a chemistry student at what is now called the Radboud Universiteit. We got access to the internet there in 1994, and this opened a world of Open knowledge to me! Our library was well stocked, but sometimes I had to go to departments to look at certain journals there. It is always uncomfortable to walk into a coffee room with senior researchers as a student.

I learned HTML and later Java. Java, with its applets, brought the internet to life. It could show 3D models of chemical structures. A journal cannot do that. Twenty years later journals still cannot, but that aside. In the three, four years after that I got to know three projects, each Open Source: Jmol (now JSmol), JChemPaint, and the file format "Chemical Markup Language" (CML). The first was for showing 3D structures on the internet and the second visualized two-dimensional (2D) chemical structures. CML was a format in which I could store both 2D and 3D coordinates. But the thing was that Jmol and JChemPaint could not read CML at all.

But that is where Open Science came in. After all, I could download the source code of Jmol and JChemPaint, modify it, and share it with others. I was convinced! And I got to work. Of course, I could have just used my modifications myself, but because I thought they might be useful for others too, I sent my modifications ("patches") to the authors of Jmol and JChemPaint. Overjoyed and proud I was when the scientists in Germany and the United States included them in their versions!

And it has served me well. In the final year of my studies I submitted an abstract to an international conference. It was accepted, so I had to go to Washington (Georgetown, to be precise) to talk about my work. On top of that, I had arranged to meet the authors of Jmol and JChemPaint in South Bend, where we laid the foundation for a new Open Science project, the Chemistry Development Kit (CDK). Expensive, but fortunately I got a bursary from a Dutch company. A wondrous trip it was. Having dinner in the sleeper train with a soldier who saw action during D-Day, stepping off the sidewalk in New York because some tough guys were approaching (who turned out to be a well-known boy band), and standing in the WTC (a year before 9/11) and hearing tourists at the musical ticket desk ask "What is broadway?".

I am proud that I have been able to contribute to these Open Science projects and that I am a co-designer of the CDK. Thanks to their Open nature these projects have had a considerable impact and, after twenty years, still do. Sure, it is not the function of a protein or metabolite, but these projects have certainly helped more than just my own research. With thanks to others, of course: Hens Borkent, Dick Wife, Dan Gezelter, Christoph Steinbeck, and Peter Murray-Rust.

By the way, speaking of relay races, Open Science is itself a relay race too: you take over the baton from the people before you, give it your own spin, and then pass the baton on to the next person. And the baton gets more beautiful every day!

This Nationaal Plan Open Science Estafette also continues. I get to pass my baton to Rosanne Hertzberger. I am very curious about her Open Science story! And, of course, about those of all the runners who take part in the relay after her!