
Monday, December 31, 2018

Wikidata-Taxonomy: class and instance hierarchies on the command line

For some time I had a 2017 Tweet from Dan Brickley on my todo list (I use Todoist), and now that it is the holidays, I finally had time to play with Wikidata-Taxonomy. Here it is in action for a class of five phytocassanes:

$ node wdtaxonomy.js  Q60224961 -i


Give it a try.

Friday, December 28, 2018

Replacing BibTeX with Citation.js

As part of replacing LaTeX with Markdown for my Groovy Cheminformatics book (now Open Access), I also needed to replace BibTeX. Fortunately, Citation.js supports Wikidata, and the solution by Lars was simpler than I had hoped. Similar to LaTeX, I have citations annotated in the Markdown, but the reference code does not refer to a BibTeX file entry but to Wikidata (see also Wikidata-powered citation lists with citation.js).

The setup is as follows:
  1. extract the Wikidata Q-codes (which creates references.qids)
  2. use Citation.js to format each reference as plain text
  3. number the citations and create the bibliography
The first step uses a Groovy script, and the second a very short JavaScript script:

// requires Node.js with the citation-js package installed
const fs = require('fs')
const Cite = require('citation-js')

fs.readFile('references.qids', 'utf8',
            async function (err, file) {
  // resolve each Wikidata Q-code and format it as a Vancouver-style reference
  const data = Array.from(await Cite.async(file)).map(
    item => item.id + '=' + Cite(item).format(
      'bibliography', {template: 'vancouver'}
    )
  )
  // write the "Qxxxx=formatted reference" lines to references.dat
  fs.writeFile('references.dat', data.join(''),
    function() {}
  )
})
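
For completeness, the first step (extracting the Q-codes from the Markdown sources) could look roughly like the minimal Groovy sketch below. This is not the actual script from the book build; it assumes the sources live in a content/ folder and that citations appear as bare Wikidata Q identifiers, which may not match the book's exact markup:

// Hypothetical sketch of step 1: collect all Wikidata Q identifiers cited in the
// Markdown sources and write them to references.qids (one identifier per line).
def qids = [] as TreeSet
new File("content").eachFileRecurse { file ->
  if (!file.name.endsWith(".md")) return
  qids.addAll(file.text.findAll(/Q\d+/))
}
new File("references.qids").text = qids.join("\n")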

The result of the JavaScript formatting step looks like:



I still have some things left to do, like adding the DOI and some Markdown formatting. The toolkit allows for that, but it is not urgent.

Thursday, December 27, 2018

Creating nanopublications with Groovy

Compound found in Taphrorychus bicolor
(doi:10.1002/JLAC.199619961005).
Published in Liebigs Annalen, see
this post about the history of that journal.
Yesterday I struggled some with creating nanopublications with Groovy. My first attempt was an utter failure, but then I discovered Thomas Kuhn's NanopubCreator and it was downhill from there.

There are two good things about this. First, I now have a code base that I can easily repurpose to make trusty nanopublications (doi:10.1007/978-3-319-07443-6_63) about anything structured as a table (so can you).
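
To give an impression of what that code roughly looks like, here is a minimal Groovy sketch of creating one such nanopublication with the nanopub-java library. This is not the actual script: the @Grab version and the exact NanopubCreator calls are written from memory and may need adjusting, and the IRIs are made up for illustration.

@Grab('org.nanopub:nanopub:1.16')   // version is a guess; adjust as needed
import org.nanopub.NanopubCreator
import org.nanopub.NanopubUtils
import org.nanopub.trusty.MakeTrustyNanopub
import org.eclipse.rdf4j.model.impl.SimpleValueFactory
import org.eclipse.rdf4j.rio.RDFFormat

def vf = SimpleValueFactory.getInstance()
def creator = new NanopubCreator(vf.createIRI("http://example.org/np1"))

// assertion: this metabolite was found in this species (example IRIs, not real ones)
creator.addAssertionStatement(
  vf.createIRI("http://example.org/compound/1"),
  vf.createIRI("http://example.org/occursIn"),
  vf.createIRI("http://example.org/taxon/Taphrorychus_bicolor")
)
// provenance: where the assertion was derived from
creator.addProvenanceStatement(
  vf.createIRI("http://www.w3.org/ns/prov#wasDerivedFrom"),
  vf.createIRI("https://doi.org/10.1002/JLAC.199619961005")
)
// publication info: creation date
creator.addTimestampNow()

// turn it into a trusty nanopublication and print it as TriG (method names from memory)
def nanopub = MakeTrustyNanopub.transform(creator.finalizeNanopub())
println NanopubUtils.writeToString(nanopub, RDFFormat.TRIG)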

Second, I now have almost 1200 CCZero nanopublications that tell you in which species a certain metabolite has been found, sourced from Wikidata using their SPARQL endpoint. This collection is a bit boring at this moment: most of them are human metabolites, where the source is either Recon 2.2 or WikiPathways. But I expect (hope) more DOIs to show up. Think We challenge you to reuse Additional Files.

Finally, you are probably interested in learning what one of the created nanopublications looks like, so I put a Gist online:


Wednesday, December 26, 2018

Groovy Cheminformatics rises from the ashes

Cover of the last print
version of the book.
Like a phoenix (Phenix aegyptus), my Groovy Cheminformatics rises from the ashes. About a year ago I blogged that I could no longer maintain my book, at least not in print form. The hardest part was actually resizing the cover each time the book got thicker. I actually started the book about 10 years ago, but the wish to make it Open Access grew bigger with the years.

So, here we go. It's based on CDK 2.0, but somewhere in the coming weeks I'll migrate to the latest version. It will take some weeks to migrate all content, and you can leave your chapter priority requests here.

The making of...
Over the past months I have been playing with some ideas on how to make the transition. I wanted to preserve the core concept of the book: all code is compiled and executed with each release, and all script output is autogenerated (including many of the diagrams). I wanted to publish the next iteration of the book as Markdown, but also pondered the idea of still being able to generate a PDF with LaTeX. That means I have a lot of stuff to upgrade.

I ended up somewhere in between. Its source is Markdown, but not entirely: it is source that looks like Markdown with snippets of XML. This makes sure the source looks formatted when viewed on GitHub:
But you can see that this is not processed yet. The CreateAtom1 and CreateAtom2 refer to code examples, and the above screenshot shows the source of a source code inclusion (for CreateAtom1 and CreateAtom2) and an output inclusion (for CreateAtom2). After processing, the actual page looks like this:


That looks pretty close to what the print book had. An extra here is that you can click the link to the code (hard in a print book). That is something I improved along the way; it leads to a new Markdown page that shows the full sources and the output (should I add the @Grab instructions, or is that too obvious?):


If you check the first online version (🎶 On the first day of xmas, #openscience got from me ... 🎶), you will see that I still have quite some content to migrate. First, back to doing the reference sections properly, as if I were still working with BibLaTeX.

Happy holidays!

Saturday, December 22, 2018

About Frontiers

Frontiers is getting a lot of critique at this moment: the very low rejection rates (only ~10%), reviewers who seemingly cannot reject articles, the use of the impact factor (sad), and the almost pyramid-like way they recruit editors and reviewers (focused on continuous growth of the literature, which we must not want) are all questionable to me, and, perhaps most important, so is their lobbying around Plan S. Also, they are just expensive and I see little real publishing innovation.

For Marvin Martens' paper we received fair-quality reviews. But with the above points in mind, I want to comment in retrospect. For this paper we had a reviewer who withdrew; while they provided feedback, we could not directly reply to this reviewer, and we had to direct our replies and updates based on that review to the editor instead.

But I note that the "major + withdrew + minor" we received could just as well have been (my personal interpretation based on the reviewers' comments) a "major + reject + minor". The third review was based on our revision, in which we took into account the reviews of both the major and the reject reviewer. For me as editor, a "major + reject" often results in a "back to the drawing board" decision. For this paper we were lucky: the reject was mostly about the excellent note by the reviewer that our article was wrongly submitted as a review article, which we corrected (it should have been "Hypothesis and Theory" for a positioning paper).

I'll carefully monitor where Frontiers is going, but their prominent use of the impact factor and their intention to keep increasing the volume of journal article literature are alone reason enough for me not to quickly consider them again. We have a second paper under review with Frontiers, but I will have a moratorium on Frontiers until further notice.

BTW, if you like to see journals publish their rejection rates, please RT this tweet:

New paper: "Introducing WikiPathways as a Data-Source to Support Adverse Outcome Pathways for Regulatory Risk Assessment of Chemicals and Nanomaterials"

An adverse outcome pathway (AOP) links
molecular initiating events (MIEs) via key
events (KEs) to the adverse outcome (AO).
Each event is a biological process, and it should
be possible to link them to normal biological
pathways (PWs). Figure from the paper.
Marvin Martens published his vision on the integration of adverse outcome pathways with biological pathways (doi:10.3389/fgene.2018.00661). Specifically, he looked into our options to link the AOPWiki with WikiPathways, taking input from various people around the world (see the list of co-authors). The paper looks into how links can be made, calculates some statistics for genes mentioned in AOPs and biological pathways, and examines which molecular initiating events are found in biological pathways (see the figure on the right).

The paper started out as a positioning paper, but I was happy to see that Marvin could not resist getting some actual data and including that as well; the code is available from GitHub and archived on Zenodo (doi:10.5281/ZENODO.1306408). The next step is to formalize this integration, and the first bits of data are being produced; they look very exciting!

BTW, if you like where this is going, also make sure to read this paper by Dr. Penny Nymark (A Data Fusion Pipeline for Generating and Enriching Adverse Outcome Pathway Descriptions).

Monday, December 17, 2018

From the "Annalen der Pharmacie" to the "European Journal of Organic Chemistry"

2D structure of caffeine, also
known as theine.
One of my hobbies is the history of chemistry. It has a practical use for my current research, as a lot of knowledge about human metabolites is actually quite ancient. One thing I have trouble understanding is that in a time when Facebook knows you better than your spouse does, we have trouble finding relevant literature without expensive expert databases that are not generally available.

Hell, even the article that established that some metabolite is actually a human metabolite cannot be found within a reasonable time (less than a minute).

This is one of the reasons I started working on Scholia, and the chemistry corner of it specifically. See this ICCS conference poster. The poster outlines some of the reasons why I like it, but one is this link between chemical structures and literature, here for caffeine:


You can see the problem with our chemical knowledge here (in Wikidata): before 1950 it's pretty blank. Hence my question on Twitter about which journal to look at. A few suggestions came back, and I decided to focus on the journal that is now called the European Journal of Organic Chemistry but that started in 1832 as the Annalen der Pharmacie. I remember the EurJOC being launched by the KNCV and many other European chemistry societies.

BTW, note here that all these chemistry societies decided it was better to team up with a commercial publisher than to continue publishing it themselves. #Plan_S

Anyway, the full history is not complete, but the route from Annalen to EurJOC now is (each journal name has a different color):


That took me an hour or two, because CrossRef lists the EurJOC journal name for all articles. Technically perhaps correct, but metadata-wise the above is much better. Thanks to whoever actually created Wikidata items for each journal and linked them with follows and followed by.

In doing so, you quickly run into many more metadata issues. The best one I found was a paper by Crasts and Friedel, known for the Friedel-Crafts reaction :) Other gems are researcher names like Erlenmeyer-Heidelberg and Demselben and Von Demselben.

Back to caffeine, the active chemical in coffee that many of us must have in the morning: it is actually the same compound as theine, so tea drinkers also get their dose of caffeine. We all know that. What I did not know, but discovered while doing this work, is that it was already established back then that :caffeine owl:sameAs :theine (doi:10.1002/jlac.18380250106). Cool!

Saturday, November 17, 2018

Join me in encouraging the ACS to join the Initiative for Open Citations

My research is into the abstract representation of chemical information, which is important for other research to be performed. Indeed, my work is generally reused, but knowing which research fields my work is used in, or which societal problems it is helping solve, is not easily retrieved or determined. Efforts like WikiCite and Scholia do allow me to navigate the citation network, so that I can determine which research fields my output influences and which diseases are studied with methods I proposed. Here's a network of topics of articles citing my work:


Graphs like this show information on how people are using my work, which in turn allows me to further support that reuse. But this relies on open citations.

In my opinion, citations are an essential part of our research process. They give us access to important prior work on which a study is based, and reflect how a work influences other research or is even essential to that other work. For example, they allow us not to repeat earlier published work, while preserving the ability to reproduce the full work. The Initiative for Open Citations encourages these citations to be made publicly available to benefit research, by removing barriers to access this critical part of scholarly communication. While many societies and publishers have joined this initiative, the American Chemical Society (ACS) has not yet. By not joining, they limit the sharing of knowledge for unclear reasons.

I would really like to see the ACS join this initiative, and have proposed this a few times already. Because they still have not joined, I have started this petition. If you agree, please sign and share it with others.

New paper: "Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform"

Figure from the article showing the interactive
Open PHACTS documentation to access
interactions.
Ryan, a PhD candidate in our group, is studying how to represent and use interaction information in pathway databases, and in WikiPathways specifically. His paper Explicit interaction information from WikiPathways in RDF facilitates drug discovery in the Open PHACTS Discovery Platform (doi:10.12688/f1000research.13197.2) was recently accepted in F1000Research; it extends work started by, among others, Andra (see doi:10.1371/journal.pcbi.1004989).

The paper describes the application programming interface (API) methods of the Open PHACTS REST API for accessing interaction information, e.g. to learn which genes are upstream or downstream in a pathway. This information can be used in pharmacological research. The paper discusses example queries and demonstrates how the API methods can be called from HTML+JavaScript and Python.

Sunday, November 04, 2018

Programming in the Life Sciences #23: research output for the future

A random public domain
picture with 10 in it.
Ensuring that you and others can understand your research output five years from now requires effort. This is why scholars tend to keep lab notebooks. The computational age has perhaps made us a bit lazy here, but we still make an effort. A series of Ten Simple Rules articles outlines some of the things to think about:
  1. Goodman A, Pepe A, Blocker AW, Borgman CL, Cranmer K, Crosas M, et al. Ten Simple Rules for the Care and Feeding of Scientific Data. Bourne PE, editor. PLoS Computational Biology. 2014 Apr 24;10(4):e1003542.
  2. List M, Ebert P, Albrecht F. Ten Simple Rules for Developing Usable Software in Computational Biology. Markel S, editor. PLOS Computational Biology. 2017 Jan 5;13(1):e1005265.
  3. Perez-Riverol Y, Gatto L, Wang R, Sachsenberg T, Uszkoreit J, Leprevost F da V, et al. Ten Simple Rules for Taking Advantage of Git and GitHub. Markel S, editor. PLOS Computational Biology. 2016 Jul 14;12(7):e1004947.
  4. Prlić A, Procter JB. Ten Simple Rules for the Open Development of Scientific Software. PLoS Computational Biology. 2012 Dec 6;8(12):e1002802.
  5. Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten Simple Rules for Reproducible Computational Research. Bourne PE, editor. PLoS Computational Biology. 2013 Oct 24;9(10):e1003285.
Regarding licensing, I can highly recommend reading this book:
  1. Rosen L. Open Source Licensing [Internet]. 2004. Available from: https://www.rosenlaw.com/oslbook.htm
Regarding Git, I recommend these two resources:
  1. Wiegley J. Git From the Bottom Up [Internet]. 2017. Available from: https://jwiegley.github.io/git-from-the-bottom-up/
  2. Task 1: How to set up a repository on GitHub [Internet]. 2018. Available from: https://github.com/OpenScienceMOOC/Module-5-Open-Research-Software-and-Open-Source/blob/master/content_development/Task_1.md

Saturday, November 03, 2018

Fwd: "We challenge you to reuse Additional Files (a.k.a. Supplementary Information)"

Download statistics of J. Cheminform.
Additional Files show a clear growth.
Posted on the BMC (formerly BioMedCentral) Research in progress blog is our challenge to you to reuse additional files:
    Since our open-access portfolio in BMC and SpringerOpen started collaborating with Figshare, Additional Files and Supplementary Information have been deposited in journal-specific Figshare repositories, and files available for the Journal of Cheminformatics alone have been viewed more than ten thousand times. Yet what is the best way to make the most of this data and reuse the files? Journal of Cheminformatics challenges you to think about just that with their new upcoming special issue.
We already know you are downloading the data frequently and more every year, so let us know what you're doing with that data!

For example, I would love to see more data from these additional files end up in databases, such as Wikidata, but any reuse in RDF form would interest me.

Tuesday, October 30, 2018

Some steps needed in knowledge dissemination

Last week I had a holiday from my BiGCaT position and visited Samuel Winthrop at the SpringerNature offices to discuss our Journal of Cheminformatics. It was a great meeting (1.5 days), and we discussed a lot of things we could do (or are in the process of doing) to improve the journal format for knowledge dissemination. We have some interesting things lined up ... </suspense>

For now, check out my personal views in these slides I presented last week:


Thursday, October 11, 2018

Two presentations at WikiPathways 2018 Summit #WP18Summit

Found my way back to my room a few kilometers from the San Francisco city center, after a third day at the WikiPathways 2018 Summit at the Gladstone Institutes in Mission Bay, celebrating 10 years of the project, which I only joined some six and a half years ago.

The Summit was awesome, and so was the whole trip. The flight was long, with a stop in Seattle. I always get a bit nervous about layovers (having missed my plane twice before...), but a stop in Seattle is interesting, with a great view of Mt. Rainier, which is quite a sight, also from an airplane. Alex picked us up from the airport, and the Airbnb is great (HT to Annie for being a great host); from it we can even see the Golden Gate Bridge.

The Sunday was surreal. At some 27 degrees Celsius, the choice to visit the beach and stand, for the first time, in the Pacific was a great one. I had the great pleasure of meeting Dario and his family, and I played volleyball at a beach for the first time in some 28 years. Apparently, there was an airshow nearby, and several shows were visible from our spot, including a very long show by the Blue Angels.
Thanks for a great afternoon!

Sunday evening Adam hosted us for a WikiPathways team dinner. His place gave a great view of San Francisco, the Bay Bridge, etc. Because Chris was paying attention, we actually got to see the SpaceX rocket launch (no, my photo is not so impressive :). Well, I cannot express in words how cool that is, to see a rocket escape the Earth's gravity with your own eyes.

And the Summit had not even started yet.

I will have quite a lot to write up about the meeting itself. It was a great line up of speakers, great workshops, awesome discussions, and a high density of very knowledgeable people. I think we need 5M to implement just the ideas that came up in the past three days. And it would be well invested. Anyway, more about that later. Make sure to keep an eye on the GitHub repo for WikiPathways.

That leaves me only, right now, to return to the title of this post. And below they are, my two contributions to this Summit:





Saturday, September 29, 2018

Two presentations of last week: NanoTox 2018 and the BeNeLuX Metabolomics Days

Slide from the BeNeLux Metabolomics Days
presentation (see below).
The other week I gave two presentations: one at the BeNeLux Metabolomics Days in Rotterdam and, the next day, one at NanoTox 2018 in Neuss, Germany. During the first I spoke about ongoing research in our research group, and in Neuss about the eNanoMapper project and some of the ongoing eNanoMapper-related projects I am involved in.

Here are the slides of both talks.




Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just a bit of cleaning I scripted today for a number of toxicology endpoints in a database published some time ago in the zero-APC Open Access (CC-BY) Beilstein Journal of Nanotechnology: NanoE-Tox (doi:10.3762/bjnano.6.183).

The curation I am doing is aimed at redistributing the data in the eNanoMapper database (see doi:10.3762/bjnano.6.165), and thus with ontology annotations (see doi:10.1186/s13326-015-0005-5):

  recognizedToxicities = [
    "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
    "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
    "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
    "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
    "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
    "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
    "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
    "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
    "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
    "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
    "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
    "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
  ]  

That leaves 402(!) variants. Many do not have an ontology term yet, and I filed a feature request.

Units:

  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg", 
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g", 
    "µg/g": "µg/g"
  ]
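
To give an idea of how these two maps are applied, here is a minimal Groovy sketch building on the maps above. The file name, separator, and column positions are assumptions for illustration only; the real script reads the actual supplementary spreadsheet and differs in the details:

  // Hypothetical sketch: normalize the reported endpoint and unit of each row using
  // the two maps above, and collect whatever is not recognized yet.
  def unrecognizedEndpoints = [] as Set
  def unrecognizedUnits = [] as Set
  new File("NanoE-Tox.tsv").splitEachLine("\t") { fields ->
    def endpoint = fields[3]?.trim()   // column with e.g. "EC50" (position assumed)
    def unit     = fields[5]?.trim()   // column with e.g. "mg Ag/L" (position assumed)
    def endpointIRI = recognizedToxicities[endpoint]
    def normalizedUnit = recognizedUnits[unit]
    if (endpointIRI == null) unrecognizedEndpoints << endpoint
    if (normalizedUnit == null) unrecognizedUnits << unit
    if (endpointIRI != null && normalizedUnit != null) {
      // here the RDF for this endpoint would be emitted (see the snippet below)
    }
  }
  println "Unrecognized endpoint variants: ${unrecognizedEndpoints.size()}"
  println "Unrecognized unit variants: ${unrecognizedUnits.size()}"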

Oh, and don't get me started on the actual values: endpoint values given as ranges, with errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole really hard to process. I now have something like:

  prop = prop.replace(",", ".")
  if (prop.substring(1).contains("-")) {
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("±")) {
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("<")) {
  } else {
    rdf.addTypedDataProperty(
      store, endpointIRI, "${ssoNS}has-value", prop,
      "${xsdNS}double"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  }

But let me make this clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change the format of), and redistribute it. Thanks to the authors. Data curation is expensive, whether I do it or the authors of the database do. They already did a lot of data curation. But only because of Open licenses do we have to do this only once.

Saturday, September 15, 2018

Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers in Wikidata. Let's say I want to list all Greek letters, with the lower case letter in one column and the upper case letter in the other. This is what our data looks like:


So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}

Ah, that gets us somewhere:



But you see that the upper and lower case are still in separate rows, rather than in columns. To fix that, we need access to those qualifiers. It's all in there in the Wikidata RDF, but the model is giving people a headache (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to that than just keeping notebooks, tho).

Qualifiers
So, a lot of important information is put in qualifiers, and not just in the statements themselves. Let's first get all statements for a Greek letter. We would do that with:

?letter ?pprop ?statement .

One thing we want to know about the predicate we're looking at is which property entity it belongs to. We do that by adding this bit:

?property wikibase:claim ?propp .

Of course, the property we are interested in is the Unicode character, so we can put that directly in:

wd:P487 wikibase:claim ?propp .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want any qualifier but the applies to part, we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in lower case and upper case, and we can put that in as well (for upper case):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, putting this together (here for the lower case letters), we get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see that in the above query the Unicode character still comes from the direct (truthy) property and not from the statement. This needs to be integrated, and we need wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .  
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q98912: upper case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the ?letter and ?unicode variables, but I left the wdt:P361 wd:Q8216 (='part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .  
  ?qualifierProperty wikibase:qualifier ?qualifier .
}

So, there is our recipe, for everyone to copy/paste.

Completing the Greek alphabet example
OK, now since I actually started with the upper and lower case Unicode characters for Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix: some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires a GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed, and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, the one thing still missing is a human-readable name. The frequently used SERVICE wikibase:label does a pretty decent job, and we end up with this table:


Sunday, September 09, 2018

cOAlition S with their Plan S

Screenshot of the Scholia page for Plan S.
Last Monday the bomb dropped: eleven European funders (with likely more to follow) indicated that they are not going to support journals that are not fully Open Access, i.e. fully or partially paywalled journals: cOAlition S announced Plan S.

There is a lot of news about this and a lot of discussion: many agree that it is at least an interesting step. Some have argued that the plan advocates a commercial publishing model and that it accepts glossy journals. Indeed, it does not specifically address those points, but other news also suggests that Plan S is not the only step funders are undertaking: for example, the Dutch NWO is also putting serious effort into fighting the use of the flawed impact factor.

One thing I particularly like about this Plan S is that it counters the joint efforts of our universities (via the VSNU) with their big package deals that currently favor hybrid journals over full Open Access journals. That is, I can publish my cheminformatics work under a Creative Commons license for free in the hybrid JCIM, whereas I do not get similar funding for the full Open Access JCheminform.

Another aspect I like is that it insists on the three core values of Open Science, the rights to:

  1. reuse,
  2. modify, and
  3. share.
I cannot stress this enough. Only with these core values can we build on earlier knowledge and earlier research. It is worth reading all ten principles.

Keeping up to date
We will see a lot of analyses of what will happen now. Things will have to further unfold. We will see other funders join, and we have seen that some funders did not join yet because they were unsure whether they could make the timeline (like the Swedish VR). There are a few ways to keep updated. First, you can use the RSS feed of Scholia for both Plan S and cOAlition S (see the above screenshot). But this is mostly material with a DOI and not general news. Second, you could follow the oa.plan_s and oa.coalitions tags of the Open Access Tracking Project.

Bianca Kramer has used public data to make an initial assessment of the impact of the plan (full data):
Screenshot of the graph showing the contribution of license types to the literature for ten of the eleven
research funders. Plan S ensures that these bars become 100% pure gold (yellow).
It was noted (I cannot find the tweet right now...) that the amount of literature based on funding from cOAlition S is only 1-2% of all European output. That's not a lot, but keep in mind: 1. more funders will join, 2. a 1-2% extra pressure will make shareholders think, 3. Plan S stops favoring hybrid journals over Open Access journals, and 4. the percentage of extra submissions to full Open Access journals will be significantly higher compared to their current count.

Saturday, September 08, 2018

Also new this week: "Google Dataset Search"

There was a lot of Open Science news this week, and the announcement of Google Dataset Search was one of the highlights:


Of course, I first tried searching for "RDF chemistry", which shows some of my data sets (and a lot more):


It picks up data from many sources, such as Figshare in this image. That means it also works (well, sort of, as Noel O'Boyle noticed) for supplementary information from the Journal of Cheminformatics.

It picks up metadata in several ways, among which schema.org. So, next week we'll see if we can get eNanoMapper extended to spit out compatible JSON-LD for its data sets, called "bundles".
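
To make concrete what such compatible JSON-LD could look like, here is a minimal sketch of a schema.org Dataset record, built here as a Groovy map and serialized with JsonOutput. All names and values are illustrative and not actual eNanoMapper metadata:

import groovy.json.JsonOutput

// a minimal schema.org "Dataset" description of the kind Google Dataset Search
// indexes; the values are made up for illustration
def dataset = [
  "@context"   : "https://schema.org/",
  "@type"      : "Dataset",
  "name"       : "Example eNanoMapper bundle",
  "description": "Nanomaterial characterization and toxicity data (illustrative).",
  "license"    : "https://creativecommons.org/licenses/by/4.0/",
  "creator"    : [ "@type": "Organization", "name": "eNanoMapper" ]
]
println JsonOutput.prettyPrint(JsonOutput.toJson(dataset))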

Integrated with Google Scholar?
While the URL for the search engine does not suggest the service is more than a 20% project, we can hope it will stay around like Google Scholar has. But I do hope they will further integrate it with Scholar. For example, in the above figure it did pick up that I am the author of that data set (well, repurposed from an effort by Rich Apodaca), but it did not figure out that I am also on Scholar.

So, these data sets do not show up in your Google Scholar profile yet, but they must. Time will tell where this data search engine is going. There are many interesting features, and given the amount of online attention, they won't stop development just yet, and I expect to discover more and better features in the next months. Give it a spin!

Mastodon: somewhere between Twitter and FriendFeed

Now I forgot who told me about it (sorry!), but I started looking at Mastodon last week. Mastodon is like Twitter or WhatsApp, but distributed and federated. And that has advantages: no vendor lock-in, and servers for specific communities. In that sense, it's more like email, with no single point of failure, but potentially many points of failure.

But Mastodon is also very well done. In the past week I set up two accounts, one on a general server and one on a community server (more about that in a second). I am still learning, but want to share some observations.

First, the platform is not unique and there are other (maybe better) distributed and federated software solutions, but Mastodon is slick. This is what my mastodon.social profile page looks like, but you can change the theme if you like. So far, pretty standard:

My @egonw@mastodon.social profile page.
Multiple accounts
While I am still exploring this bit, you can have multiple accounts. I am not entirely sure yet how to link them up, but currently they follow each other. My second account is on a server aimed at scholars and the stuff that scholars talk about. This distributed feature is advertised as follows: sometimes you want to talk science, sometimes you want to talk movies. The latter I would do on my mastodon.social account, and the science on my scholar.social account.

However, what is essential is that you can follow anyone on any server: you do not have to be on scholar.social to follow my toots there. (Of course, you can also simply check the profile page, and you can read my public toots without any Mastodon account.)

This topic-server idea is really exciting. It provides an alternative to mailing lists or Slack rooms. Each server decides on its own community norms, and any account can be blocked for violating those community norms. No dependency on Twitter or Facebook to decide what is right or wrong; the community can do that themselves.

BioMedCentral could host one server for each journal... now that's an idea :)

Controlling what you see
OK, let's look at a single toot, here about a recent JACS paper (doi:10.1021/jacs.8b03913):


Each toot (like a tweet) has replies, boosts (like retweets), and favorites (like likes). Now, I currently follow this anonymous account. You can do the normal things, like follow, send direct messages, mute, and block accounts:


You can follow toots from just the people you follow, but also all toots on that particular server (which makes sense if you have a server about a very specific topic), or toots on all federated servers (unwise).

The intention of Mastodon is to give the users a lot of control. You should not expect non-linear timelines or promoted toots. If that is your thing, better stay on Twitter. An example of the level of control is what options it offers me for my "Notifications" timeline:


Other stuff I like
Some random things that I noticed: there is more room for detail, as you have 500 characters; URLs are not shortened (each URL counts as 20 characters); and animated GIFs only animate when I hover over them. Cool, no need to pause them! You cannot edit toots, but at least I found a "Delete and redraft" option. Mastodon has solutions for hiding sensitive material, which causes part of the toot to be hidden by default. This can be used to hide content that may upset people, like medical images of intestines :) CW is short for Content Warning and is used for that.

There is a lot more, but I'm running out of time for writing this blog post. Check out the useful An Increasingly Less-Brief Guide to Mastodon.

So, who to follow?
Well, I found two options. One is to use Wikidata, where you can search for authors with one or more Mastodon accounts. For example, try this query to find accounts for Journal of Cheminformatics authors:

Yes, that list is not very long yet:


But given the design and implementation of Mastodon, this could change quickly.

FriendFeed??
Well, some of you are old enough to remember FriendFeed. The default interface is different, but if I open a single toot on a separate page, it does remind me a lot of FriendFeed, and I am wondering if Mastodon can be that FriendFeed replacement we have long waited for! What do you think?