
Thursday, October 11, 2018

Two presentations at WikiPathways 2018 Summit #WP18Summit

Found my way back to my room a few kilometers from the San Francisco city center, after the third day at the WikiPathways 2018 Summit at the Gladstone Institutes in Mission Bay, celebrating 10 years of the project, which I only joined some six and a half years ago.

The Summit was awesome and the whole trip was awesome. The flight was long, with a stop in Seattle. I always get a bit nervous about layovers (having missed my plane twice before...), but a stop in Seattle is interesting, with a great view of Mt. Rainier, which is quite a sight from an airplane too. Alex picked us up from the airport and the Airbnb (HT to Annie for being a great host) is great; we can even see the Golden Gate Bridge from it.

The Sunday was surreal. At some 27 degrees Celsius, visiting the beach and standing in the Pacific for the first time was a great choice. I had the great pleasure to meet Dario and his family and played volleyball on a beach for the first time in some 28 years. Apparently, there was an airshow nearby and several of the shows were visible from where we were, including a very long one by the Blue Angels.
Thanks for a great afternoon!

Sunday evening Adam hosted us for a WikiPathways team dinner. His place has a great view of San Francisco, the Bay Bridge, etc. Because Chris was paying attention, we actually got to see the SpaceX rocket launch (no, my photo is not so impressive :). Well, I cannot express in words how cool that is, to see a rocket escape Earth's gravity with your own eyes.

And the Summit had not even started yet.

I will have quite a lot to write up about the meeting itself. There was a great line-up of speakers, great workshops, awesome discussions, and a high density of very knowledgeable people. I think we need 5M to implement just the ideas that came up in the past three days. And it would be well invested. Anyway, more about that later. Make sure to keep an eye on the GitHub repo for WikiPathways.

That leaves me, right now, only to return to the title of this post. So, below they are, my two contributions to this summit:





Saturday, September 29, 2018

Two presentations of last week: NanoTox 2018 and the BeNeLux Metabolomics Days

Slide from the BeNeLux Metabolomics Days
presentation (see below).
The other week I gave two presentations, one at the BeNeLux Metabolomics Days in Rotterdam and, the next day, one at NanoTox 2018 in Neuss, Germany. In Rotterdam I spoke about ongoing research in our research group, and in Neuss about the eNanoMapper project and some of the ongoing eNanoMapper projects I am involved in.

Here are the slides of both talks.




Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just a bit of cleaning I scripted today for a number of toxicology endpoints in a database published some time ago in the zero-APC Open Access (CC-BY) Beilstein Journal of Nanotechnology: NanoE-Tox (doi:10.3762/bjnano.6.183).

The curation I am doing aims at redistributing the data in the eNanoMapper database (see doi:10.3762/bjnano.6.165), and thus with ontology annotations (see doi:10.1186/s13326-015-0005-5). The recognized toxicity endpoints:

  recognizedToxicities = [
    "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
    "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
    "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
    "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
    "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
    "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
    "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
    "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
    "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
    "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
    "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
    "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
  ]  

That leaves 402(!) variants. Many do not have an ontology term yet, and I filed a feature request.

Units:

  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg", 
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g", 
    "µg/g": "µg/g"
  ]
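
Just to make the intent of those two maps explicit, here is a minimal lookup sketch (this is not the actual curation script; the endpoint and unit strings are only examples of what appears in the spreadsheet):

  // minimal sketch, assuming the two maps above are in scope
  def endpoint = "EC50"       // example endpoint label found in the data
  def unit     = "mg Ag/L"    // example unit string found in the data

  def endpointType   = recognizedToxicities[endpoint]  // ontology IRI (BAO/ENM)
  def normalizedUnit = recognizedUnits[unit]           // harmonized unit, here "mg/L"

  if (endpointType == null || normalizedUnit == null)
    println "Not recognized yet: ${endpoint} / ${unit}"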

Oh, and don't get me started on the actual values, with endpoint values given as ranges, with errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole thing really hard to process. I now have something like:

  prop = prop.replace(",", ".")          // harmonize the decimal separator
  if (prop.substring(1).contains("-")) {
    // a range (e.g. "1.2-3.4"; skipping a possible leading minus):
    // store the full string, annotated with the STATO term
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("±")) {
    // a value with an error (e.g. "1.2±0.3"): also stored as a string
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("<")) {
    // "less than" values are not handled yet
  } else {
    // a plain numerical value: stored as a typed double, with its unit
    rdf.addTypedDataProperty(
      store, endpointIRI, "${ssoNS}has-value", prop,
      "${xsdNS}double"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  }

But let me make clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change the format), and redistribute it. Thanks to the authors! Data curation is expensive, whether I do it or the authors of the database do. They already did a lot of data curation. But it is only because of Open licenses that we have to do this work only once.

Saturday, September 15, 2018

Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers in Wikidata. Let's say I want to list all Greek letters, with the lower case letter in one column and the upper case letter in the other. This is what our data looks like:
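
Roughly, as a graph-pattern sketch (using the p:, ps: and pq: shortcut prefixes of the Wikidata RDF model; the letter item is left as a variable here and the character value is just an example), each letter has a 'Unicode character' statement with an 'applies to part' qualifier:

?letter wdt:P361 wd:Q8216 ;        # 'part of' 'Greek alphabet'
        p:P487   ?statement .      # 'Unicode character', as a full statement
?statement ps:P487 "α" ;           # the character itself
           pq:P518 wd:Q8185162 .   # qualifier 'applies to part': lower case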


So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}

Ah, that gets us somewhere:



But you see that the upper and lower case are still in separate rows, rather than in columns. To fix that, we need access to those qualifiers. It is all there in the Wikidata RDF, but the model gives people a headache (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to it than just keeping notebooks, though).

Qualifiers
So, a lot of important information is put in qualifiers, and not just the statements. Let's first get all statements for a Greek letter. We would do that with:

?letter ?pprop ?statement .

One thing we want to know about the predicate we are matching here is which property entity it belongs to. We do that by adding this bit:

?property wikibase:claim ?pprop .

Of course, the property we are interested in is the Unicode character (P487), so we can put that in directly:

wd:P487 wikibase:claim ?pprop .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want just any qualifier, but specifically 'applies to part' (P518), we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in the lower case and upper case characters, and we can put that in as well (here for the upper case, wd:Q98912):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, if we want, say, the lower case character (wd:Q8185162), we now get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?pprop .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see in the above example that we still get the Unicode character from the direct (wdt:) triple, rather than from the statement we just matched. This needs to be integrated, and we need wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?pprop ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .  
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q98912: upper case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the ?letter and ?unicode variables. But I left the wdt:P361 wd:Q8216 (= 'part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?pprop ;
                     wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .  
  ?qualifierProperty wikibase:qualifier ?qualifier .
}

So, there is our recipe, for everyone to copy/paste.

Completing the Greek alphabet example
OK, now, since I actually started with the upper and lower case Unicode characters for the Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?pprop ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?pprop2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix: some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires a GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?pprop ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?pprop2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed, and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, the only thing still missing is a human-readable name. The frequently used SERVICE wikibase:label does a pretty decent job, and we end up with this table:
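
For completeness, a sketch of that final query with the label service added (the language list "[AUTO_LANGUAGE],en" is just a common choice; the service fills ?entityOfInterestLabel automatically, and the label variable has to be added to the GROUP BY):

SELECT DISTINCT ?entityOfInterest ?entityOfInterestLabel
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?pprop ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?pprop2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .
    wd:P518 wikibase:qualifier ?qualifier2 .
  }

  # the label service provides ?entityOfInterestLabel
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
} GROUP BY ?entityOfInterest ?entityOfInterestLabel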


Sunday, September 09, 2018

cOAlition S with their Plan S

Screenshot of the Scholia page for Plan S.
Last Monday the bomb dropped: eleven European funders (with likely more to follow) indicated that they are not going to support journals that are not fully Open Access, i.e. fully or partially paywalled journals: cOAlition S announced Plan S.

There is a lot of news about this and a lot of discussion: many agree that it is at least an interesting step. Some have argued that the plan advocates a commercial publishing model and that it accepts glossy journals. Indeed, it does not specifically address those points, but other news also suggests that Plan S is not the only step funders are undertaking: for example, the Dutch NWO is also putting serious effort into fighting the use of the flawed impact factor.

One thing I particularly like about Plan S is that it counters the joint efforts of our universities (via the VSNU) with their big package deals, which currently favor hybrid journals over full Open Access journals. That is, I can publish my cheminformatics for free under a Creative Commons license in the hybrid JCIM, while I do not get similar funding for the full Open Access JCheminform.

Another aspect I like is that it insists on the three core values of Open Science, the rights to:

  1. reuse,
  2. modify, and
  3. share.
I cannot stress this enough. Only with these core values can we build on earlier knowledge and earlier research. It is worth reading all ten principles.

Keeping up to date
We will see a lot of analyses of what will happen now. Things will have to unfold further. We will see other funders join, and we have seen that some funders did not join yet because they were unsure whether they could make the timeline (like the Swedish VR). There are a few ways to stay updated. First, you can use the RSS feeds of Scholia for both Plan S and cOAlition S (see the above screenshot). But this is mostly material with a DOI, not general news. Second, you could follow the oa.plan_s and oa.coalitions tags of the Open Access Tracking Project.

Bianca Kramer has used public data to make an initial assessment of the impact of the plan (full data):
Screenshot of the graph showing the contributions of license types to the literature for ten of the eleven
research funders. Plan S ensures that these bars become 100% pure gold (yellow).
It was noted (cannot find the tweet right now...) that the amount of literature based on funding from cOAlition S is only 1-2% of all European output. That's not a lot, but keep in mind: 1. more funders will join, 2. a 1-2% extra pressure will make shareholders think, 3. Plan S stops favoring hybrid journals over Open Access journals, and 4. relative to their current volume, the extra submissions to full Open Access journals will be a significant increase.