Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just a bit of cleaning I scripted today for a number of toxicology end points in a database published some time ago in the zero-APC Open Access (CC-BY) journal Beilstein Journal of Nanotechnology: NanoE-Tox (doi:10.3762/bjnano.6.183).

I am curating the data so that it can be redistributed in the eNanoMapper database (see doi:10.3762/bjnano.6.165), and thus with ontology annotation (see doi:10.1186/s13326-015-0005-5):

  recognizedToxicities = [
    "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
    "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
    "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
    "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
    "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
    "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
    "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
    "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
    "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
    "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
    "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
    "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
  ]  

With 402(!) variants left. Many do not have an ontology term yet, and I filed a feature request.

Units:

  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg", 
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g", 
    "µg/g": "µg/g"
  ]
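
A table like this boils down to a plain lookup. As a minimal sketch of the idea (in Python rather than the Groovy used in this post; the function name `normalize_unit`, the reduced table, and the whitespace stripping are assumptions for illustration only):

```python
# Illustrative subset of the unit table above; the full table has more entries.
RECOGNIZED_UNITS = {
    "mg/L": "mg/L",
    "mg Ag/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "mg/kg of dry feed": "mg/kg",
    "µg/g dry weight sediment": "µg/g",
}


def normalize_unit(raw):
    """Return the normalized unit, or None when the raw string is not recognized."""
    # Stripping surrounding whitespace is an assumption, not part of the table.
    return RECOGNIZED_UNITS.get(raw.strip())


if __name__ == "__main__":
    for raw in ["mg Ag/L", " mg/kg of dry feed ", "ppm"]:
        print(repr(raw), "->", normalize_unit(raw))
```

Returning None for anything not in the table makes it easy to list the variants that still need a decision, instead of silently passing them through.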

Oh, and don't get me started on the actual values, with endpoint values given as ranges, with errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole thing really hard to process. I now have something like:

  // normalize decimal commas to decimal points
  prop = prop.replace(",", ".")
  if (prop.substring(1).contains("-")) {
    // a range like "1.0-2.5"; skipping the first character keeps
    // a leading minus sign from being mistaken for a range dash
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("±")) {
    // a value with an error, kept as a string like the range
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("<")) {
    // upper bounds like "<5" are not converted yet
  } else {
    // a single numeric value
    rdf.addTypedDataProperty(
      store, endpointIRI, "${ssoNS}has-value", prop,
      "${xsdNS}double"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  }
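
The branching itself can be tried out in isolation. Below is a minimal Python sketch of the same decision logic, without any RDF output; the function name `classify_value` and the labels "range", "error", "bound", and "number" are made up for this illustration and are not part of the Groovy script:

```python
def classify_value(raw):
    """Classify a raw endpoint value string, mirroring the branches above.

    Returns a (kind, value) tuple, where kind is one of
    "range", "error", "bound", or "number".
    """
    prop = raw.replace(",", ".")  # decimal comma -> decimal point
    if "-" in prop[1:]:
        # A dash after the first character marks a range such as "0.5-1.2";
        # skipping the first character keeps a negative number like "-3" out.
        return ("range", prop)
    if "±" in prop:
        return ("error", prop)
    if "<" in prop:
        return ("bound", prop)
    # Anything else is expected to be a plain number; float() raises
    # ValueError when it is not.
    return ("number", float(prop))


if __name__ == "__main__":
    for raw in ["0,5-1,2", "12 ± 3", "<5", "1,5", "-3"]:
        print(repr(raw), "->", classify_value(raw))
```

One caveat that follows from this logic: a value in scientific notation such as 1e-5 has a dash after the first character, so it would land in the range branch too.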

But let me make clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change format), and redistribute it. Thanks to the authors. Data curation is expensive, whether I do it or the authors of the database do. They already did a lot of data curation. But only because of Open licenses, we only have to do this once.

Saturday, September 15, 2018

Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers from Wikidata. Let's say I want to list all Greek letters, with the lower case letter in one column and the upper case letter in the other. This is what our data looks like:


So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}

Ah, that gets us somewhere:



But you see that the upper and lower case are still in separate rows, rather than columns. To fix that, we need access to those qualifiers. It's all in there in the Wikidata RDF, but the model is giving people a headache (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to that than just keeping notebooks, tho).

Qualifiers
So, a lot of important information is put in qualifiers, and not just the statements. Let's first get all statements for a Greek letter. We would do that with:

?letter ?pprop ?statement .

One thing we want to know about the property we're looking at is the entity linked to it. We do that by adding this bit:

?property wikibase:claim ?pprop .

Of course, the property we are interested in is the Unicode character, so we can put that directly in:

wd:P487 wikibase:claim ?pprop .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want just any qualifier, but only the 'applies to part' one, we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in lower case and upper case, and we can put that in as well (for upper case):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, putting this together for the lower case letters (Q8185162), we now get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?pprop .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see in the above example that we get the Unicode character separately from the statement. This needs to be integrated, and we need the wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?pprop ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .  
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q8185162: lower case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the variables ?letter and ?unicode. But I left the wdt:P361 wd:Q8216 (='part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?pprop ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .  
  ?qualifierProperty wikibase:qualifier ?qualifier .
}
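
If you prefer to fill in the three parameters from a script, the template can be wrapped in a small function. This is a minimal Python sketch; the function name `build_query`, its parameter names, and the identifier check are assumptions for illustration, and the query text is adapted from the template above:

```python
import re

# Wikidata identifiers: P<number> for properties, Q<number> for items.
_ID = re.compile(r"^[PQ][1-9][0-9]*$")

# Doubled braces are literal braces for str.format().
TEMPLATE = """SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {{
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject {{ wd:{qualifier_object} }}
  VALUES ?qualifierProperty {{ wd:{qualifier_property} }}
  VALUES ?statementProperty {{ wd:{statement_property} }}

  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?pprop ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .
  ?qualifierProperty wikibase:qualifier ?qualifier .
}}"""


def build_query(statement_property, qualifier_property, qualifier_object):
    """Fill the three template parameters and return the SPARQL text."""
    for value in (statement_property, qualifier_property, qualifier_object):
        if not _ID.match(value):
            raise ValueError(f"not a Wikidata identifier: {value!r}")
    return TEMPLATE.format(
        statement_property=statement_property,
        qualifier_property=qualifier_property,
        qualifier_object=qualifier_object,
    )


if __name__ == "__main__":
    # Unicode character (P487), applies to part (P518), lower case (Q8185162)
    print(build_query("P487", "P518", "Q8185162"))
```

The printed text can be pasted into the Wikidata Query Service as is. Checking the identifiers up front keeps arbitrary text from ending up inside the query.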

So, there is our recipe, for everyone to copy/paste.

Completing the Greek alphabet example
OK, now since I actually started with the upper and lower case Unicode characters for Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?pprop ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?pprop2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix. Some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires a GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?pprop ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?pprop2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest
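
To see what the GROUP BY and GROUP_CONCAT(DISTINCT ...) combination does to the rows, here is a minimal Python sketch that applies the same collapsing to a few hand-written rows. The function name `group_concat` and the sample rows are made up for illustration; they are not output of the query above:

```python
from collections import defaultdict


def group_concat(rows, separator=", "):
    """Collapse (entity, lower, upper) rows into one row per entity.

    Mimics GROUP BY on the entity with GROUP_CONCAT(DISTINCT ...):
    distinct values are kept in first-seen order and joined with `separator`.
    """
    lowers = defaultdict(list)
    uppers = defaultdict(list)
    for entity, lower, upper in rows:
        if lower not in lowers[entity]:
            lowers[entity].append(lower)
        if upper not in uppers[entity]:
            uppers[entity].append(upper)
    return {
        entity: (separator.join(lowers[entity]), separator.join(uppers[entity]))
        for entity in lowers
    }


if __name__ == "__main__":
    # Hand-written example rows: one letter with two upper case characters.
    rows = [
        ("theta", "θ", "Θ"),
        ("theta", "θ", "ϴ"),
        ("sigma", "σ", "Σ"),
    ]
    for entity, (lower, upper) in group_concat(rows).items():
        print(entity, lower, upper)
```

Unlike this sketch, SPARQL does not promise any particular order of the values inside a GROUP_CONCAT result.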

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, we only have a human readable name. The frequently used SERVICE wikibase:label does a pretty decent job and we end up with this table:


Sunday, September 09, 2018

cOAlition S with their Plan S

Screenshot of the Scholia page for Plan S.
Last Monday the bomb dropped: eleven European funders (with likely more to follow) indicate that they are not going to support journals that are not fully Open Access, i.e. fully or partially paywalled journals: cOAlition S announced Plan S.

There is a lot of news about this and a lot of discussion: many agree that it is at least an interesting step. Some have argued that the plan advocates a commercial publishing model and that it accepts glossy journals. Indeed, it does not specifically address those points, but other news also suggests that Plan S is not the only step funders are undertaking: for example, the Dutch NWO is also putting serious effort into fighting the use of the flawed impact factor.

One thing I particularly like about this Plan S is that it counters joint efforts from our universities (via the VSNU) with their big package deals that currently favor hybrid journals over full Open Access journals. That is, I can publish my cheminformatics under a Creative Commons license for free in the hybrid JCIM, whereas I do not get similar funding for the full Open Access JCheminform.

Another aspect I like is that it insists on the three core values of Open Science, the rights to:

  1. reuse,
  2. modify, and
  3. share.
I cannot stress this enough. Only with these core values can we build on earlier knowledge and earlier research. It is worth reading all ten principles.

Keeping up to date
We will see a lot of analyses of what will happen now. Things will have to further unfold. We will see that other funders will join, and we have seen that some funders did not join yet, because they were unsure if they could make the timeline (like the Swedish VR). There are a few ways to keep updated. First, you can use the RSS feed of Scholia for both the Plan S and the cOAlition S (see the above screenshot). But this is mostly material with a DOI and not general news. Second, you could follow the oa.plan_s and oa.coalitions tags of the Open Access Tracking Project.

Bianca Kramer has used public data to make an initial assessment of the impact of the plan (full data):
Screenshot of the graph showing contributions of license types to literature for ten of the eleven research funders. Plan S ensures that these bars become 100% pure gold (yellow).
It was noted (cannot find the tweet, right now...) that the amount of literature based on funding from cOAlition S is only 1-2% of all European output. That's not a lot, but keep in mind: 1. more funders will join, 2. a 1-2% extra pressure will make shareholders think, 3. the Plan S stops favoring hybrid journals over Open Access journals, and 4. the percentage extra submissions to full Open Access journals will be significantly higher compared to their current count.

Saturday, September 08, 2018

Also new this week: "Google Dataset Search"

There was a lot of Open Science news this week. The announcement of the Google Dataset Search was one of them:


Of course, I first tried searching for "RDF chemistry", which shows some of my data sets (and a lot more):


It picks up data from many sources, such as Figshare in this image. That means it also works (well, sort of, as Noel O'Boyle noticed) for supplementary information from the Journal of Cheminformatics.

It picks up metadata in several ways, among which schema.org. So, next week we'll see if we can get eNanoMapper extended to spit out compatible JSON-LD for its data sets, called "bundles".
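
To give an idea of what such JSON-LD looks like, here is a minimal Python sketch that builds a schema.org Dataset description. The function name `dataset_jsonld` and the example values (including the example.org URL) are hypothetical; this is not eNanoMapper code, and a real bundle would likely need more properties:

```python
import json


def dataset_jsonld(name, description, url, license_url):
    """Build a minimal schema.org Dataset description as a JSON-LD string."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }
    return json.dumps(doc, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    print(dataset_jsonld(
        name="Example bundle",
        description="An example data set description.",
        url="https://example.org/bundle/1",
        license_url="https://creativecommons.org/licenses/by/4.0/",
    ))
```

On a web page, the resulting JSON would typically be embedded in a script element of type application/ld+json, so that crawlers can pick it up.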

Integrated with Google Scholar?
While the URL for the search engine does not suggest the service is more than a 20% project, we can hope it will stay around like Google Scholar has. But I do hope they will further integrate it with Scholar. For example, in the above figure, while it did pick up that I am the author of that data set (well, repurposed from an effort of Rich Apodaca), it did not figure out that I am also on Scholar.

So, these data sets do not show up in your Google Scholar profile yet, but they must. Time will tell where this data search engine is going. There are many interesting features, and given the amount of online attention, they won't stop development just yet, and I expect to discover more and better features in the next months. Give it a spin!

Mastodon: somewhere between Twitter and FriendFeed

Now I forgot who told me about it (sorry!), but I started looking at Mastodon last week. Mastodon is like Twitter or WhatsApp, but then distributed and federated. And that has advantages: no vendor lock-in, servers for specific communities. In that sense, it's more like email, with not one single point of failure, but potentially many points of failure.

But Mastodon is also very well done. In the past week I set up two accounts, one on a general server and one on a community server (more about that in a second). I am still learning, but want to share some observations.

First, the platform is not unique and there are other (maybe better) distributed and federated software solutions, but Mastodon is slick. This is what my mastodon.social profile page looks like, but you can change the theme if you like. So far, pretty standard:

My @egonw@mastodon.social profile page.
Multiple accounts
While I am still exploring this bit, you can have multiple accounts. I am not entirely sure yet how to link them up, but currently they follow each other. My second account is on a server aimed at scholars and stuff that scholars talk about. This distributed feature is advertised as follows: sometimes you want to talk science, sometimes you want to talk movies. The latter I would do on my mastodon.social account and the science on my scholar.social account.

However, essential is that you can follow anyone on any server: you do not have to be on scholar.social to follow my toots there. (Of course, you can also simply check the profile page, and you can read my public toots without any Mastodon account.)

This topic server is a really exciting idea. It provides an alternative to mailing lists or Slack rooms. And each server decides on its own community norms, and any account can be blocked for violating those community norms. No dependency on Twitter or Facebook to decide what is right or wrong; the community can do that itself.

BioMedCentral could host one server for each journal... now that's an idea :)

Controlling what you see
OK, let's look at a single toot, here about a recent JACS paper (doi:10.1021/jacs.8b03913):


Each toot (like tweet) has replies, boosts (like retweets), and favorites (like likes). Now, I currently follow this anonymous account. You can do the normal things, like follow, send direct messages, mute, and block accounts:


You can follow toots from just the people you follow, but also follow all toots on that particular server (which makes sense if you have a server about a very specific topic), or toots on all federated servers (unwise).

The intention of Mastodon is to give the users a lot of control. You should not expect non-linear timelines or promoted toots. If that is your thing, better stay on Twitter. An example of the level of control is what options it offers me for my "Notifications" timeline:


Other stuff I like
Some random things that I noticed: there is more room for detail and you have 500 chars, URLs are not shortened (each URL counts as 20 chars), animated GIFs are animated when I hover over them. Cool, no need to pause them! You cannot edit toots, but at least I found a "Delete and redraft" option. Mastodon has solutions for hiding sensitive material. That causes part of the toot to be hidden by default. This can be used to hide content that may upset people, like medical images of intestines :) The CW is short for Content Warning and is used for that.

There is a lot more, but I'm running out of time for writing this blog post. Check out the useful "An Increasingly Less-Brief Guide to Mastodon".

So, who to follow?
Well, I found two options. One is to use Wikidata, where you can search for authors with one or more Mastodon accounts. For example, try this query to find accounts for Journal of Cheminformatics authors:

Yes, that list is not very long yet:


But given the design and implementation of Mastodon, this could change quickly.

FriendFeed??
Well, some of you are old enough to remember FriendFeed. The default interface is different, but if I open a single toot in a separate page, it does remind me a lot of FriendFeed, and I am wondering if Mastodon can be that FriendFeed replacement we have long waited for! What do you think?

Saturday, September 01, 2018

Biological stories: give me a reason to make time for interactive biological pathways

I am overbooked. There are a lot of things I want to do that (I believe) will make science better. But I don't have the time, nor enough funding to hire enough people to help me out (funders: hint, hint). Some things are not linked to project deliverables, and are things I have to do in my free time. And I don't have much of that left, really, but hacking on scientific knowledge relaxes me and gives me energy.

Interactive biological stories
Inspired by Proteopedia (something you must check out, if you do not already know it), Jacob Windsor did an internship to bring this idea to WikiPathways: check his Google Summer of Code project. I want to actually put this to use, but, as outlined above, since it is not part of paid deliverables, it's hard to find time for it.

Thus, I need a reason. And one reason could be: impact. After all, if my work has impact on the scientific community, that helps me keep doing my research. So, when eLife tweeted an interesting study on digoxin, I realized that if enough people would be interested in such a high profile story, an interactive pathway may have enough impact for me to free up time. Hence, I asked on Twitter:


But as you can see, there are just not enough RTs yet. So, because I really want something fun to do, please do give me no excuse to not do this (but write grant proposals instead), and retweet this tweet. Thanks!