
Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just some bit of cleaning I scripted today for a number of toxicology endpoints in a database published some time ago in the zero-APC Open Access (CC-BY) journal Beilstein Journal of Nanotechnology: NanoE-Tox (doi:10.3762/bjnano.6.183).

The curation I am doing is aimed at redistributing the data in the eNanoMapper database (see doi:10.3762/bjnano.6.165) and thus annotating it with ontology terms (see doi:10.1186/s13326-015-0005-5):

  recognizedToxicities = [
    "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
    "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
    "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
    "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
    "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
    "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
    "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
    "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
    "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
    "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
    "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
    "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
  ]  

That still leaves 402(!) unrecognized variants. Many do not have an ontology term yet, and I filed a feature request.

Units:

  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg", 
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g", 
    "µg/g": "µg/g"
  ]
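
To give an idea of how these two lookup maps are used, here is a minimal sketch with assumed variable names (the actual script does more bookkeeping): each raw string from the spreadsheet is checked against the maps, and anything unrecognized is reported for later curation.

  // minimal sketch; rawEndpoint and rawUnit would come from the spreadsheet columns
  rawEndpoint = "EC50"
  rawUnit = "mg Zn/L"

  if (recognizedToxicities.containsKey(rawEndpoint)) {
    endpointType = recognizedToxicities.get(rawEndpoint) // ontology IRI for this endpoint
  } else {
    println "Unrecognized endpoint: ${rawEndpoint}" // one of the 402 remaining variants
  }

  if (recognizedUnits.containsKey(rawUnit)) {
    units = recognizedUnits.get(rawUnit) // normalized unit, here "mg/L"
  } else {
    println "Unrecognized unit: ${rawUnit}"
  }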

Oh, and don't get me started on the actual values: endpoint values given as ranges, with errors, etc. That variety is not the problem, but the lack of FAIR-ness makes the whole really hard to process. I now have something like:

  prop = prop.replace(",", ".")
  if (prop.substring(1).contains("-")) {
    // a range (e.g. "0.5-1.2"); substring(1) skips a possible leading minus sign
    // keep the full string, typed with the STATO term
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("±")) {
    // a value with an error margin (e.g. "0.8 ± 0.1"): also kept as a string for now
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("<")) {
    // censored values (e.g. "<0.1") are not handled yet
  } else {
    // a plain number: store it as an xsd:double with its unit
    rdf.addTypedDataProperty(
      store, endpointIRI, "${ssoNS}has-value", prop,
      "${xsdNS}double"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  }

But let me make this clear: I can actually do this work, adding more data to the eNanoMapper database (with Nina), only because the authors of this database made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change the format), and redistribute it. Thanks to the authors! Data curation is expensive, whether I do it or the original authors do. They already did a lot of data curation. But only because of Open licenses do we have to do this work just once.

Saturday, September 15, 2018

Wikidata Query Service recipe: qualifiers and the Greek alphabet

Just because I need to look this up each time myself, I wrote up this quick recipe for how to get information from statement qualifiers in Wikidata. Let's say I want to list all Greek letters, with the lower case letter in one column and the upper case letter in the other. This is what our data looks like:


So, let's start with a simple query that lists all letters in the Greek alphabet:

SELECT ?letter WHERE {
  ?letter wdt:P361 wd:Q8216 .
}

Of course, that only gives me the Wikidata entries, and not the Unicode characters we are after. So, let's add that Unicode character property:

SELECT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
}

Ah, that gets us somewhere:



But you see that the upper and lower case are still in separate rows, rather than in columns. To fix that, we need access to those qualifiers. It's all there in the Wikidata RDF, but the model gives people headaches (so do many things, like math, but that does not mean we should stop doing it!). It all comes down to keeping notebooks, writing down your tricks, etc. It's called the scientific method (there is more to that than just keeping notebooks, tho).

Qualifiers
So, a lot of important information is put in qualifiers, and not just in the statements themselves. Let's first get all statements for a Greek letter. We would do that with:

?letter ?pprop ?statement .

One thing we want to know about the predicate we're looking at is which property entity it corresponds to. We do that by adding this bit:

?property wikibase:claim ?propp .

Of course, the property we are interested in is the Unicode character, so we can put that in directly:

wd:P487 wikibase:claim ?propp .

Next, the qualifiers for the statement. We want them all:

?statement ?qualifier ?qualifierVal .
?qualifierProp wikibase:qualifier ?qualifier .

And because we do not want just any qualifier but specifically the 'applies to part' one, we can put that in too:

?statement ?qualifier ?qualifierVal .
wd:P518 wikibase:qualifier ?qualifier .

Furthermore, we are only interested in lower case and upper case, and we can put that in as well (for upper case):

?statement ?qualifier wd:Q98912 .
wd:P518 wikibase:qualifier ?qualifier .

So, if we want both upper and lower case, we now get this full query:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 ;
          wdt:P487 ?unicode .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp .
  ?statement ?qualifier wd:Q8185162 .
  wd:P518 wikibase:qualifier ?qualifier .
}

We are not done yet, because you can see in the above example that we still get the Unicode character separately from the statement. This needs to be integrated, and we need wikibase:statementProperty for that:

wd:P487 wikibase:statementProperty ?statementProp .
?statement ?statementProp ?unicode .

If we integrate that, we get this query, which is indeed getting complex:

SELECT DISTINCT ?letter ?unicode WHERE {
  ?letter wdt:P361 wd:Q8216 .
  ?letter ?pprop ?statement .
  wd:P487 wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier wd:Q8185162 ;
             ?statementProp ?unicode .  
  wd:P518 wikibase:qualifier ?qualifier .
}

But basically we have our template here, with three parameters:
  1. the property of the statement (here P487: Unicode character)
  2. the property of the qualifier (here P518: applies to part)
  3. the object value of the qualifier (here Q98912: upper case)
If we use the SPARQL VALUES approach, we get the following template. Notice that I renamed the ?letter and ?unicode variables. But I left the wdt:P361 wd:Q8216 (= 'part of' 'Greek alphabet') in, so that this query does not time out:

SELECT DISTINCT ?entityOfInterest ?statementDataValue WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 . # 'part of' 'Greek alphabet'
  VALUES ?qualifierObject { wd:Q8185162 }
  VALUES ?qualifierProperty { wd:P518 }
  VALUES ?statementProperty { wd:P487 }

  # template
  ?entityOfInterest ?pprop ?statement .
  ?statementProperty wikibase:claim ?propp ;
          wikibase:statementProperty ?statementProp .
  ?statement ?qualifier ?qualifierObject ;
             ?statementProp ?statementDataValue .  
  ?qualifierProperty wikibase:qualifier ?qualifier .
}

So, there is our recipe, for everyone to copy/paste.

Completing the Greek alphabet example
OK, now since I actually started with the upper and lower case Unicode character for Greek letters, let's finish that query too. Since we need both, we need to use the template twice:

SELECT DISTINCT ?entityOfInterest ?lowerCase ?upperCase WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
}

Still one issue left to fix. Some Greek letters have more than one upper case Unicode character. We need to concatenate those. That requires a GROUP BY and the GROUP_CONCAT function, and we get this query:

SELECT DISTINCT ?entityOfInterest
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .  
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
               ?statementProp2 ?upperCase .  
    wd:P518 wikibase:qualifier ?qualifier2 .
  }
} GROUP BY ?entityOfInterest

Now, since most of my blog posts are not just fun, but typically also have a use case, allow me to shed light on the context. Since you are still reading, you're officially part of the secret society of brave followers of my blog. Tweet to my egonwillighagen account a message consisting of a series of letters followed by two numbers (no spaces) and another series of letters, where the two numbers indicate the number of letters at the start and the end, for example, abc32yz or adasgfshjdg111x, and I will add you to my secret list of brave followers (and I will like the tweet; if you disguise the string to suggest it has some meaning, I will also retweet it). Only that string is allowed and don't tell anyone what it is about, or I will remove you from the list again :) Anyway, my ambition is to make a Wikidata-based BINAS replacement.

So, the only thing we still miss is a human readable name. The frequently used SERVICE wikibase:label does a pretty decent job; adding it to the grouped query gives something like the query below, and we end up with this table:
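
A sketch of that final query (close to, but not necessarily exactly, the one behind the table):

SELECT DISTINCT ?entityOfInterest ?entityOfInterestLabel
  (GROUP_CONCAT(DISTINCT ?lowerCase; separator=", ") AS ?lowerCases)
  (GROUP_CONCAT(DISTINCT ?upperCase; separator=", ") AS ?upperCases)
WHERE {
  ?entityOfInterest wdt:P361 wd:Q8216 .

  { # lower case
    ?entityOfInterest ?pprop ?statement .
    wd:P487 wikibase:claim ?propp ;
            wikibase:statementProperty ?statementProp .
    ?statement ?qualifier wd:Q8185162 ;
               ?statementProp ?lowerCase .
    wd:P518 wikibase:qualifier ?qualifier .
  }

  { # upper case
    ?entityOfInterest ?pprop2 ?statement2 .
    wd:P487 wikibase:claim ?propp2 ;
            wikibase:statementProperty ?statementProp2 .
    ?statement2 ?qualifier2 wd:Q98912 ;
                ?statementProp2 ?upperCase .
    wd:P518 wikibase:qualifier ?qualifier2 .
  }

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
} GROUP BY ?entityOfInterest ?entityOfInterestLabel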


Sunday, September 09, 2018

cOAlition S with their Plan S

Screenshot of the Scholia page for Plan S.
Last Monday the bomb dropped: eleven European funders (with likely more to follow) indicated that they are not going to support journals that are not fully Open Access, i.e. fully or partially paywalled journals: cOAlition S announced Plan S.

There is a lot of news about this and a lot of discussion: many agree that it is at least an interesting step. Some have argued that the plan advocates a commercial publishing model and that it accepts glossy journals. Indeed, it does not specifically address those points, but other news also suggests that Plan S is not the only step funders are undertaking: for example, the Dutch NWO is also putting serious effort into fighting the use of the flawed impact factor.

One thing I particularly like about Plan S is that it counters the joint efforts of our universities (via the VSNU) with their big package deals that currently favor hybrid journals over full Open Access journals. That is, I can publish my cheminformatics under a Creative Commons license for free in the hybrid JCIM, whereas I do not get similar funding for the full Open Access JCheminform.

Another aspect I like is that it insists on the three core values of Open Science, the rights to:

  1. reuse,
  2. modify, and
  3. share.
I cannot stress this enough. Only with these core values can we build on earlier knowledge and earlier research. It is worth reading all ten principles.

Keeping up to date
We will see a lot of analyses of what will happen now. Things will have to further unfold. We will see other funders join, and we have seen that some funders did not join yet because they were unsure they could make the timeline (like the Swedish VR). There are a few ways to stay updated. First, you can use the RSS feed of Scholia for both Plan S and cOAlition S (see the above screenshot). But this is mostly material with a DOI and not general news. Second, you could follow the oa.plan_s and oa.coalitions tags of the Open Access Tracking Project.

Bianca Kramer has used public data to make an initial assessment of the impact of the plan (full data):
Screenshot of the graph showing the contribution of license types to the literature for ten of the eleven
research funders. Plan S ensures that these bars become 100% pure gold (yellow).
It was noted (cannot find the tweet right now...) that the amount of literature based on funding from cOAlition S is only 1-2% of all European output. That's not a lot, but keep in mind: 1. more funders will join, 2. a 1-2% extra pressure will make shareholders think, 3. Plan S stops favoring hybrid journals over Open Access journals, and 4. the percentage of extra submissions to full Open Access journals will be significantly higher compared to their current count.

Saturday, September 08, 2018

Also new this week: "Google Dataset Search"

There was a lot of Open Science news this week. The announcement of the Google Dataset Search was one of them:


 Of course, I first tried searching for "RDF chemistry" which shows some of my data sets (and a lot more):


It picks up data from many sources, such as Figshare in this image. That means it also works (well, sort of, as Noel O'Boyle noticed) for supplementary information from the Journal of Cheminformatics.

It picks up metadata in several ways, among which schema.org. So, next week we'll see if we can get eNanoMapper extended to spit out compatible JSON-LD for its data sets, called "bundles".
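
For reference, schema.org Dataset markup in JSON-LD looks roughly like this (a hand-written sketch with made-up names and URLs, not actual eNanoMapper output):

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Example eNanoMapper bundle",
  "description": "Nanomaterial characterization and toxicity data.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "creator": { "@type": "Organization", "name": "eNanoMapper" },
  "distribution": {
    "@type": "DataDownload",
    "encodingFormat": "application/ld+json",
    "contentUrl": "https://example.org/bundle/1.jsonld"
  }
}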

Integrated with Google Scholar?
While the URL for the search engine does not suggest the service is more than a 20% project, we can hope it will stay around like Google Scholar has. But I do hope they will further integrate it with Scholar. For example, in the above figure, while it did pick up that I am the author of that data set (well, repurposed from an effort by Rich Apodaca), it did not figure out that I am also on Scholar.

So, these data sets do not show up in your Google Scholar profile yet, but they should. Time will tell where this data search engine is going. There are many interesting features, and given the amount of online attention, they won't stop development just yet, and I expect to discover more and better features in the coming months. Give it a spin!

Mastodon: somewhere between Twitter and FriendFeed

Now I forgot who told me about it (sorry!), but I started looking at Mastodon last week. Mastodon is like Twitter or Whatsapp, but then distributed and federated. And that has advantages: no vendor lock-in, and servers for specific communities. In that sense, it's more like email, with no single point of failure, but potentially many points of failure.

But Mastodon is also very well done. In the past week I set up two accounts, one on a general server and one on a community server (more about that in a second). I am still learning, but want to share some observations.

First, the platform is not unique and there are other (maybe better) distributed and federated software solutions, but Mastodon is slick. This is what my mastodon.social profile page looks like, but you can change the theme if you like. So far, pretty standard:

My @egonw@mastodon.social profile page.
Multiple accounts
While I am still exploring this bit, you can have multiple accounts. I am not entirely sure yet how to link them up, but currently they follow each other. My second account is on a server aimed at scholars and the stuff that scholars talk about. This distributed feature is advertised as follows: sometimes you want to talk science, sometimes you want to talk movies. The latter I would do on my mastodon.social account and the science on my scholar.social account.

However, the essential bit is that you can follow anyone on any server: you do not have to be on scholar.social to follow my toots there. (Of course, you can also simply check the profile page, and you can read my public toots without any Mastodon account.)

This topic-server idea is really exciting. It provides an alternative to mailing lists or Slack rooms. And each server decides on its own community norms, and any account can be blocked for violating those community norms. No dependency on Twitter or Facebook to decide what is right or wrong; the community can do that itself.

BioMedCentral could host one server for each journal... now that's an idea :)

Controlling what you see
OK, let's look at a single toot, here about a recent JACS paper (doi:10.1021/jacs.8b03913):


Each toot (like tweet) has replies, boosts (like retweets), and favorites (like likes). Now, I currently follow this anonymous account. You can do the normal things, like follow, send direct messages, mute, and block accounts:


You can follow toots from just the people you follow, but also all toots on that particular server (which makes sense if you have a server about a very specific topic), or toots on all federated servers (unwise).

The intention of Mastodon is to give the users a lot of control. You should not expect non-linear timelines or promoted toots. If that is your thing, better stay on Twitter. An example of the level of control is what options it offers me for my "Notifications" timeline:


Other stuff I like
Some random things that I noticed: there is more room for detail, as you have 500 characters; URLs are not shortened (each URL counts as 20 characters); animated GIFs only animate when I hover over them. Cool, no need to pause them! You cannot edit toots, but at least I found a "Delete and redraft" option. Mastodon has solutions for hiding sensitive material, which cause part of the toot to be hidden by default. This can be used to hide content that may upset people, like medical images of intestines :) CW is short for Content Warning and is used for exactly that.

There is a lot more, but I'm running out of time for writing this blog post. Check out this useful An Increasingly Less-Brief Guide to Mastodon.

So, who to follow?
Well, I found two options. One is to use Wikidata, where you can search for authors with one (or more) Mastodon accounts. For example, try a query along these lines to find accounts for Journal of Cheminformatics authors:
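
A sketch of such a query; I am assuming here that P4033 is the Mastodon address property and that Q6294930 is the Wikidata item for the Journal of Cheminformatics:

SELECT DISTINCT ?author ?authorLabel ?mastodon WHERE {
  ?article wdt:P1433 wd:Q6294930 ;  # published in: Journal of Cheminformatics (QID assumed)
           wdt:P50 ?author .        # author
  ?author wdt:P4033 ?mastodon .     # Mastodon address
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
}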

Yes, that list is not very long yet:


But given the design and implementation of Mastodon, this could change quickly.

FriendFeed??
Well, some of you are old enough to remember FriendFeed. The default interface is different, but if I open a single toot in a separate page, it does remind me a lot of FriendFeed, and I am wondering if Mastodon can be that FriendFeed replacement we have long waited for! What do you think?

Saturday, September 01, 2018

Biological stories: give me a reason to make time for interactive biological pathways

I am overbooked. There are a lot of things I want to do that (I believe) will make science better. But I don't have the time, nor enough funding to hire enough people to help me out (funders: hint, hint). Some things are not linked to project deliverables, and are things I have to do in my free time. And I don't have much of that left, really, but hacking on scientific knowledge relaxes me and gives me energy.

Interactive biological stories
Inspired by Proteopedia (something you must check out, if you do not already know it), Jacob Windsor did an internship to bring this idea to WikiPathways: check his Google Summer of Code project. I want to actually put this to use, but, as outlined above, since it is not part of paid deliverables, it's hard to find time for it.

Thus, I need a reason. And one reason could be: impact. After all, if my work has impact on the scientific community, that helps me keep doing my research. So, when eLife tweeted an interesting study on digoxin, I realized that if enough people were interested in such a high profile story, an interactive pathway may have enough impact for me to free up time. Hence, I asked on Twitter:


But as you can see, there are just not enough RTs yet. So, because I really want something fun to do, please leave me no excuse to skip this (and write grant proposals instead): retweet this tweet. Thanks!

Saturday, August 18, 2018

Compound (class) identifiers in Wikidata

Bar chart showing the number of compounds
with a particular chemical identifier.
I think Wikidata is a groundbreaking project, which will have a major impact on science. Among the reasons are the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase, including a cool SPARQL endpoint, is easily done with Docker.

Wikidata has many subprojects, such as WikiCite, which captures the primary literature. Another one is WikiProject Chemistry. The two nicely match up, I think, making a public database linking chemicals to literature (tho, very much still needs to be done here); see my recent ICCS 2018 poster (doi:10.6084/m9.figshare.6356027.v1, paper pending).

But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for our metabolism pathway research. The mappings, as you may know, are used in the latter via BridgeDb, and we have been using Wikidata as one of three sources for some time now (the others being HMDB and ChEBI). WikiProject Chemistry has a related ChemID effort, and while the wiki page does not show much recent activity, there is actually a lot of ongoing effort (see plot). And I've been adding my bits.
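
As an aside, the numbers behind such a plot can be approximated with a simple count per identifier property. A sketch, for a handful of the properties mentioned in this post (counting over all of Wikidata may need the full query timeout):

SELECT ?propLabel (COUNT(DISTINCT ?compound) AS ?count) WHERE {
  VALUES ?prop { wd:P231 wd:P662 wd:P683 wd:P2057 wd:P235 } # CAS, PubChem CID, ChEBI, HMDB, InChIKey
  ?prop wikibase:directClaim ?directProp .
  ?compound ?directProp ?id .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} GROUP BY ?propLabel
ORDER BY DESC(?count)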

Limitations of the links
But not every identifier in Wikidata has the same meaning. While they are all classified as 'external-id', the actual link may have a different meaning. This, of course, is the essence of scientific lenses; see this post and the papers cited therein. One reason here is the difference in what entries in the various databases mean.

Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes. And these are modeled differently. Furthermore, it has a model that formalizes that things with a different InChI are different, but even allows things with the same InChI to be different, if the need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such, it is a powerful system to handle identifier mappings, because databases are not always clear, and chemical and biological data even less so: we experimentally measure a characterization of chemicals, but what we put in databases and give names are specific models (often chemical graphs).

That model differs from what other (chemical) databases use, or seem to use, because databases do not always indicate what they actually have in a record. But I think the following is a fair guess.

ChEBI
ChEBI (and the matching ChEBI ID) has entries for chemical classes (e.g. fatty acid) and specific compounds (e.g. acetate).

PubChem, ChemSpider, UniChem
These three resources use the InChI as a central asset. While they do not really have the concept of compound classes so much (though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each has its own way to link to other databases, which normally involves tons of structure normalization (see e.g. doi:10.1186/s13321-018-0293-8 and doi:10.1186/s13321-015-0072-8).

HMDB
HMDB (and the matching P2057) has a biological perspective; the entries reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral form. This makes linking to/from other databases, where the compound may not be neutral, chemically less precise.

CAS registry numbers
CAS (and the matching P231) is pretty unique itself: it has identifiers for substances (see Q79529), much more than for chemical compounds, and comes with its own set of unique features. For example, solutions of some compound, by design, have the same identifier. Previously, formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.

Limitations of the links #2
Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise as possible. Of course, that may mean we need more steps, but we can always simplify at will, whereas we can never have a computer make the links more precise again (well, not without making assumptions, etc.).

And that is why Wikidata is so suitable for linking all these chemical databases: it can distinguish differences when needed, and make that explicit. It makes mappings between the databases more FAIR.


Thursday, August 09, 2018

Alternative OpenAPIs around WikiPathways

I blogged in July about something I learned at a great Wikidata/ERC meeting in June: grlc. It's comparable to but different from the Open PHACTS API: it's a lot more general (and works with any SPARQL endpoint), but it also does not have the identifier mapping service (based on BridgeDb) which we need to link the various RDF data sets in Open PHACTS.

Of course, WikiPathways already has an OpenAPI, and it's more powerful than we can do based on just the WikiPathways RDF (for various reasons), but the advantage is that you can expose any SPARQL query (see the examples at rdf.wikipathways.org) on the WikiPathways endpoint. As explained in July, you only have to set up a magic GitHub repository, and Chris suggested showing how this could be used to mimic some of the existing API methods.

The magic
The magic is defined in this GitHub repository, which currently exposes a single method:

#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ endpoint: http://sparql.wikipathways.org/
#+ tags:
#+   - Organism list

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT (str(?label) as ?organism)
WHERE {
    ?concept wp:organism ?organismRes ;
      wp:organismName ?label .
}

The result
I run grlc in the normal way and point it to egonw/wp-rdf-api and the result looks like:


And executing the method in this GUI (click the light blue bar of the method) results in a nice CSV reply:


Of course, because there is SPARQL behind each method, you can make any query you like, creating any OpenAPI methods that fit your data analysis workflow.

Wednesday, August 08, 2018

Green Open Access: increase your Open Access rate; and why stick with the PDF?

Icon of Unpaywall, a must have
browser extension for the modern
researcher.
Researchers of my generation (and earlier generations) have articles from the pre-Open Access era. Actually, I have even been tricked into closed access later; with a lot of pressure to publish as much as you can (which some see as a measure of your quality), it's impossible not to make an occasional misstep. But then there is Green Open Access (aka self-archiving), a concept I don't like, but which is useful in those situations. One reason why I do not like it is that there are many shades of green, and, yes, they all hurt: every journal has special rules. Fortunately, the brilliant SHERPA/RoMEO captures this.

Now, the second event that triggered this effort was my recent experience with Markdown (e.g. the eNanoMapper tutorials) and how platforms like GitHub/GitLab built systems around it to publish content easily.

Why does this matter to me? If I want my work to have impact, I need people to be able to read my work. Open Access is one route. Of course, they can also email me for a copy of the article, but I tend to be busy with getting new grants, supervision, etc. BTW, you can easily calculate your Open Access rate with ImpactStory, something you should try at least once in your life...

Step 1: identify which articles need an green Open Access version
Here, Unpaywall is the right tool, which does a brilliant job at identifying free versions. After all, one of your co-authors may already have self-archived it somewhere. So, yes, I do have a short list, and one of the papers on it was the second CDK paper (doi:10.2174/138161206777585274). The first CDK article was made CC-BY three years ago, with the ACS AuthorChoice program, but Current Pharmaceutical Design (CPD) does not have that option, as far as I know.

Step 2: check your author rights for green Open Access
The next step is to check SHERPA/RoMEO for your self-archiving rights. This is essential, as this is different for every journal; it is basically business model by obscurity, and without any standardization this is not FAIR in any way. For CPD it reports that I have quite a few rights (more than at some bigger journals that still rely on Green to call themselves a "leading open access publisher", but also less than at some others):

SHERPA/RoMEO report for CPD.
Many journals do not allow you to self-archive the post-print version. And that sucks, because a preprint is often quite similar, but just not the same deal (which is exactly what closed access publishers want). But being able to post the post-print version is brilliant, because few people actually even kept the last submitted version (again, exactly what closed access publishers want). This report also tells you where you can archive it, and that is not always the same either: it's not uncommon that self-archiving on something like Mendeley or Zotero is not allowed.

Step 3: a post-print version that is not the publisher PDF??
Ah, so you know what version of the article you can archive, and where. But we cannot archive the publisher PDF. So, no downloading of the PDF from the publisher website and putting that online.

Step 4: a custom PDF
Because in this case we are allowed to archive the post-print version, I am allowed to copy/paste the content from the publisher PDF. I can just create a new Word/LibreOffice document with that content, removing the publisher layout and publisher content, and make a new PDF of that. A decent PDF reader allows you to copy/paste large amounts of content in one go, and Linux/Win10 users can use pdfimages to extract the images from the PDF for reuse.
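
For example, something along these lines (assuming a reasonably recent poppler-utils; older versions lack the -png flag and write PPM/PBM files instead):

# list the images embedded in the publisher PDF
pdfimages -list article.pdf

# extract them as PNG files named fig-000.png, fig-001.png, ...
pdfimages -png article.pdf fig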

Step 5: why stick with the PDF?
But why would we stick with a PDF? Why not use something more machine readable? Something that supports syntax highlighting, downloading of table content as CSV, etc.? And that made me think of my recent experiments with Markdown.

So, I started off with making a Markdown version of the second CDK paper.

In this process, I:

  1. removed hyphenation used to fit words/sentences nicely in PDF columns;
  2. wrapped the code sections for syntax highlighting;
  3. recovered the images with pdfimages;
  4. converted the table content to CSV (and used Markdown Tables Generator to create Markdown content) and added "Download as CSV" links to the table captions;
  5. made the URLs clickable; and,
  6. added ORCID icons for the authors (where known).
Preview of the self-archived post-print of the second CDK article.
Step 6: tweet the free Green Open Access link
Of course, if no one knows about your effort, they cannot find your self-archived version. In due time, Google Scholar may pick it up, but I am not sure yet. Maybe (Bio)Schemas.org will help, but that is something I have yet to explore.

It's important to include the DOI URL in that tweet, so that the self-archived version will be linked to from services like Altmetric.com.


Next steps: get Unpaywall to know about your self-archived version
This is something I am actively exploring. When I know the steps to achieve this, I will report on that in this blog.

Saturday, August 04, 2018

WikiPathways Summit 2018

I was not there when WikiPathways was founded; I only joined in 2012, and I found my role in the area of metabolic pathways of this Open knowledge base (CC0, to be precise) of biological processes. This autumn, a WikiPathways Summit 2018 is organized in San Francisco to celebrate the 10th anniversary of the project, and everyone interested is kindly invited to join for three days of learning about WikiPathways, integrations and use cases, data curation, and hacking on this great Open Science project.


Things that I would love to talk about (besides making metabolic pathways FAIR and Openly available) are the integrations with other platforms (Reactome, RaMP, MetaboLights, Pathway Commons, PubChem, Open PHACTS (using the RDF), etc., etc.), Wikidata interoperability, and future interoperability with platforms like AOPWiki, Open Targets, BRENDA, Europe PMC, etc., etc., etc.

Monday, July 09, 2018

Converting any SPARQL endpoint to an OpenAPI

Logo of the grlc project.
Sometimes you run into something awesome. I had that one or two months ago, when I found out about a cool project that can convert a random SPARQL endpoint into an OpenAPI endpoint: grlc. Now, May/June was really busy (well, the last few weeks before summer are not much less so), but at the WikiProject Wikidata for research meeting in Berlin last month, I just had to give it a go.

There is a convenient Docker image, so setting it up was a breeze (see their GitHub repo):

git clone https://github.com/CLARIAH/grlc
cd grlc
docker pull clariah/grlc
docker-compose -f docker-compose.default.yml up

What the software does is take a number of configuration files that define what the OpenAPI REST call should look like, and what the underlying SPARQL is. For example, to get all projects in Wikidata with a CORDIS project identifier, we have this configuration file:

#+ summary: Lists grants with a CORDIS identifier
#+ endpoint_in_url: False
#+ endpoint: http://query.wikidata.org/sparql
#+ tags:
#+   - Grants

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?grant ?grantLabel ?cordis WHERE {
  ?grant wdt:P3400 ?cordis .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

The full set of configuration files I hacked up (including one with a parameter) can be found here. The OpenAPI then looks something like this:


I haven't played enough with it yet, and I hope we can later use this in OpenRiskNet.
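
About the configuration with a parameter mentioned above: as far as I understand grlc, SPARQL variables starting with ?_ are exposed as API parameters and get substituted with the supplied value. A minimal sketch (not the actual file in my repository) could look like this:

#+ summary: Looks up the grant with a given CORDIS identifier
#+ endpoint_in_url: False
#+ endpoint: http://query.wikidata.org/sparql
#+ tags:
#+   - Grants

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?grant WHERE {
  ?grant wdt:P3400 ?_cordis .
}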

Sunday, July 01, 2018

European Union Observatory for Nanomaterials now includes eNanoMapper

The European Union Observatory for Nanomaterials (EUON) reported about two weeks ago that their observatory added two new data sets, one of which is the eNanoMapper database, which includes the NanoWiki and NANoREG data. Both are exposed using the eNanoMapper database software (see this paper). It's very rewarding to see your work picked up like this, and it motivates me very much for the new NanoCommons project!


The European Union Observatory for Nanomaterials.

LIPID MAPS identifiers and endocannabinoids

Endocannabinoids.
Maybe I will find some more time later, but for now just a quick notice of an open notebook I kept yesterday for adding more LIPID MAPS identifiers to Wikidata. It started with a node in a WikiPathways pathway which did not have an identifier: endocannabinoids:
This is why I am interested in Wikidata, as I can mint entries there myself (see this ICCS 2018 poster). And so I did, but when adding a chemical class, you want to add specific compounds from that class too. That's where LIPID MAPS comes in, because it has info on specific compounds in that class.

Some time ago I asked about adding more LIPID MAPS identifiers to Wikidata, which has a lot of benefits for the community and for LIPID MAPS. I was informed I could use their REST API to get mappings between InChIKeys and their identifiers, and that is enough for me to add more of their identifiers to Wikidata (a similar approach to what I used for the EPA CompTox Dashboard and SPLASHes). The advantages include that LIPID MAPS can now easily get data to add links to the PDB and MassBank to their lipid database (and much more).

My advantage is that I can easily query whether a particular compound is a specific endocannabinoid. I created two Bioclipse scripts, and one looks like:

// ask permission to use data from their REST API (I did and got it)

restAPI = "http://www.lipidmaps.org/rest/compound/lm_id/LM/all/download"
propID = "P2063"

allData = bioclipse.downloadAsFile(
  restAPI, "/LipidMaps/lipidmaps.txt"
)


sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (substr(str(?compound),32) as ?wd) ?key ?lmid WHERE {
  ?compound wdt:P235 ?key .
  MINUS { ?compound wdt:${propID} ?lmid . }
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
}

def renewFile(file) {
  if (ui.fileExists(file)) ui.remove(file)
  ui.newFile(file)
  return file
}

mappingsFile = "/LipidMaps/mappings.txt"
missingCompoundFile = "/LipidMaps/missing.txt"

// ignore certain Wikidata items, where I don't want the LIPID MAPS ID added
ignores = new java.util.HashSet();
// ignores.add("Q37111097")

// make a map
map = new HashMap()
for (i=1;i<=results.rowCount;i++) {
  rowVals = results.getRow(i)
  map.put(rowVals[1], rowVals[0])  
}

batchSize = 500
batchCounter = 0
mappingContent = ""
missingContent = ""
print "Saved a batch"
renewFile(mappingsFile)
renewFile(missingCompoundFile)
new File(bioclipse.fullPath("/LipidMaps/lipidmaps.txt")).eachLine{ line ->
  fields = line.split("\t")
  if (fields.length > 15) {
    lmid = fields[1]
    inchikey = fields[15]
    if (inchikey != null && inchikey.length() > 10) {
      batchCounter++
      if (map.containsKey(inchikey)) {
        wdid = map.get(inchikey)
        if (!ignores.contains(wdid)) {
          mappingContent += "${wdid}\t${propID}\t\"${lmid}\"\tS143\tQ20968889\tS854\t\"http://www.lipidmaps.org/rest/compound/lm_id/LM/all/download\"\tS813\t+2018-06-30T00:00:00Z/11\n"
        }  
      } else {
        missingContent += "${inchikey}\n"
      }
    }
  }
  if (batchCounter >= batchSize) {
    ui.append(mappingsFile, mappingContent)
    ui.append(missingCompoundFile, missingContent)
    batchCounter = 0
    mappingContent = ""
    missingContent = ""
    print "."
  }
}
println "\n"

With that, I managed to increase the number of LIPID MAPS identifiers from 2333 to 6099, but there are an additional 38 thousand lipids not yet in Wikidata.
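
(If you want to check the current count yourself, a simple query counting items with a LIPID MAPS ID (P2063) does the trick:)

SELECT (COUNT(DISTINCT ?compound) AS ?count) WHERE {
  ?compound wdt:P2063 ?lmid . # LIPID MAPS ID
}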


Many more details can be found in my notebook, but in the end I ended up with a nice Scholia page for endocannabinoids :)

Saturday, June 16, 2018

Representation of chemistry and machine learning: what do X1, X2, and X3 mean?

Modelling doesn't always go well and the model is lousy at
predicting the experimental value (yellow).
Machine learning in chemistry, or multivariate statistics, or chemometrics, is a field that uses computational and mathematical methods to find patterns in data. And if you use them right, you can correlate those patterns (features) with a dependent variable, allowing you to predict it from those features. Example: if you know a molecule has a carboxylic acid group, it is likely more acidic.

The patterns (features) and the correlation need to be established. An overfitted model will say: if it is this molecule then the pKa is that, but if it is that molecule then the pKa is such. An underfitted model will say: if there is an oxygen, then the compound is more acidic. The fields of chemometrics and cheminformatics have a few decades of experience in hitting the right level of fitness. But that's a lot of literature. It took me literally a four-year PhD project to get some grip on it (want a print copy for your library?).

But basically all methods work like this: if X is present, then... Whether X is numeric or categorical, X is used to make decisions. And, second, X rarely is the chemical itself, which is a cloud of nuclei and electrons. Instead, it's a representation of the chemical. And that's where one of the difficulties comes in:
  1. one single but real molecular aspect can be represented by X1 and X2
  2. two real molecular aspects can be represented by X3
Ideally, every unique aspect has a unique X to represent it, but this is sometimes hard with our cheminformatics toolboxes. As studied in my thesis, this can be overcome by the statistical modelling, but there is some interplay between the representation and modelling.

So, how common are difficulties #1 and #2? Well, I was discussing #1 with a former collaborator at AstraZeneca in Sweden last Monday: we were building QSAR models including features that capture chirality (I think it was a cool project) and we wanted to use the R/S chirality annotations for atoms. However, it turned out this CIP model suffers from difficulty #1: even if the 3D distribution of atoms around a chiral atom (yes, I saw the discussion about using such words on Twitter, but you know what I mean) does not change, in the CIP model a remote change in the structure can flip the R to an S label. So, we have the exact same single 3D fragment, but both an X1 and an X2.

Source: Wikipedia, public domain.
Noel seems to have found another example of this in canonical SMILES. I had some trouble understanding the exact combination of representation and deep neural networks (DNN), but the above is likely to apply. A neural network has a certain number of input neurons (green, in the image) and each neuron studies one X. So, think of them as neurons X1, X2, X3, etc. Each neuron has weighted links (black arrows) to intermediate neurons (blueish) that propagate knowledge about the modeled system, and those are linked to the output layer (purple), which, for example, reflects the predicted pKa. By tuning the weights the neural network learns what features are important for what output value: if X1 is unimportant, it will propagate less information (low weight).

So, it immediately visualizes what happens if we have difficulty #1: the DNN needs to learn more weights without having more complex data (with a higher chance of overfitting). Similarly, if we have difficulty #2, we still have only one set of paths from a single green input neuron to that single output neuron; one path to determine the outcome of the purple neuron. If trained properly, it will reduce the weights for such nodes and focus on other input nodes. But the problem is clear too: the two distinct molecular aspects cannot be taken into account separately.

What does that mean for Noel's canonical SMILES question? I am not entirely sure, as I would need to know more about how the SMILES is translated into (fed into) the green input layer. But I'm reasonably sure that it involves the two aforementioned difficulties; sure enough to write up this reply... Back to you, Noel!