Monday, July 09, 2018

Converting any SPARQL endpoint to an OpenAPI

Logo of the grlc project.
Sometimes you run into something awesome. I had that one or two months ago, when I found out about a cool project that can convert a random SPARQL endpoint into an OpenAPI endpoint: grlc. Now, May/June was really busy (well, the least few weeks before summer not much less so), but at the WikiProject Wikidata for research meeting in Berlin last month, I just had to give it a go.

There is a convenient Docker image, so setting it up was a breeze (see their GitHub repo):

git clone https://github.com/CLARIAH/grlc
cd grlc
docker pull clariah/grlc
docker-compose -f docker-compose.default.yml up

What the software does is take a number of configuration files that define what the OpenAPI REST call should look like, and what the underlying SPARQL is. For example, to get all projects in Wikidata with a CORDIS project identifier, we have this configuration file:

#+ summary: Lists grants with a CORDIS identifier
#+ endpoint_in_url: False
#+ endpoint: https://query.wikidata.org/sparql
#+ tags:
#+   - Grants

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?grant ?grantLabel ?cordis WHERE {
  ?grant wdt:P3400 ?cordis .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

The full set of configuration files I hacked up (including one with a parameter) can be found here. The OpenAPI then looks something like this:
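Under the hood, this decorator convention is simple enough to sketch. The following is a hypothetical illustration of how such "#+" headers could be split from the SPARQL body into API metadata; it is not grlc's actual code, just the idea:

```python
# Hypothetical sketch of parsing grlc-style "#+" decorators into metadata
# plus a SPARQL body. This is NOT grlc's implementation, only the concept.

def parse_decorators(query_text):
    """Split a decorated query file into ('#+' metadata dict, SPARQL string)."""
    meta = {}
    last_key = None
    sparql_lines = []
    for line in query_text.splitlines():
        if line.startswith("#+"):
            entry = line[2:].strip()
            if entry.startswith("- ") and last_key is not None:
                meta.setdefault(last_key, []).append(entry[2:])   # list item
            elif ":" in entry:
                key, _, value = entry.partition(":")
                last_key = key.strip()
                meta[last_key] = value.strip() or []              # scalar or list head
        else:
            sparql_lines.append(line)
    return meta, "\n".join(sparql_lines).strip()

example = """#+ summary: Lists grants with a CORDIS identifier
#+ tags:
#+   - Grants

SELECT ?grant ?cordis WHERE { ?grant wdt:P3400 ?cordis . }"""
meta, sparql = parse_decorators(example)
```

grlc itself then uses such metadata to generate the OpenAPI specification and maps the remaining SPARQL to the endpoint call.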

I haven't played enough with it yet, and I hope we can later use this in OpenRiskNet.

Sunday, July 01, 2018

European Union Observatory for Nanomaterials now includes eNanoMapper

The European Union Observatory for Nanomaterials (EUON) reported about two weeks ago that their observatory added two new data sets, one of which is the eNanoMapper database which includes NanoWiki and the NANoREG data. And both are exposed using the eNanoMapper database software (see this paper). It's very rewarding to see your work picked up like this and motivates me very much for the new NanoCommons project!

The European Union Observatory for Nanomaterials.

LIPID MAPS identifiers and endocannabinoids

Maybe I will find some more time later, but for now just a quick note about an open notebook I kept yesterday for adding more LIPID MAPS identifiers to Wikidata. It started with a node in a WikiPathways pathway that did not have an identifier: endocannabinoids:
This is why I am interested in Wikidata, as I can mint entries there myself (see this ICCS 2018 poster). And so I did, but when adding a chemical class, you want to add specific compounds from that class too. That's where LIPID MAPS comes in, because it has info on specific compounds in that class.

Some time ago I asked about adding more LIPID MAPS identifiers to Wikidata, which has a lot of benefits for both the community and LIPID MAPS. I was informed I could use their REST API to get mappings between InChIKeys and their identifiers, and that is enough for me to add more of their identifiers to Wikidata (a similar approach to the one I used for the EPA CompTox Dashboard identifiers and SPLASHes). One advantage is that LIPID MAPS can now easily get data to add links to the PDB and MassBank to their lipid database (and much more).
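The core of that mapping step is just record parsing. As a sketch, here is the same extraction in Python; the column positions (LM ID in field 1, InChIKey in field 15) are an assumption mirroring the Bioclipse script below, and the example row is fabricated for illustration:

```python
# Sketch of pulling the (LM ID, InChIKey) pair out of one line of a
# tab-separated LIPID MAPS export. Column positions are assumed to match
# the Bioclipse script (field 1 and field 15).

def extract_mapping(line):
    """Return (lmid, inchikey) for a usable data line, otherwise None."""
    fields = line.split("\t")
    if len(fields) <= 15:
        return None
    lmid, inchikey = fields[1], fields[15]
    if not inchikey or len(inchikey) <= 10:  # same sanity check as the script
        return None
    return (lmid, inchikey)

# a fabricated example row (palmitic acid's LM ID and InChIKey)
row = [""] * 17
row[1] = "LMFA01010001"
row[15] = "IPCSVZSSVZVIGE-UHFFFAOYSA-N"
mapping = extract_mapping("\t".join(row))
```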

My advantage is that I can easily query whether a particular compound is a specific endocannabinoid. I created two Bioclipse scripts, and one looks like this:

// ask permission to use data from their REST API (I did and got it)

restAPI = ""
propID = "P2063"

// download the full LIPID MAPS export
allData = bioclipse.downloadAsFile(
  restAPI, "/LipidMaps/lipidmaps.txt"
)

// all Wikidata compounds with an InChIKey (P235) but no LIPID MAPS ID yet
sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (substr(str(?compound),32) as ?wd) ?key ?lmid WHERE {
  ?compound wdt:P235 ?key .
  MINUS { ?compound wdt:${propID} ?lmid . }
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "https://query.wikidata.org/sparql", sparql
  )
}

def renewFile(file) {
  if (ui.fileExists(file)) ui.remove(file)
  return file
}

mappingsFile = renewFile("/LipidMaps/mappings.txt")
missingCompoundFile = renewFile("/LipidMaps/missing.txt")

// ignore certain Wikidata items, where I don't want the LIPID MAPS ID added
ignores = new java.util.HashSet();
// ignores.add("Q37111097")

// make a map from InChIKey to Wikidata item
map = new HashMap()
for (i=1;i<=results.rowCount;i++) {
  rowVals = results.getRow(i)
  map.put(rowVals[1], rowVals[0])
}

batchSize = 500
batchCounter = 0
mappingContent = ""
missingContent = ""
new File(bioclipse.fullPath("/LipidMaps/lipidmaps.txt")).eachLine{ line ->
  fields = line.split("\t")
  if (fields.length > 15) {
    lmid = fields[1]
    inchikey = fields[15]
    if (inchikey != null && inchikey.length() > 10) {
      batchCounter++
      if (map.containsKey(inchikey)) {
        wdid = map.get(inchikey)
        if (!ignores.contains(wdid)) {
          mappingContent += "${wdid}\t${propID}\t\"${lmid}\"\tS143\tQ20968889\tS854\t\"\"\tS813\t+2018-06-30T00:00:00Z/11\n"
        }
      } else {
        missingContent += "${inchikey}\n"
      }
    }
  }
  if (batchCounter >= batchSize) {
    ui.append(mappingsFile, mappingContent)
    ui.append(missingCompoundFile, missingContent)
    batchCounter = 0
    mappingContent = ""
    missingContent = ""
    print "."
  }
}
// flush whatever is left after the last full batch
ui.append(mappingsFile, mappingContent)
ui.append(missingCompoundFile, missingContent)
println "\n"
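The tab-separated lines the script accumulates follow the QuickStatements batch format: item, property, quoted value, followed by reference property/value pairs (S143 stated in, S854 reference URL, S813 retrieved date). A small Python sketch of that line format, where the item Q12345 and the empty reference URL are placeholders, as in the script:

```python
# Sketch of the QuickStatements row assembled by the script above:
# statement (item, property, quoted string value) plus references.
# Q20968889 is the 'stated in' item the script uses; Q12345 is a placeholder.

def quickstatements_line(wdid, propid, lmid,
                         retrieved="+2018-06-30T00:00:00Z/11"):
    return "\t".join([
        wdid, propid, '"' + lmid + '"',
        "S143", "Q20968889",  # stated in
        "S854", '""',         # reference URL (left empty here)
        "S813", retrieved,    # retrieved date
    ])

line = quickstatements_line("Q12345", "P2063", "LMFA01010001")
```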

With that, I managed to increase the number of LIPID MAPS identifiers from 2333 to 6099, but there are an additional 38 thousand lipids not yet in Wikidata.

Many more details can be found in my notebook, but in the end I ended up with a nice Scholia page for endocannabinoids :)

Saturday, June 16, 2018

Representation of chemistry and machine learning: what do X1, X2, and X3 mean?

Modelling doesn't always go well and the model is lousy at
predicting the experimental value (yellow).
Machine learning in chemistry, or multivariate statistics, or chemometrics, is a field that uses computational and mathematical methods to find patterns in data. And if you use them right, you can correlate those features to a dependent variable, allowing you to predict it from those features. Example: if you know a molecule has a carboxylic acid group, then it is more acidic.

The patterns (features) and correlations need to be established. An overfitted model will say: if it is this molecule then the pKa is that, but if it is that molecule then the pKa is such. An underfitted model will say: if there is an oxygen, then the compound is more acidic. The fields of chemometrics and cheminformatics have a few decades of experience in hitting the right level of fit. But that's a lot of literature. It literally took me a four-year PhD project to get some grip on it (want a print copy for your library?).

But basically all methods work like this: if X is present, then... Whether X is numeric or categorical, X is used to make decisions. And, second, X rarely is the chemical itself, which is a cloud of nuclei and electrons. Instead, it is a representation of it. And that's where one of the difficulties comes in:
  1. one single but real molecular aspect can be represented by X1 and X2
  2. two real molecular aspects can be represented by X3
Ideally, every unique aspect has a unique X to represent it, but this is sometimes hard with our cheminformatics toolboxes. As studied in my thesis, this can be overcome by the statistical modelling, but there is some interplay between the representation and modelling.

So, how common are difficulties #1 and #2? Well, I was discussing #1 with a former collaborator at AstraZeneca in Sweden last Monday: we were building QSAR models including features that capture chirality (I think it was a cool project) and we wanted to use the R/S chirality annotation for atoms. However, it turned out this CIP model suffers from difficulty #1: even if the 3D distribution of atoms around a chiral atom (yes, I saw the discussion about using such words on Twitter, but you know what I mean) does not change, in the CIP model a remote change in the structure can flip the R label to an S label. So, we have the exact same single 3D fragment, but an X1 and an X2.

Source: Wikipedia, public domain.
Noel seems to have found another example of this in canonical SMILES. I had some trouble understanding the exact combination of representation and deep neural networks (DNN), but the above is likely to apply. A neural network has a certain number of input neurons (green, in the image) and each neuron studies one X. So, think of them as neurons X1, X2, X3, etc. Each neuron has weighted links (black arrows) to intermediate neurons (blueish) that propagate knowledge about the modeled system, and those are linked to the output layer (purple), which, for example, reflects the predicted pKa. By tuning the weights, the neural network learns which features are important for which output value: if X1 is unimportant, it will propagate less information (low weight).

So, it immediately visualizes what happens if we have difficulty #1: the DNN needs to learn more weights without more complex data (with a higher chance of overfitting). Similarly, if we have difficulty #2, we still have only one set of paths from a single green input neuron to that single output neuron; one path to determine the outcome of the purple neuron. If trained properly, the network will reduce the weights for such nodes and focus on other input nodes. But the problem is clear too: the original two real molecular aspects cannot be seriously taken into account.
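The weight intuition is easy to see in a toy forward pass: if the weight on an input neuron is zero, changing that input cannot change the output. A deliberately tiny sketch, not the DNN under discussion:

```python
# Toy single neuron: weighted sum plus ReLU. With the weight for X1 set to
# zero, the value of X1 has no influence on the output at all.

def neuron(inputs, weights, bias=0.0):
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU activation

weights = [0.0, 0.8, 0.3]                 # X1 is weighted zero
out_a = neuron([1.0, 0.5, 2.0], weights)
out_b = neuron([9.9, 0.5, 2.0], weights)  # only X1 differs
```

Since only X1 differs between the two calls and its weight is zero, both outputs are identical.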

What does that mean for Noel's canonical SMILES questions? I am not entirely sure, as I would need to know more about how the SMILES is translated into (fed into) the green input layer. But I'm reasonably sure that it involves the two aforementioned difficulties; sure enough to write up this reply... Back to you, Noel!

Saturday, June 02, 2018

Supplementary files: just an extra, or essential information?

I recently had a discussion about supplementary information (additional files, etc):

The Narrative
Journal articles have evolved into an elaborate dissemination channel focused on the narrative of the new finding. Some journals focus on recording all the details needed to reproduce the work, while others focus just on the narrative, overviews, and impact. Sadly, there is no database that tells you which journal does what.

One particularly interesting example is Nature Methods, a journal dedicated to scientific methods... one would assume the method to be an important part of the article, right? No, think again, as seen in the PDF of this article (doi:10.1038/s41592-018-0009-z):

Supplementary information
An intermediate solution is the supplementary information part of publications. Supplementary information, also called additional files, is a way for authors to provide more detail. In a proper Open Science world, it would not exist: all the details would be available in data repositories, databases, electronic notebooks, source code repositories, etc. The community is moving in that direction, and publishers are slowly picking this up (slowly? Yes, more than 7 years ago I wrote about the same topic), but supplementary information remains.

But there are some issues with supplementary information (SI), which I will discuss below. Let me first say that I do not like the idea of supplementary information at all: content is either important to the paper, or it is not. If it is merely relevant, then just cite it.

I am not the only one who finds it inconvenient that SI is not an integral part of a publication.
Problem #1: peer review
One problem is that SI is not an integral part of the publication. Those journals that still print (and therein likely lies the origin of the concept) do not want to print all the details, because it would make the journal too thick. But when an article is peer-reviewed, content in the main article is reviewed more thoroughly than the SI. Does this cause problems? Sure! Remember this content analysis of Excel files in SI (doi:10.1186/s13059-016-1044-7)?

One of the issues here is that publishers generally do not have computer-assisted peer-review / authoring tools in place; tools that help authors and reviewers get a faster and better overview of the material they are writing and/or reviewing. Excel gene ID/date issues can be prevented, if we only wanted to. The same goes for SMILES strings for chemical compounds; see e.g. How to test SMILES strings in Supplementary Information, but there are much older examples of this, e.g. by the Murray-Rust lab (see doi:10.1039/B411033A).
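As a toy example of such computer-assisted checking, a few lines of code can already flag obviously mangled SMILES strings in an SI table. This is not a real SMILES parser (a toolkit like the CDK or RDKit should be used for that); it only checks balanced branches and paired single-digit ring closures:

```python
# Cheap sanity check for SMILES strings: balanced parentheses and ring-closure
# digits occurring an even number of times. NOT a real parser; e.g. two-digit
# ring closures (%10) and digits inside brackets are not handled.

def smiles_sanity_check(smiles):
    depth = 0
    ring_closures = {}
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():
            ring_closures[ch] = ring_closures.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_closures.values())
```

Even a crude filter like this, run at submission time, would catch copy-paste truncations before they reach reviewers.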

Problem #2: archiving
A similar problem is that (national/university) libraries do not routinely archive the supplementary information. I cannot go to my library and request the SI of an article that I can read via my library. It gets worse: a few years ago I was told (by someone with inside info) that a big publisher did not guarantee that SI is archived at all:

I'm sure Elsevier has something in place now (which I assume to be based on Mendeley Data), as competitors do too, using Figshare, such as BioMed Central:

But mind you, while Figshare promises archival for a longer period of time, that is nothing close to what libraries provide. So, I still think the Dutch universities must treat such data repositories (and databases) as an essential component of what to archive long-term, just like books.

Problem #3: reuse
A final problem I would like to touch upon, very briefly, is reuse. In cheminformatics it is quite common to reuse SI from other articles, I assume mostly because the information in those datasets is not available from other resources. With SI increasingly available from Figshare, this may change. However, reuse of SI by databases is rare and not always easy.

But this needs a totally different story, because there are so many aspects to reuse of SI...