Saturday, August 18, 2018

Compound (class) identifiers in Wikidata

Bar chart showing the number of compounds with a particular chemical identifier.
I think Wikidata is a groundbreaking project that will have a major impact on science. Among the reasons are the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase, including a cool SPARQL endpoint, is easily done with Docker.

Wikidata has many subprojects, such as WikiCite, which captures the collective primary literature, and WikiProject Chemistry. The two match up nicely, I think, together forming a public database linking chemicals to literature (though very much still needs to be done here); see my recent ICCS 2018 poster (doi:10.6084/m9.figshare.6356027.v1, paper pending).

But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for our metabolic pathway research. The mappings, as you may know, are used in that research via BridgeDb, and we have been using Wikidata as one of three sources for some time now (the others being HMDB and ChEBI). WikiProject Chemistry has a related ChemID effort, and while the wiki page does not show much recent activity, there is actually a lot of ongoing effort (see plot). And I have been adding my bits.

Limitations of the links
But not every identifier in Wikidata has the same meaning. While they are all classified as 'external-id', the actual links may have different meanings. This, of course, is the essence of scientific lenses; see this post and the papers cited therein. One reason is the difference in what entries in the various databases mean.

Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes, and these are modeled differently. Furthermore, it has a model that formalizes that things with a different InChI are different, but it even allows things with the same InChI to be different, if the need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such, it is a powerful system for handling identifier mappings, because databases are not clear, and chemical and biological data even less so: what we measure experimentally is a characterization of chemicals, but what we put in databases and give names are specific models (often chemical graphs).

That model differs from what other (chemical) databases use, or seem to use, because databases do not always indicate what a record actually represents. But I think the guesses below are fair.

ChEBI (and the matching ChEBI ID) has entries for chemical classes (e.g. fatty acid) and specific compounds (e.g. acetate).

PubChem, ChemSpider, UniChem
These three resources use the InChI as a central asset. While they do not really have the concept of compound classes (though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each has its own way of linking to other databases, which normally involves a good deal of structure normalization (see e.g. doi:10.1186/s13321-018-0293-8 and doi:10.1186/s13321-015-0072-8).
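To make the stereochemistry point concrete: a crude way to compare records at a stereo-insensitive level is to drop the stereo layers (/b, /t, /m, /s) from a standard InChI. A minimal sketch, my own illustration; the actual normalization pipelines of these resources are far more elaborate:

```python
def strip_stereo_layers(inchi: str) -> str:
    """Drop the stereo layers (/b, /t, /m, /s) from a standard InChI.

    A crude sketch for illustration only; real resources use their
    own, much more elaborate normalization rules.
    """
    parts = inchi.split("/")
    # the first two parts (version and formula) never start with a
    # lowercase layer prefix, so a first-character test suffices
    kept = [p for p in parts if not (p and p[0] in "btms")]
    return "/".join(kept)


# the two enantiomers of alanine differ only in their stereo layers
l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(strip_stereo_layers(l_ala) == strip_stereo_layers(d_ala))  # True
```

With the stereo layers gone, both enantiomers collapse onto the same stereo-insensitive record, which is roughly the level at which an entry with "undefined or unknown" stereochemistry matches.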

HMDB (and the matching P2057) has a biological perspective; the entries reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral form. This makes linking to/from other databases where the compound is not neutral chemically less precise.

CAS registry numbers
CAS (and the matching P231) is pretty unique itself: it has identifiers for substances (see Q79529), much more than just chemical compounds, and comes with its own set of unique features. For example, solutions of some compound have, by design, the same identifier. Previously, formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.
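As an aside, CAS registry numbers carry a built-in consistency check that is handy when curating mappings like these: the last digit is a checksum over the preceding digits. A small sketch (my own helper, not an official CAS tool):

```python
def cas_checksum_ok(cas: str) -> bool:
    """Validate the check digit of a CAS registry number (e.g. '50-00-0',
    formaldehyde): the last digit must equal the sum of the other digits,
    weighted 1, 2, 3, ... from right to left, modulo 10."""
    digits = cas.replace("-", "")
    if not digits.isdigit() or len(digits) < 5:
        return False
    body, check = digits[:-1], int(digits[-1])
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


print(cas_checksum_ok("50-00-0"))  # True: formaldehyde (and formalin)
print(cas_checksum_ok("50-00-1"))  # False: corrupted check digit
```

A failing checksum is a cheap early warning that a scraped or hand-typed CAS number is wrong before it ends up in Wikidata.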

Limitations of the links #2
Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise as possible. Of course, that may mean we need more steps, but we can always simplify at will; we can never have a computer make the links more precise again (well, not without making assumptions, etc.).

And that is why Wikidata is so suitable for linking all these chemical databases: it can distinguish differences when needed, and make them explicit. That makes mappings between the databases more FAIR.
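The "simplify at will, but never automatically refine" point can be made concrete with a toy example. Assume each mapping carries its precision, in the spirit of SKOS match types; the table and function names here are my own illustration, not an existing API:

```python
# Toy mapping table; the middle column is the precision of the link.
# Q153 (ethanol) to ChEBI is a true compound-to-compound link; the
# second row is a deliberately fuzzy, hypothetical class-level link.
MAPPINGS = [
    ("wikidata:Q153",     "exactMatch",   "chebi:CHEBI:16236"),
    ("wikidata:Qexample", "relatedMatch", "cas:50-00-0"),
]

def coarsen(mappings):
    """Discard precision: every link becomes a plain relatedMatch.
    This direction is always possible; the reverse (upgrading a
    relatedMatch to an exactMatch) needs extra assumptions or manual
    curation, which is exactly the asymmetry discussed above."""
    return [(s, "relatedMatch", t) for s, _kind, t in mappings]
```

A consumer that only wants "roughly the same thing" can run `coarsen()` and be done; no function can be written that restores the lost `exactMatch` labels from the coarse table alone.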

Thursday, August 09, 2018

Alternative OpenAPIs around WikiPathways

I blogged in July about something I learned at a great Wikidata/ERC meeting in June: grlc. It's comparable to, but different from, the Open PHACTS API: it's a lot more general (it works with any SPARQL endpoint), but it also does not have the identifier mapping service (based on BridgeDb) that we need to link the various RDF data sets in Open PHACTS.

Of course, WikiPathways already has an OpenAPI, and it's more powerful than what we can do based on just the WikiPathways RDF (for various reasons), but the advantage is that you can expose any SPARQL query (see the examples on the WikiPathways endpoint). As explained in July, you only have to set up a magic GitHub repository, and Chris suggested showing how this could be used to mimic some of the existing API methods.

The magic
The magic is defined in this GitHub repository, which currently exposes a single method:

#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ endpoint:
#+ tags:
#+   - Organism list

PREFIX wp: <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT (str(?label) AS ?organism)
WHERE {
  ?concept wp:organism ?organismIri ;
    wp:organismName ?label .
}

The result
I run grlc in the normal way and point it to egonw/wp-rdf-api, and the result looks like this:

And executing the method in this GUI (click the light blue bar of the method), results in a nice CSV reply:

Of course, because there is SPARQL behind each method, you can make any query you like, creating any OpenAPI methods that fit your data analysis workflow.

Wednesday, August 08, 2018

Green Open Access: increase your Open Access rate; and why stick with the PDF?

Icon of Unpaywall, a must have
browser extension for the modern
Researchers of my generation (and earlier generations) have articles from the pre-Open Access era. Actually, I have even been tricked into closed access later; with a lot of pressure to publish as much as you can (which some see as a measure of your quality), it is impossible not to make an occasional misstep. But then there is Green Open Access (aka self-archiving), a concept I don't like, but which is useful in those situations. One reason why I do not like it is that there are many shades of green, and, yes, they all hurt: every journal has special rules. Fortunately, the brilliant SHERPA/RoMEO captures this.

Now, the second event that triggered this effort was my recent experience with Markdown (e.g. the eNanoMapper tutorials) and how platforms like GitHub/GitLab built systems around it to publish this easily.

Why does this matter to me? If I want my work to have impact, I need people to be able to read it. Open Access is one route. Of course, they can also email me for a copy of the article, but I tend to be busy with getting new grants, supervision, etc. BTW, you can easily calculate your Open Access rate with ImpactStory, something you should try at least once in your life...

Step 1: identify which articles need a green Open Access version
Here, Unpaywall is the right tool; it does a brilliant job at identifying free versions. After all, one of your co-authors may already have self-archived it somewhere. So, yes, I do have a short list, and one of the papers on it was the second CDK paper (doi:10.2174/138161206777585274). The first CDK article was made CC-BY three years ago, with the ACS AuthorChoice program, but Current Pharmaceutical Design (CPD) does not have that option, as far as I know.

Step 2: check your author rights for green Open Access
The next step is to check SHERPA/RoMEO for your self-archiving rights. This is essential, as the rights differ for every journal; this is basically business model by obscurity, and without any standardization it is not FAIR in any way. For CPD it reports that I have quite a few rights (more than some bigger journals that still rely on Green to call themselves a "leading open access publisher", but also fewer than some others):

SHERPA/RoMEO report for CPD.
Many journals do not allow you to self-archive the post-print version. And that sucks, because a preprint is often quite similar, but just not the same deal (which is exactly what closed access publishers want). But being able to post the post-print version is brilliant, because few people have actually kept the last submitted version (again, exactly what closed access publishers want). This report also tells you where you can archive it, and that is not always the same either: it is not uncommon that self-archiving on something like Mendeley or Zotero is not allowed.

Step 3: a post-print version that is not the publisher PDF??
Ah, so now you know which version of the article you can archive, and where. But we cannot archive the publisher PDF. So, no downloading the PDF from the publisher website and putting that online.

Step 4: a custom PDF
Because in this case we are allowed to archive the post-print version, I am allowed to copy/paste the content from the publisher PDF. I can just create a new Word/LibreOffice document with that content, removing the publisher layout and publisher content, and make a new PDF of that. A decent PDF reader allows you to copy/paste large amounts of content in one go, and Linux/Win10 users can use pdfimages to extract the images from the PDF for reuse.

Step 5: why stick with the PDF?
But why would we stick with a PDF? Why not use something more machine readable? Something that supports syntax highlighting, downloading of table content as CSV, etc.? That made me think of my recent experiments with Markdown.

So, I started off by making a Markdown version of the second CDK paper.

In this process, I:

  1. removed hyphenation used to fit words/sentences nicely in PDF columns;
  2. wrapped the code sections for syntax highlighting;
  3. recovered the images with pdfimages;
  4. converted the table content to CSV (and used Markdown Tables Generator to create Markdown content) and added "Download as CSV" links to the table captions;
  5. made the URLs clickable; and,
  6. added ORCID icons for the authors (where known).
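The table step (4) can also be scripted instead of done by hand in the online generator. A minimal sketch; my own helper, not a tool used in the actual conversion:

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Convert CSV text into a Markdown table, using the first row
    as the header. A minimal sketch of the manual 'Markdown Tables
    Generator' step."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join(" --- " for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

print(csv_to_markdown("name,mass\nwater,18.02\nethanol,46.07\n"))
```

The same CSV file can then be linked from the table caption as the "Download as CSV" target, so the Markdown table and the machine-readable data never drift apart.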
Preview of the self-archived post-print of the second CDK article.
Step 6: tweet the free Green Open Access link
Of course, if no one knows about your effort, they cannot find your self-archived version. In due time, Google Scholar may pick it up, but I am not sure yet. Maybe (Bio) will help, but that is something I have yet to explore.

It's important to include the DOI URL in that link, so that the self-archived version will be linked to from services like

Next steps: get Unpaywall to know about your self-archived version
This is something I am actively exploring. When I know the steps to achieve this, I will report on that in this blog.

Saturday, August 04, 2018

WikiPathways Summit 2018

I was not there when WikiPathways was founded; I only joined in 2012, finding my role in the area of metabolic pathways of this Open knowledge base (CC0, to be precise) of biological processes. This autumn, a WikiPathways Summit 2018 is organized in San Francisco to celebrate the 10th anniversary of the project, and everyone interested is kindly invited to join for three days of learning about WikiPathways, integrations and use cases, data curation, and hacking on this great Open Science project.

Things that I would love to talk about (besides making metabolic pathways FAIR and Openly available) are the integrations with other platforms (Reactome, RaMP, MetaboLights, Pathway Commons, PubChem, Open PHACTS (using the RDF), etc.), Wikidata interoperability, and future interoperability with platforms like AOPWiki, Open Targets, BRENDA, Europe PMC, etc.

Monday, July 09, 2018

Converting any SPARQL endpoint to an OpenAPI

Logo of the grlc project.
Sometimes you run into something awesome. That happened to me one or two months ago, when I found out about a cool project that can convert a random SPARQL endpoint into an OpenAPI endpoint: grlc. Now, May/June was really busy (and the last few weeks before summer not much less so), but at the WikiProject Wikidata for research meeting in Berlin last month, I just had to give it a go.

There is a convenient Docker image, so setting it up was a breeze (see their GitHub repo):

git clone
cd grlc
docker pull clariah/grlc
docker-compose -f docker-compose.default.yml up

What the software does is take a number of configuration files that define what the OpenAPI REST call should look like, and what the underlying SPARQL is. For example, to get all projects in Wikidata with a CORDIS project identifier, we have this configuration file:

#+ summary: Lists grants with a CORDIS identifier
#+ endpoint_in_url: False
#+ endpoint:
#+ tags:
#+   - Grants

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?grant ?grantLabel ?cordis WHERE {
  ?grant wdt:P3400 ?cordis .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}
The full set of configuration files I hacked up (including one with a parameter) can be found here. The OpenAPI then looks something like this:

I haven't played enough with it yet, and I hope we can later use this in OpenRiskNet.

Sunday, July 01, 2018

European Union Observatory for Nanomaterials now includes eNanoMapper

The European Union Observatory for Nanomaterials (EUON) reported about two weeks ago that the observatory added two new data sets, one of which is the eNanoMapper database, which includes the NanoWiki and NANoREG data. Both are exposed using the eNanoMapper database software (see this paper). It is very rewarding to see your work picked up like this, and it motivates me very much for the new NanoCommons project!

The European Union Observatory for Nanomaterials.

LIPID MAPS identifiers and endocannabinoids

Maybe I will find some more time later, but for now just a quick notice of an open notebook I kept yesterday for adding more LIPID MAPS identifiers to Wikidata. It started with a node in a WikiPathways pathway that did not have an identifier: endocannabinoids:
This is why I am interested in Wikidata: I can mint entries there myself (see this ICCS 2018 poster). And so I did, but when adding a chemical class, you want to link specific compounds from that class too. That's where LIPID MAPS comes in, because it has info on the specific compounds in that class.

Some time ago I asked about adding more LIPID MAPS identifiers to Wikidata, which has a lot of benefits for the community and for LIPID MAPS. I was informed I could use their REST API to get mappings between InChIKeys and their identifiers, and that is enough for me to add more of their identifiers to Wikidata (a similar approach to the one I used for the EPA CompTox Dashboard and SPLASHes). The advantages include that LIPID MAPS can now easily get data to add links to the PDB and MassBank to their lipid database (and much more).

My advantage is that I can easily query whether a particular compound is a specific endocannabinoid. I created two Bioclipse scripts, one of which looks like this:

// ask permission to use data from their REST API (I did and got it)

restAPI = ""
propID = "P2063"

allData = bioclipse.downloadAsFile(
  restAPI, "/LipidMaps/lipidmaps.txt"
)

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT (substr(str(?compound),32) as ?wd) ?key ?lmid WHERE {
  ?compound wdt:P235 ?key .
  MINUS { ?compound wdt:${propID} ?lmid . }
}
"""

if (bioclipse.isOnline()) {
  results = rdf.sparqlRemote(
    "", sparql
  )
}

def renewFile(file) {
  if (ui.fileExists(file)) ui.remove(file)
  return file
}

mappingsFile = renewFile("/LipidMaps/mappings.txt")
missingCompoundFile = renewFile("/LipidMaps/missing.txt")

// ignore certain Wikidata items, where I don't want the LIPID MAPS ID added
ignores = new java.util.HashSet();
// ignores.add("Q37111097")

// map the InChIKeys to Wikidata item identifiers
map = new HashMap()
for (i=1;i<=results.rowCount;i++) {
  rowVals = results.getRow(i)
  map.put(rowVals[1], rowVals[0])
}

batchSize = 500
batchCounter = 0
mappingContent = ""
missingContent = ""
new File(bioclipse.fullPath("/LipidMaps/lipidmaps.txt")).eachLine{ line ->
  fields = line.split("\t")
  if (fields.length > 15) {
    lmid = fields[1]
    inchikey = fields[15]
    if (inchikey != null && inchikey.length() > 10) {
      batchCounter++
      if (map.containsKey(inchikey)) {
        wdid = map.get(inchikey)
        if (!ignores.contains(wdid)) {
          mappingContent += "${wdid}\t${propID}\t\"${lmid}\"\tS143\tQ20968889\tS854\t\"\"\tS813\t+2018-06-30T00:00:00Z/11\n"
        }
      } else {
        missingContent += "${inchikey}\n"
      }
    }
  }
  if (batchCounter >= batchSize) {
    ui.append(mappingsFile, mappingContent)
    ui.append(missingCompoundFile, missingContent)
    batchCounter = 0
    mappingContent = ""
    missingContent = ""
    print "."
  }
}
// write out the last (incomplete) batch too
ui.append(mappingsFile, mappingContent)
ui.append(missingCompoundFile, missingContent)
println "\n"

With that, I managed to increase the number of LIPID MAPS identifiers from 2333 to 6099, but there are an additional 38 thousand lipids not yet in Wikidata.

Many more details can be found in my notebook, but in the end I got a nice Scholia page for endocannabinoids :)

Saturday, June 16, 2018

Representation of chemistry and machine learning: what do X1, X2, and X3 mean?

Modelling doesn't always go well and the model is lousy at
predicting the experimental value (yellow).
Machine learning in chemistry, or multivariate statistics, or chemometrics, is a field that uses computational and mathematical methods to find patterns in data. And if you use them right, you can correlate those features to a dependent variable, allowing you to predict it from those features. Example: if you know a molecule has a carboxylic acid group, then it is more acidic.

The patterns (features) and correlation need to be established. An overfitted model will say: if it is this molecule, then the pKa is that, but if it is that molecule, then the pKa is such. An underfitted model will say: if there is an oxygen, then the compound is more acidic. The fields of chemometrics and cheminformatics have a few decades of experience in hitting the right level of fit. But that's a lot of literature. It took me literally a four-year PhD project to get some grip on it (want a print copy for your library?).

But basically all methods work like this: if X is present, then... Whether X is numeric or categorical, X is used to make decisions. And, second, X rarely is the chemical itself, which is a cloud of nuclei and electrons. Instead, it is a representation of the chemical. And that's where one of the difficulties comes in:
  1. one single but real molecular aspect can be represented by both X1 and X2;
  2. two different real molecular aspects can both be represented by X3.
Ideally, every unique aspect has a unique X to represent it, but this is sometimes hard with our cheminformatics toolboxes. As studied in my thesis, this can be overcome by the statistical modelling, but there is some interplay between the representation and modelling.

So, how common are difficulties #1 and #2? Well, I was discussing #1 with a former collaborator at AstraZeneca in Sweden last Monday: we were building QSAR models including features that capture chirality (I think it was a cool project) and we wanted to use the R/S chirality annotation for atoms. However, it turned out this CIP model suffers from difficulty #1: even if the 3D distribution of atoms around a chiral atom (yes, I saw the discussion about using such words on Twitter, but you know what I mean) does not change, a remote change in the structure can flip the R label to an S label in the CIP model. So, we have the exact same single 3D fragment, but an X1 and an X2.

Source: Wikipedia, public domain.
Noel seems to have found another example of this in canonical SMILES. I had some trouble understanding the exact combination of representation and deep neural networks (DNNs), but the above is likely to apply. A neural network has a certain number of input neurons (green, in the image) and each neuron studies one X. So, think of them as neurons X1, X2, X3, etc. Each neuron has weighted links (black arrows) to intermediate neurons (blueish) that propagate knowledge about the modeled system, and those are linked to the output layer (purple), which, for example, reflects the predicted pKa. By tuning the weights, the neural network learns which features are important for which output value: if X1 is unimportant, it will propagate less information (low weight).

So, it immediately visualizes what happens if we have difficulty #1: the DNN needs to learn more weights without more complex data (with a higher chance of overfitting). Similarly, if we have difficulty #2, we still have only one set of paths from a single green input neuron to that single output neuron; one path to determine the outcome of the purple neuron. If trained properly, the network will reduce the weights for such nodes and focus on other input nodes. But the problem is clear too: the original two real molecular aspects cannot be seriously taken into account separately.
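A toy calculation makes difficulty #1 tangible. In the simplest possible "network" (a single weighted sum, no hidden layer), a duplicated input aspect means many different weight vectors give identical predictions, so the extra weight adds fitting freedom without adding information. The numbers below are made up for illustration:

```python
def predict(x, w):
    """Weighted sum of the inputs: the simplest possible 'network'."""
    return sum(xi * wi for xi, wi in zip(x, w))

# X1 and X2 both encode the same real molecular aspect, so they always
# carry the same value; X3 is an independent aspect.
x = [1.0, 1.0, 0.5]

# Shifting weight between the duplicated inputs leaves the output
# unchanged: the model has one more parameter to fit, but the data
# cannot distinguish between these solutions.
w_a = [0.8, 0.0, 2.0]
w_b = [0.3, 0.5, 2.0]
diff = abs(predict(x, w_a) - predict(x, w_b))
print(diff < 1e-12)  # True: indistinguishable predictions
```

The same arithmetic scales up to a real DNN: every extra redundant input multiplies the number of weight configurations that fit the training data equally well, which is exactly the overfitting risk mentioned above.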

What does that mean for Noel's canonical SMILES question? I am not entirely sure, as I would need to know more about how the SMILES is translated into (fed into) the green input layer. But I'm reasonably sure that it involves the two aforementioned difficulties; sure enough to write up this reply... Back to you, Noel!

Saturday, June 02, 2018

Supplementary files: just an extra, or essential information?

I recently had a discussion about supplementary information (additional files, etc):

The Narrative
Journal articles have evolved into an elaborate dissemination channel focusing on the narrative of the new finding. Some journals focus on recording all the details needed to reproduce the work, while others focus just on the narrative, overviews, and impact. Sadly, there are no databases that tell you which journal does what.

One particularly interesting example is Nature Methods, a journal dedicated to scientific methods... one would assume the method to be an important part of the article, right? No, think again, as seen in the PDF of this article (doi:10.1038/s41592-018-0009-z):

Supplementary information
An intermediate solution is the supplementary information part of publications. Supplementary information, also called Additional files, is a way for authors to provide more detail. In a proper Open Science world, it would not exist: all the details would be available in data repositories, databases, electronic notebooks, source code repositories, etc. The community is moving in that direction, and publishers are slowly picking this up (slowly? Yes, more than 7 years ago I wrote about the same topic), but supplementary information remains.

But there are some issues with supplementary information (SI), which I will discuss below. Let me first say that I do not like the idea of supplementary information at all: content is either important to the paper, or it is not. If it is merely relevant, then just cite it.

I am not the only one who finds it inconvenient that SI is not an integral part of a publication.
Problem #1: peer review
The first problem is that SI is not an integral part of the publication. Journals that still print (and therein likely lies the origin of the concept) do not want to print all the details, because it would make the journal too thick. But when an article is peer-reviewed, the content of the main article is reviewed more thoroughly than the SI. Does this cause problems? Sure! Remember this content analysis of Excel files in SI (doi:10.1186/s13059-016-1044-7)?

One of the issues here is that publishers generally do not have computer-assisted peer-review/authoring tools in place; tools that help authors and reviewers get a faster and better overview of the material they are writing and/or reviewing. Excel gene ID/date issues can be prevented, if we only wanted to. The same holds for SMILES strings for chemical compounds; see e.g. How to test SMILES strings in Supplementary Information, but there are much older examples of this, e.g. by the Murray-Rust lab (see doi:10.1039/B411033A).

Problem #2: archiving
A similar problem is that (national/university) libraries do not routinely archive the supplementary information. I cannot go to my library and request the SI of an article that I can read via my library. It gets worse: a few years ago I was told (by someone with inside info) that a big publisher did not guarantee that SI is archived at all:

I'm sure Elsevier has something in place now (which I assume to be based on Mendeley Data), as competitors do too, using Figshare, such as BioMed Central:

But mind you, while Figshare promises archival for a longer period of time, it is nothing that comes close to what libraries provide. So, I still think the Dutch universities must treat such data repositories (and databases) as an essential component of what to archive long-term, just like books.

Problem #3: reuse
A final problem I would like to touch upon very briefly is reuse. In cheminformatics it is quite common to reuse SI from other articles, I assume mostly because the information in such datasets is not available from other resources. With SI increasingly available from Figshare, this may change. However, reuse of SI by databases is rare and also not always easy.

But this needs a totally different story, because there are so many aspects to reuse of SI...

Friday, May 25, 2018

Silverbacks and scientific progress: no more co-authorship for just supervision

A silverback gorilla. CC-BY-SA Raul654.
Barend Mons (GO FAIR) frequently uses the term silverback to refer to more senior scientists effectively (intentionally or not) blocking progress. When Bjoern Brembs posted on G+ today that Stevan Harnad proposed to publish all research online, I was reminded of Mons' gorillas.

My conclusion is basically that every senior scholar (after the PhD) is a silverback. And the older we get, the more back we become, and the less silver. That includes me; I'm fully aware of that. I'm trying to give the PhD candidates I am supervising (Ryan, Denise, Marvin) as much space as I can and to focus only on what I can teach them. Frankly, I am limited in that too: grants put pressure on what the candidates must deliver on top of the thesis.

The problem is the human bias that we prefer to listen to more senior people. Most of us fail that way. It takes great effort to overcome that bias. Off topic, that is one thing which I really like about the International Conference on Chemical Structures that starts this Sunday: no invited speakers, no distinction between PhD candidates and award winners (well, we get pretty close to that); also, organizers and SAB members never get an oral presentation: the silverbacks take a step back.

But 80% of the innovation and discovery we do is progress that is hanging in the air. Serendipity and the availability of, and access to, the right tools (which explains a lot of why "top" universities stay "top") introduce some bias into who is lucky enough to find it. It's privilege.

No more co-authorship for just supervision
Besides the many other things that need serious revision in journal publishing (really, we're trying that at J. Cheminform!), one thing is that we must stop being co-authors on papers just for being the supervisor: if we did not contribute practical research, we should not be co-authors.

Of course, the research world does not work like that, because people count articles (rather than looking at what research someone does); we value grant acquisition more than doing research (only 20% of my time is still research, and even that small amount takes great effort). And full professors are judged on the number of papers they publish, rather than on the amount of research done by the people in his group. Of course, the supervision is essential, but that makes you a great teacher, not an active researcher.

BTW, did you notice that Nobel prizes are always awarded for work to last authors of the papers describing the work, and the award never seems to mention the first author?

BTW, noticed how sneakily the gender bias sneaked in? Just to be clear, female scholars can be academic silverbacks just as well!

Sunday, March 25, 2018

SPLASHes in Wikidata

Mass spectrum from the OSDB (see also this post).
A bit over a year ago I added EPA CompTox Dashboard IDs to Wikidata. Considering that an entry in that database means that something is likely known about the adverse properties of that compound, the identifier can be used as a proxy for that. Better, once the EPA team starts supporting RDF with a SPARQL endpoint, we will be able to do some cool federated queries.

For metabolomics, the availability of mass spectra is of interest for metabolite identification. A while ago the SPLASH was introduced (doi:10.1038/nbt.3689) and adopted by several databases around the world. After the recent metabolomics winter school it became apparent that it is now adopted widely enough to be used in Wikidata. So, I proposed a new SPLASH Wikidata property, which was approved last week (see P4964). The MassBank of North America (MoNA; Fiehn's lab) team made available, as CCZero, a mapping that links the InChI of a compound to the SPLASH identifiers of spectra for that compound.

So, over the weekend I pushed some 37 thousand SPLASHes into Wikidata :)

This is for about 4800 compounds.

Yes, technically I used the same Bioclipse script approach as for the CompTox identifiers, resulting in QuickStatements. Next up are SPLASHes from Chalk's aforementioned OSDB.
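For the curious: a QuickStatements statement for a string-valued property like SPLASH is just a tab-separated line. A rough sketch of a helper for such lines; the function, the placeholder QIDs, and the SPLASH value below are my own illustration, not copied from my actual script:

```python
def quickstatements_line(qid, prop, value, stated_in=None):
    """Build one QuickStatements (v1) tab-separated line for a
    string-valued statement; optionally add a 'stated in' (S248)
    reference. A sketch of the format, not the actual script."""
    line = f'{qid}\t{prop}\t"{value}"'
    if stated_in:
        line += f"\tS248\t{stated_in}"
    return line


# hypothetical item and SPLASH, purely to show the line layout
print(quickstatements_line(
    "Q000000", "P4964", "splash10-0abc-0900000000-deadbeef12345678dead"))
```

Generating thousands of such lines and pasting them into the QuickStatements tool is how a batch like the 37 thousand SPLASHes gets into Wikidata.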

Wednesday, February 21, 2018

When were articles cited by WikiPathways published?

Number of articles cited by curated WikiPathways, using
data in Wikidata (see text).
One of the consequences of the high publication pressure is that we cannot keep up with converting all those facts into knowledge bases. Indeed, publishers, or journals more specifically, do not care so much about migrating new knowledge into such bases. Probably this has to do with the business: they give the impression they are more interested in disseminating PDFs than in disseminating knowledge. Yes, sure, there are projects around this, but they are missing the point, IMHO. But that's the situation, and text mining and data curation will be around for the next decade at the very least.

That makes the up-to-dateness of any database pretty volatile. Our knowledge extends every 15 seconds [0,1], and extracting machine-readable facts accurately (i.e. as the author intended) is not trivial. Thankfully we have projects like ContentMine! Keeping database content up to date is still a massive task. Indeed, I have an (electronic) pile of 50 recent papers of which I want to put the facts into WikiPathways.

That made me wonder how WikiPathways is doing. That is, in which years were the articles published that are cited by pathways from the "approved" collection (the collection of pathways suitable for pathway analysis)? After all, if it does not include the latest knowledge, people will be less eager to use it to analyse their excellent new data.

Now, the WikiPathways RDF only provides the PubMed identifiers of cited articles, but Andra Waagmeester (Micelio) put a lot of this information in Wikidata (mind you, several pathways were already in Wikidata, because they were in Wikipedia). That data is currently not complete. The current count of cited PubMed identifiers (~4200) can be counted on the WikiPathways SPARQL endpoint with:
    PREFIX wp:      <http://vocabularies.wikipathways.org/wp#>
    PREFIX cur:     <http://vocabularies.wikipathways.org/wp#Curation:>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT (COUNT(DISTINCT ?pubmed) AS ?count)
    WHERE {
      ?pubmed a wp:PublicationReference ;
        dcterms:isPartOf ?pathway .
      ?pathway wp:ontologyTag cur:AnalysisCollection .
    }
Wikidata, however, lists at this moment about 1200:
    SELECT (COUNT(DISTINCT ?citedArticle) AS ?count) WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArticle .
    }
Taking advantage of the Wikidata Query Service visualization options, we can generate a graphical overview with this query:
    SELECT (STR(?year) AS ?yearLabel)
           (COUNT(DISTINCT ?citedArticle) AS ?count)
    WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArticle .
      ?citedArticle wdt:P577 ?pubDate .
      BIND (YEAR(?pubDate) AS ?year)
    } GROUP BY ?year
The result is the figure shown at the start (right) of this post.
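Such a query can also be run programmatically rather than through the Query Service UI. Here is a minimal Python sketch, assuming the public WDQS JSON API at query.wikidata.org; the helper functions and the `?yearLabel` alias are my own additions, not from the original post:

```python
import json
import urllib.parse

# The public Wikidata Query Service endpoint (assumption: unchanged URL).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The per-year citation count query, grouping cited articles by
# publication year (P2410 = WikiPathways ID, P2860 = cites work,
# P577 = publication date).
QUERY = """
SELECT (STR(?year) AS ?yearLabel)
       (COUNT(DISTINCT ?citedArticle) AS ?count)
WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?citedArticle .
  ?citedArticle wdt:P577 ?pubDate .
  BIND (YEAR(?pubDate) AS ?year)
} GROUP BY ?year
"""

def request_url(query: str) -> str:
    """Build the GET URL for a WDQS query returning JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return WDQS_ENDPOINT + "?" + params

def counts_per_year(response_text: str) -> dict:
    """Turn a WDQS JSON response body into a {year: count} dictionary."""
    bindings = json.loads(response_text)["results"]["bindings"]
    return {b["yearLabel"]["value"]: int(b["count"]["value"])
            for b in bindings}
```

Fetching `request_url(QUERY)` with any HTTP client and passing the response body to `counts_per_year` yields the same per-year counts that the figure visualizes.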

Saturday, February 17, 2018

FAIR-er Compound Interest Christmas Advent 2017: learnability and citability

Compound Interest infographics
of yesterday.
I love Compound Interest! I love what it does for popularization of the chemistry in our daily life. I love that the infographics have a pretty liberal license.

But I also wish they were more usable. First, the usability is diminished by the lack of learnability; of course, there is not a lot of room in an infographic to give pointers. Second, they do not have DOIs and are hard to cite as a source. That said, the lack of sourcing information may not make them the best source anyway, but let's consider these aspects separately. I would also love to see the ND clause dropped, as it makes it harder to translate these infographics (you do not have the legal permission to do so), and fixing small glitches has to involve Andy Brunning personally.

The latter I cannot change, but the license allows me to reshare the graphics. I contacted Andy and proposed something I wanted to try. This post details some of the outcomes of that.

Improving the citability
This turns out to be the easy part, thanks to the great integration of GitHub and Zenodo. So, I just started a GitHub repository, added the proper license, and copied in the graphics. I wrapped it with some Markdown, taking advantage of another cool GitHub feature, and got this simple webpage:

By making the result a release, it was automatically archived on Zenodo. Now Andy's Compound Interest Christmas Advent 2017 has a DOI: 10.5281/zenodo.1164788:

So, this archive can be cited as:
    Andy Brunning, & Egon Willighagen. (2018, February 2). egonw/ci-advent-2017: Compound Interest Christmas Advent 2017 - Version 1 (Version v1). Zenodo.
Clearly, my contribution is just the archiving and, well, what I did as explained in the next section. The real work is done by Andy Brunning, of course!

Improving the learnability
One of the reasons I love the graphics is that they show the chemicals around us. Just look out of your window and you'll see the chemicals that make flowers colorful, berries taste good, and several plants poisonous. Really, just look outside! You see them now? (BTW, forget about those nanopores and MinIONs, I want my portable metabolomics platform :)

But if I want to learn more about those chemicals (what are their properties, how do I extract them from the plants, how will I recognize them, what toxins am I (deliberately, but in very low doses) eating during lunch, who discovered them, etc., etc.?), those infographics don't help me.

Scholia to the rescue (see doi:10.1007/978-3-319-70407-4_36): using Wikidata (and SPARQL queries), it can tell me a lot about chemicals, and there is a good community that cares about the above questions too and adds information to Wikidata. Make sure to check out WikiProject Chemistry. All it needed was a Scholia extension for chemicals, something we've been working on. For example, check out bornyl acetate (from Day 2: Christmas tree aroma):

This list of identifiers is perhaps not the most interesting, and we're still working out how to make it properly link out with the current code. Also, this compound is not so interesting for properties, but if there is enough information, it can look like this (for acetic acid):

I can recommend exploring the information it provides, and note the links to literature (which may include primary literature, though not in this screenshot).

But I underestimated the situation, as Compound Interest actually includes quite a few compound classes, and I had yet to develop a Scholia aspect for that. Fortunately, I got that finished too (and I love it): it has distinct features and is properly integrated. To give you some idea, here is what phoratoxin (see Day 9: Poisonous Mistletoe) looks like:

Well, I'm sure it will look quite different a year from now, but I hope you can see where this is going. It is essential that we improve the FAIR-ness (see doi:10.1038/sdata.2016.18) of resources, small and large. If projects like Compound Interest set an example, this will show the next generation of scientists how to do science better.

Tuesday, February 06, 2018

PubMed Commons is shutting down

Where NanoCommons has only just started, another Commons, PubMed Commons, is shutting down. There is a lot of discussion about this, from many angles. But the bottom line is: not enough people used it.

That leaves me with the question of what to do with the 39 comments I left on the system (see screenshot on the right). I can copy/paste them to PubPeer, ScienceOpen, or something else. Someone also brought up that those services can go down too (life cycles), so maybe I should archive them?

Or maybe they are not important enough to justify the effort?

I will keep you posted...


Saturday, February 03, 2018

The NanoCommons project has started

NanoCommons is a new European Commission H2020 project that started earlier this year (project id: 731032). Last week we had a kick-off meeting in Salzburg. The objective of the project is (as reported in CORDIS):
    Nanotechnologies and the resulting novel and emerging materials (NEMs) represent major areas of investment and growth for the European economy. Recent advances have enabled confidence in the understanding of what constitutes toxicity of NEMs in relation to health and environmental hazards. However, the nanotechnology and nanosafety communities remain disparate and unconnected, whilst knowledge and data remain fragmented and inaccessible, such that from a data integrating and mining perspective it is clearly a “starting community”. The field, and indeed the European open knowledge economy, requires conversion of these scientific discoveries into legislative frameworks and industrial applications, which can only be achieved through concerted efforts to integrate, consolidate, annotate and facilitate access to the disparate datasets. NanoCommons brings together academia, industry and regulators to facilitate pooling and harmonising of methods and data for modelling, safe-by-design product development and regulatory approval purposes, thereby driving best practice and ensuring maximum access to data and tools. Networking Activities span community needs assessment through development of demonstration case studies (e.g. exemplar regulatory dossiers). Joint Research Activities will integrate existing resources and organise efficient curation, preservation and facilitate access to data/models. Transnational Access will focus on standardisation of data generation workflows across the disparate communities and establishment of a common access procedure for transnational and/or virtual access to the data, and modelling and risk prediction/management tools developed and integrated. Given the extremely rapid pace of development of nanoinformatics, NanoCommons’s detailed workplan will be prescribed for the first 18 months, beyond which it will be co-developed with stakeholders on a rolling call basis to ensure maximum responsiveness to community needs.
The work in Maastricht focuses on community building and ontology work, and will undoubtedly link to Marvin's adverse outcome pathway work in EU-ToxRisk and OpenRiskNet. It will also reuse as much as possible of the work done in the eNanoMapper project.

A Facebook page is being set up (but I don't use FB, so I do not know the link), but you can also follow the NanoCommons Twitter account for project updates. There is also this Scholia page, but that has even less to show at this moment.

Saturday, January 20, 2018

Winter solstice challenge #3: the winner is Bianca Kramer!

Part of the winning submission in the category 'best tool'.
A bit later than intended, but I am pleased to announce the winner of the Winter solstice challenge: Bianca Kramer! Of course, she was the only contender, but her solution is awesome! In fact, I am surprised no one took her tool, ran it on their own data, and just submitted that (which was perfectly well within the scope of the challenge).

Best Tool: Bianca Kramer
The best tool (see the code snippet on the right) uses R and a few R packages (rorcid, rjson, httpcache) and services like ORCID and CrossRef (and the I4OC project), and the (also awesome) project. The code is available on GitHub.

Highest Open Knowledge Score: Bianca Kramer
I did not check the self-reported score of 54%, but since no one challenged her, Bianca wins this category too.
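To make the idea of the score concrete, here is a deliberately simplified Python sketch. It assumes (my assumption, not Bianca's definition) that the score is just the fraction of cited works flagged as openly available; the flag-resolution step, which her R tool does via ORCID and Crossref/I4OC lookups, is left out:

```python
def open_knowledge_score(references):
    """Fraction of cited works that are openly available.

    Each reference is a dict with a boolean 'open' flag; determining
    that flag (e.g. from ORCID and Crossref/I4OC data) is out of scope
    for this sketch.
    """
    if not references:
        return 0.0
    open_count = sum(1 for ref in references if ref.get("open"))
    return open_count / len(references)

# Tiny hypothetical example: one open and one closed citation.
refs = [{"doi": "10.1000/a", "open": True},
        {"doi": "10.1000/b", "open": False}]
print(round(open_knowledge_score(refs) * 100))  # prints 50 (percent)
```

A real implementation would also have to decide how deep to recurse into the references of references, which is exactly the open question raised below.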

So, what next? First, start calculating your own Open Knowledge Scores. Just to be prepared for the next challenge in 11 months. Of course, there is still a lot to explore. For example, how far should we recurse with calculating this score? The following tweet by Daniel Gonzales visualizes the importance so clearly (go RT it!):

We have all been there, and I really think we should not teach our students that it is normal to have to trust your current read and not be able to look up details. I do not know how much time Gonzales spent traversing this trail, but it should not take more than a minute, IMHO. Clearly, any paper in this trail that is not Open will require a look-up, and if your library does not have access, an ILL will make the traversal much, much longer. Unacceptable. And many seem to agree, because Sci-Hub seems to be getting more popular every day. About the latter: almost two years ago I wrote Sci-Hub: a sign on the wall, but not a new sign.

Of course, in the end, it is the scholars who should just make their knowledge open, so that every citizen can benefit from it (keep in mind, a European goal is to give half the population a higher education, so half of the population is basically able to read the primary literature!).

That completes the circle back to the winner. After all, Bianca Kramer has done really important work on how scientists can exactly do that: make their research open. I was shocked to see this morning that Bianca did not have a Scholia page yet, but that is fixed now (though far from complete):

Other papers that should be read more include:
Congratulations, Bianca!

Sunday, January 14, 2018

Faceted browsing of chemicals in Wikidata

A few days ago Aidan and José introduced GraFa on the Wikidata mailing list. It is a faceted browser for content in Wikidata, and the screenshot on the right shows that for chemical compounds. They are welcoming feedback.

GraFa run on things of type chemical compound.
Besides this screenshot, I have not played with it a lot. It looks quite promising; my initial feedback would be a feature to sort the results, and the ability to export the full list to some other tool, e.g. to download all those items as RDF.

Saturday, January 06, 2018

"All things must come to an end"

Cover of the book.
No worries, this is just about my Groovy Cheminformatics book. Seven years ago I started a project that was very educational to me: self-publishing a book. With some outside help I managed to get a book out that sold over 100 copies and that was regularly updated. But therein lies the problem: supply creates demand. So, I had a system that supplied me with an automated setup that reran scripts and recreated text output and even figures for the book (2D chemical diagrams). I wanted to make an edition for every CDK release. All in all, I got quite far with that: eleven editions.

But the current research setting, at least in academia, does not provide me with the means to keep this going. The sad thing is, the hardest part is actually updating the graphics for the cover, which need resizing each time the book gets thicker. But John Mayfield introduced so many API changes that I just did not have the time to update the book. I tried, and I have a twelfth edition on my desk. But where my automated setup scales quite nicely, I don't.

It may be worth reiterating why I started the book. We have had several places where information was given and questions were answered: the mailing list, wiki pages, JavaDoc, the Chemistry Toolkit Rosetta Wiki, and more. Nothing in the book was not already answered somewhere else. The book was just a way for me to answer those questions and provide an easy way for people to get many answers.

Now, because I could not keep up with the recent API changes, I am no longer feeling comfortable with releasing the book. As such, I have "retired" the book.

I am now working out how to move on from here. An earlier edition is already online under a Creative Commons license, and it's tempting to release the latest version like this too. That said, I have also been talking with the other CDK project leaders about alternatives. More on this soon, I guess.

Here's an overview of posts about the book: