## Saturday, August 18, 2018

### Compound (class) identifiers in Wikidata

 Bar chart showing the number of compounds with a particular chemical identifier.
I think Wikidata is a groundbreaking project that will have a major impact on science. Among the reasons are the open license (CCZero), the very basic approach (Wikibase), and the superb community around it. For example, setting up your own Wikibase, including a cool SPARQL endpoint, is easily done with Docker.

Wikidata has many subprojects, such as WikiCite, which captures the body of primary literature. Another one is WikiProject Chemistry. The two match up nicely, I think, making a public database that links chemicals to literature (though much remains to be done here); see my recent ICCS 2018 poster (doi:10.6084/m9.figshare.6356027.v1, paper pending).

But Wikidata is also a great resource for identifier mappings between chemical databases, something we need for our metabolism pathway research. These mappings, as you may know, are used in that research via BridgeDb, and we have been using Wikidata as one of three sources for some time now (the others being HMDB and ChEBI). WikiProject Chemistry has a related ChemID effort, and while the wiki page does not show much recent activity, there is actually a lot of ongoing effort (see plot). And I've been adding my bits.
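As a rough illustration of how such a mapping could be pulled out of Wikidata, here is a minimal Python sketch that builds a SPARQL query for items carrying two external identifiers. The property IDs P231 (CAS) and P2057 (HMDB) are the ones named further down in this post; the function names are my own, and the `wdt:` prefix is assumed to be predefined by the endpoint, as it is on the public Wikidata Query Service. Nothing is sent over the network here; the sketch only builds the query and the request URL.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_mapping_query(source_prop: str, target_prop: str, limit: int = 10) -> str:
    """Return a SPARQL query listing items that carry both identifiers."""
    for prop in (source_prop, target_prop):
        # Wikidata property IDs look like "P231"; reject anything else early.
        if not (prop.startswith("P") and prop[1:].isdigit()):
            raise ValueError(f"not a Wikidata property ID: {prop!r}")
    return (
        "SELECT ?compound ?source ?target WHERE {\n"
        f"  ?compound wdt:{source_prop} ?source ;\n"
        f"            wdt:{target_prop} ?target .\n"
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Return the GET URL that asks the endpoint for a JSON result."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_mapping_query("P231", "P2057")
print(query)
print(build_request_url(query))
```

The resulting URL can be fetched with any HTTP client; each result row then pairs one CAS registry number with one HMDB identifier for the same Wikidata item.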

But not every identifier in Wikidata has the same meaning. While they are all classified as 'external-id', the actual links may have different meanings. This, of course, is the essence of scientific lenses; see this post and the papers cited therein. One reason is the difference in what entries in the various databases mean.

Wikidata has an extensive model, defined by the aforementioned WikiProject Chemistry. For example, it has different concepts for chemical compounds (in fact, the hierarchy is pretty rich) and compound classes, and these are modeled differently. Furthermore, it has a model that formalizes that things with a different InChI are different, but even allows things with the same InChI to be different, if the need arises. It tries to accurately and precisely capture the certainty and uncertainty of the chemistry. As such, it is a powerful system to handle identifier mappings, because databases are not clear, and chemical and biological data are even less so: we experimentally measure a characterization of chemicals, but what we put in databases and give names to are specific models (often chemical graphs).

That model differs from what other (chemical) databases use, or seem to use, because databases do not always indicate what they actually have in a record. But I think this is a fair guess.

ChEBI
ChEBI (and the matching ChEBI ID) has entries for chemical classes (e.g. fatty acid) and specific compounds (e.g. acetate).

PubChem, ChemSpider, UniChem
These three resources use the InChI as central asset. While they do not really have the concept of compound classes so much (though increasingly they have classifications), they do have entries where stereochemistry is undefined or unknown. Each has its own way of linking to other databases, which normally involves tons of structure normalization (see e.g. doi:10.1186/s13321-018-0293-8 and doi:10.1186/s13321-015-0072-8).

HMDB
HMDB (and the matching P2057) has a biological perspective; the entries reflect the biology of a chemical. Therefore, for most compounds, they focus on the neutral forms of compounds. This makes links to and from databases where the compound is not neutral less precise, chemically speaking.

CAS registry numbers
CAS (and the matching P231) is pretty unique itself, and has identifiers for substances (see Q79529), much more than chemical compounds, and comes with its own set of unique features. For example, solutions of some compound, by design, have the same identifier. Previously, formaldehyde and formalin had different Wikipedia/Wikidata pages, both with the same CAS registry number.

Now, returning to our starting point: limitations in linking databases. If we want FAIR mappings, we need to be as precise as possible. Of course, that may mean we need more steps. We can always simplify at will, but we can never have a computer make the links more complex (well, not without making assumptions, etc.).
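To make the "we can always simplify" point concrete, here is a small Python sketch. It stores mappings together with the kind of relation they express and then collapses them into plain identifier pairs under a chosen lens. The relation labels and the identifiers are hypothetical placeholders, not terms from Wikidata or BridgeDb.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    source: str    # identifier in the first database
    relation: str  # how the two records relate (hypothetical labels)
    target: str    # identifier in the second database


def simplify(mappings, accepted_relations):
    """Collapse precise mappings into plain identifier pairs.

    Only relations in accepted_relations are kept; the relation itself is
    dropped, which is the information that cannot be recovered afterwards.
    """
    return sorted(
        {(m.source, m.target) for m in mappings if m.relation in accepted_relations}
    )


# Placeholder identifiers, not real database entries.
precise = [
    Mapping("A:1", "same compound", "B:10"),
    Mapping("A:2", "charged form of", "B:20"),
    Mapping("A:3", "member of class", "B:30"),
]

strict = simplify(precise, {"same compound"})
relaxed = simplify(precise, {"same compound", "charged form of"})
print(strict)
print(relaxed)
```

Going from `precise` to `relaxed` is a one-liner, but nothing in `relaxed` tells you which pair was an exact match and which was a charged form. That direction needs the assumptions mentioned above.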

And that is why Wikidata is so suitable to link all these chemical databases: it can distinguish differences when needed, and make that explicit. It makes mappings between the databases more FAIR.

## Thursday, August 09, 2018

### Alternative OpenAPIs around WikiPathways

I blogged in July about something I learned at a great Wikidata/ERC meeting in June: grlc. It's comparable to but different from the Open PHACTS API: it's a lot more general (and works with any SPARQL endpoint), but also does not have the identifier mapping service (based on BridgeDb) which we need to link the various RDF data sets in Open PHACTS.

Of course, WikiPathways already has an OpenAPI and it's more powerful than what we can do based on just the WikiPathways RDF (for various reasons), but the advantage is that you can expose any SPARQL query (see the examples at rdf.wikipathways.org) on the WikiPathways endpoint. As explained in July, you only have to set up a magic GitHub repository, and Chris suggested showing how this could be used to mimic some of the existing API methods.

The magic
The magic is defined in this GitHub repository, which currently exposes a single method:

```sparql
#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ endpoint: http://sparql.wikipathways.org/
#+ tags:
#+   - Organism list

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wp:   <http://vocabularies.wikipathways.org/wp#>

SELECT DISTINCT (str(?label) as ?organism)
WHERE {
  ?concept wp:organism ?organism ;
           wp:organismName ?label .
}
```
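To show how such a file hangs together, here is a minimal Python sketch that separates the `#+` decorator lines from the SPARQL body and reads the simple `key: value` decorators. It is my own illustration, not how grlc itself parses these files. The decorator lines are copied from the method above, while the query body is a shortened placeholder.

```python
def split_grlc_query(text: str):
    """Split a grlc query file into decorator lines and the SPARQL body."""
    decorators, body = [], []
    for line in text.splitlines():
        if line.startswith("#+"):
            # Drop the "#+" marker and at most one following space, keeping
            # the remaining indentation that nested list items depend on.
            rest = line[2:]
            decorators.append(rest[1:] if rest.startswith(" ") else rest)
        else:
            body.append(line)
    return "\n".join(decorators), "\n".join(body).strip()


def top_level_values(decorators: str) -> dict:
    """Read the simple 'key: value' decorators (nested lists are skipped)."""
    values = {}
    for line in decorators.splitlines():
        if line and not line.startswith(" ") and ":" in line:
            key, _, value = line.partition(":")
            values[key.strip()] = value.strip()
    return values


example = """#+ summary: Lists Organisms
#+ endpoint_in_url: False
#+ endpoint: http://sparql.wikipathways.org/
#+ tags:
#+   - Organism list

SELECT DISTINCT ?organism WHERE { ?concept ?p ?organism } LIMIT 5
"""

meta_text, sparql = split_grlc_query(example)
meta = top_level_values(meta_text)
print(meta["summary"])
print(meta["endpoint"])
print(sparql)
```

The split makes the division of labour visible: the decorators describe the API method (summary, tags, endpoint), and everything else is an ordinary SPARQL query.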


The result
I run grlc in the normal way and point it to egonw/wp-rdf-api and the result looks like:

And executing the method in this GUI (click the light blue bar of the method), results in a nice CSV reply:

Of course, because there is SPARQL behind each method, you can make any query you like, creating any OpenAPI methods that fit your data analysis workflow.
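On the consuming side, a CSV reply like this is easy to pick up in a script. The following Python sketch assumes the reply has a single `organism` column, as the `as ?organism` in the query suggests. The sample reply is made up for illustration and is not actual output from the WikiPathways endpoint.

```python
import csv
import io


def organisms_from_csv(reply: str) -> list:
    """Return the values of the 'organism' column from a CSV reply."""
    reader = csv.DictReader(io.StringIO(reply))
    return [row["organism"] for row in reader if row.get("organism")]


# Made-up reply in the shape the query suggests: one 'organism' column.
sample_reply = "organism\nHomo sapiens\nMus musculus\n"
print(organisms_from_csv(sample_reply))
```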

## Wednesday, August 08, 2018

### Green Open Access: increase your Open Access rate; and why stick with the PDF?

 Icon of Unpaywall, a must-have browser extension for the modern researcher.
Researchers of my generation (and earlier generations) have articles from the pre-Open Access era. Actually, I have even been tricked into closed access later; with a lot of pressure to publish as much as you can (which some see as a measure of your quality), it's impossible not to make an occasional misstep. But then there is Green Open Access (aka self-archiving), a concept I don't like but that is useful in those situations. One reason why I do not like it is that there are many shades of green, and, yes, they all hurt: every journal has special rules. Fortunately, the brilliant SHERPA/RoMEO captures this.

Now, the second event that triggered this effort was my recent experience with Markdown (e.g. the eNanoMapper tutorials) and how platforms like GitHub/GitLab have built systems around it to publish this easily.

Why does this matter to me? If I want my work to have impact, I need people to be able to read it. Open Access is one route. Of course, they can also email me for a copy of the article, but I tend to be busy with getting new grants, supervision, etc. BTW, you can easily calculate your Open Access rate with ImpactStory, something you should try at least once in your life...

Step 1: identify which articles need a green Open Access version
Here, Unpaywall is the right tool, which does a brilliant job at identifying free versions. After all, one of your co-authors may already have self-archived it somewhere. So, yes, I do have a short list, and one of the papers was the second CDK paper (doi:10.2174/138161206777585274). The first CDK article was made CC-BY three years ago, with the ACS AuthorChoice program, but Current Pharmaceutical Design (CPD) does not have that option, as far as I know.
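If you have more than a handful of papers, the same check can be scripted against the Unpaywall REST API instead of the browser extension. The Python sketch below only builds the lookup URL and interprets an already parsed reply; the `is_oa` field is part of the Unpaywall response, while the function names, the email address, and the sample record are placeholders of mine.

```python
from urllib.parse import quote

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; the API asks for a contact email."""
    return f"{UNPAYWALL_API}{quote(doi, safe='/')}?email={quote(email, safe='@')}"


def needs_green_version(record: dict) -> bool:
    """True when the record reports no free-to-read copy at all."""
    return not record.get("is_oa", False)


url = unpaywall_url("10.2174/138161206777585274", "you@example.org")
print(url)

# Hand-written stand-in for a parsed JSON reply, not real API output.
closed_record = {"doi": "10.2174/138161206777585274", "is_oa": False}
print(needs_green_version(closed_record))
```

Looping this over your DOIs gives the short list of articles that still need a self-archived copy.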

Step 2: check your author rights for green Open Access
The next step is to check SHERPA/RoMEO for your self-archiving rights. This is essential, as this is different for every journal; this is basically business model by obscurity, and without any standardization this is not FAIR in any way. For CPD it reports that I have quite a few rights (more than some bigger journals that still rely on Green to call themselves a "leading open access publisher", but also less than some others):

 SHERPA/RoMEO report for CPD.
Many journals do not allow you to self-archive the post-print version. And that sucks, because a preprint is often quite similar, but just not the same deal (which is exactly what closed access publishers want). But being able to post the post-print version is brilliant, because few people actually even kept the last submitted version (again, exactly what closed access publishers want). This report also tells you where you can archive it, and that is not always the same either: it's not uncommon that self-archiving on something like Mendeley or Zotero is not allowed.

Step 3: a post-print version that is not the publisher PDF??
Ah, so you know what version of the article you can archive, and where. But we cannot archive the publisher PDF. So, no downloading of the PDF from the publisher website and putting that online.

Step 4: a custom PDF
Because in this case we are allowed to archive the post-print version, I am allowed to copy/paste the content from the publisher PDF. I can just create a new Word/LibreOffice document with that content, removing the publisher layout and publisher content, and make a new PDF of that. A decent PDF reader allows you to copy/paste large amounts of content in one go, and Linux/Win10 users can use pdfimages to extract the images from the PDF for reuse.
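For the image extraction, a small Python wrapper around pdfimages could look like the sketch below. It assumes the Poppler build of pdfimages, which accepts `-png` and writes files named after the given prefix; the file names and the `figure` prefix are placeholders.

```python
import shutil
import subprocess
from pathlib import Path


def pdfimages_command(pdf: str, out_prefix: str) -> list:
    """Build the pdfimages call; -png asks for PNG output files."""
    return ["pdfimages", "-png", pdf, out_prefix]


def extract_images(pdf: str, out_dir: str) -> list:
    """Run pdfimages if available and return the files it wrote."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages (Poppler utilities) is not installed")
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf, str(target / "figure")), check=True)
    return sorted(target.glob("figure-*"))


# Only prints the command; call extract_images() on a real PDF to run it.
print(pdfimages_command("article.pdf", "images/figure"))
```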

Step 5: why stick with the PDF?
But why would we stick with a PDF? Why not use something more machine readable? Something that supports syntax highlighting, downloading of table content as CSV, etc.? And that made me think of my recent experiments with Markdown.

So, I started off with making a Markdown version of the second CDK paper.

In this process, I:

1. removed hyphenation used to fit words/sentences nicely in PDF columns;
2. wrapped the code sections for syntax highlighting;
3. recovered the images with pdfimages;
4. converted the table content to CSV (and used Markdown Tables Generator to create Markdown content) and added "Download as CSV" links to the table captions;
5. made the URLs clickable; and,
6. added ORCID icons for the authors (where known).
 Preview of the self-archived post-print of the second CDK article.
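Two of the steps listed above, removing the hyphenation and turning table content into Markdown, lend themselves to a bit of scripting. I did these by hand and with Markdown Tables Generator, so the Python sketch below is just one way it could be automated; the sample text and table are made up.

```python
import csv
import io
import re


def dehyphenate(text: str) -> str:
    """Join words that were split across lines with a trailing hyphen."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text (first row is the header) as a Markdown pipe table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(dehyphenate("a chemo-\ninformatics library"))
print(csv_to_markdown("name,value\nalpha,1\nbeta,2\n"))
```

Note that the hyphen rule is blunt: a genuinely hyphenated word that happens to break at the end of a line gets joined too, so the result still needs a read-through.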
Step 6: tweet the free Green Open Access link
Of course, if no one knows about your effort, they cannot find your self-archived version. In due time, Google Scholar may pick it up, but I am not sure yet. Maybe (Bio)Schemas.org will help, but that is something I have yet to explore.

It's important to include the DOI URL in that link, so that the self-archived version will be linked to from services like Altmetric.com.