Sunday, December 27, 2015

The quality of SMILES strings in Wikidata

Russian Wikipedia on tungsten hexacarbonyl.
One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one of them. For SMILES strings, one thing you can test is whether the string parses at all. Wikidata is machine readable and, in fact, easier to parse than Wikipedia, for which the SMILES strings were recently validated in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:
  1. SPARQL for all SMILES strings
  2. process each one of them with the CDK SMILES parser
I can do both easily in Bioclipse with an integrated script:

identifier = "P233" // SMILES
type = "smiles"

sparql = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
"""
mappings = rdf.sparqlRemote("https://query.wikidata.org/sparql", sparql)

outFilename = "/Wikidata/badWikidataSMILES.txt"
if (ui.fileExists(outFilename)) ui.remove(outFilename)
fileContent = ""
for (i=1; i<=mappings.rowCount; i++) {
  try {
    wdID = mappings.get(i, "compound")
    smiles = mappings.get(i, "smiles")
    mol = cdk.fromSMILES(smiles)
  } catch (Throwable exception) {
    fileContent += (wdID + "," + smiles + ": " +
                   exception.message + "\n")
  }
  if (i % 1000 == 0) js.say("" + i)
}
ui.append(outFilename, fileContent)

It turns out that out of the more than 16 thousand SMILES strings in Wikidata, only 42 could not be parsed. That does not mean the others are all correct, but it does mean these 42 are wrong. Many of them turned out to be imported from the Russian Wikipedia, which is nice, as it gives me the opportunity to work in that Wikipedia instance too :)
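A full parse needs a cheminformatics toolkit like the CDK used above, but even a toolkit-free sanity check catches some of these failures. Here is a minimal, hypothetical pre-check in Python (not part of the workflow above) that only tests whether brackets in a SMILES string are balanced:

```python
# Minimal SMILES sanity pre-check: only verifies that parentheses and
# square brackets are balanced. A real parser (like the CDK above)
# catches far more errors; this is just a cheap first filter.
def smiles_brackets_balanced(smiles):
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(smiles_brackets_balanced("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: True
print(smiles_brackets_balanced("CC(=O)Oc1ccccc1C(=O"))    # truncated: False
```

A check like this will not find wrong valences or bad aromaticity, but it does flag the copy-paste truncations that show up surprisingly often in curated lists.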

At this moment, some 19 SMILES strings still need fixing (the list will change over time, so by the time you read this...):

Tuesday, December 22, 2015

New Edition! Getting CAS registry numbers out of WikiData

Source: Wikipedia. CC-BY-SA

In April this year I blogged about an important SPARQL query for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:
  1. CAS works together with Wikimedia on a large, free CAS-to-structure database
  2. Wikidata is CCZero
The original effort validated about eight thousand registry numbers, made available via Wikipedia and the Common Chemistry website. However, the effort did not stop there, and Wikipedia now contains many more CAS registry numbers. In fact, Wikidata picked up many of these and now lists almost twenty thousand CAS numbers. That well exceeds what databases are allowed to aggregate and make available.
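A CAS registry number also carries its own internal consistency check: the last digit is a checksum, computed by multiplying every other digit by its position counted from the right and taking the sum modulo 10. A small Python sketch of that rule (not part of the SPARQL workflow, just a handy extra validation):

```python
# Validate the CAS registry number check digit: strip the hyphens,
# multiply each digit (except the last) by its 1-based position
# counted from the right, and compare the sum modulo 10 against the
# final (check) digit.
def valid_cas(cas):
    digits = cas.replace("-", "")
    if not digits.isdigit():
        return False
    body, check = digits[:-1], int(digits[-1])
    total = sum(int(d) * pos
                for pos, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(valid_cas("7732-18-5"))  # water: True
print(valid_cas("7732-18-4"))  # wrong check digit: False
```

So besides testing whether the SMILES parses, one can also cheaply flag mistyped CAS numbers in a Wikidata dump.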

Since the post in April, Wikidata has put online a new SPARQL endpoint and created "direct" property links. This way you lose the provenance information, but the query becomes simpler:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?id WHERE {
      ?compound wdt:P231 ?id .
    }
The other thing that changed since April is that others and I requested the creation of more compound identifiers, and here's an overview along with the current number of such identifiers in Wikidata:
Clearly, some identifiers are not well populated yet. This is what bots are for, like those used by the Andrew Su team.

Because there is also a predicate for SMILES, we can create a query that puts the CAS registry number alongside the SMILES (or any other identifier):
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?id ?smiles WHERE {
      ?compound wdt:P231 ?id ;
                wdt:P233 ?smiles .
    }
Of course, then the question is: are these SMILES strings valid... And, importantly, this is nothing compared to the number of chemical compounds we know about, which is currently in the order of 100 million, of which a quarter can be readily purchased:

Willighagen, E., 2015. Getting CAS registry numbers out of WikiData. The Winnower.

Using the WikiPathways API in R

Colored pathways created with
the new R package.
Earlier this week there was a question on the WikiPathways mailing list about the webservices. There are older SOAP webservices and newer REST-like webservices, which come with a nice Swagger web frontend set up by Nuno. Of course, both approaches are pretty standard and you can use them from basically any environment. Still, some people prefer not to see technical details: "why should I know how a car engine works". I do not think any scholar is allowed to use this argument, but alas...

Of course, hiding those details is not so much of an issue, and since I have made so many R packages in the past, I decided to pick up the request to create an R package for WikiPathways: rWikiPathways. It is not feature complete yet, and not extensively tested in daily use yet (there is a test suite). But here are some code examples. Listing pathways and organisms in the wiki is done with:
    organisms = listOrganisms()
    pathways = listPathways()
    humanPathways = listPathways(organism="Homo sapiens")
For the technology-oriented users: you have access to the GPML source file for each pathway:
    gpml = getPathway(pathway="WP4")
    gpml = getPathway(pathway="WP4", revision=83654)
However, most use will likely be via database identifiers for genes, proteins, and metabolites, called Xrefs (also check out the R package for BridgeDb):
    xrefs = getXrefList(pathway="WP2338", systemCode="S")
    pathways = findPathwaysByXref("HMDB00001", "Ch")
    pathways = findPathwaysByXref(identifier="HMDB00001", systemCode="Ch")
    pathways = findPathwaysByXref(
      identifier=c("HMDB00001", "HMDB00002"),
      systemCode=c("Ch", "Ch")
    )
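Under the hood these R calls are thin wrappers around HTTP requests to the WikiPathways REST service. As a rough sketch in Python of what such a request URL looks like — the host is the webservice described above, but the exact query-parameter names ("ids", "codes") are my assumption here, not taken from the rWikiPathways source:

```python
# Sketch of the kind of URL behind findPathwaysByXref(). Host is the
# WikiPathways REST webservice; the parameter names "ids" and "codes"
# are assumptions for illustration.
from urllib.parse import urlencode

BASE = "https://webservice.wikipathways.org"

def find_pathways_by_xref_url(identifier, system_code, fmt="json"):
    query = urlencode({"ids": identifier,
                       "codes": system_code,
                       "format": fmt})
    return f"{BASE}/findPathwaysByXref?{query}"

print(find_pathways_by_xref_url("HMDB00001", "Ch"))
```

The point of the R package is exactly that users never need to assemble URLs like this themselves.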
Of course, these are just the basics, and the question was about colored pathways. The SOAP code was a bit more elaborate, and this is the version with this new package (the result at the top of this post):
    svg = getColoredPathway(pathway="WP1842", graphId=c("dd68a","a2c17"),
      color=c("FF0000", "00FF00"));
    writeLines(svg, "pathway.svg")
If you use this package in your research, please cite the WikiPathways paper below. If you have feature requests, please post them in the issue tracker.

Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Oct. 2015. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research.

Sunday, December 13, 2015

SWAT4LS in Cambridge

Wordle of the #swat4ls tweets.
Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meetings in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

I have mixed feelings about missing half of the workshops on Monday for a visit of one of our Open PHACTS partners, but do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past two or so years, it was nice to see the progress in the field. Some papers I found interesting:
But the rest is most worthwhile checking out too! Webulous I was able to get going with some help (I was not paying enough attention to the GUI) for eNanoMapper:

A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took a long time before the conference audience really started joining in. Partly this was because the conference organization asked the community for questions, and the questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe semantic web applications and tools are not really a rate-limiting factor anymore; if we really want to make a difference, we must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, but much of it is limited because our writing style makes nice stories and yields few machine-readable facts.

The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there just were too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

HDT (header, dictionary, triple) is a not-so-new-but-under-the-radar technology for storing triples in binary form in a file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed. That opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.
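The "dictionary" part is what makes HDT compact: every distinct RDF term is stored once, and each triple becomes a tuple of integer IDs into that dictionary. A toy illustration of just that encoding idea in Python (this shows the concept only and has nothing to do with the actual hdt-java binary layout, which additionally bit-packs and indexes these structures):

```python
# Toy dictionary encoding of triples, illustrating the "D" in HDT:
# each distinct term is interned once, and triples are reduced to
# three integer IDs each.
def encode(triples):
    term_to_id, id_to_term = {}, []
    def intern(term):
        if term not in term_to_id:
            term_to_id[term] = len(id_to_term)
            id_to_term.append(term)
        return term_to_id[term]
    encoded = [tuple(intern(t) for t in triple) for triple in triples]
    return id_to_term, encoded

dictionary, ids = encode([
    ("ex:aspirin", "ex:type", "ex:Compound"),
    ("ex:caffeine", "ex:type", "ex:Compound"),
])
print(dictionary)  # every term stored once
print(ids)         # triples as integer tuples
```

Shared terms (here the predicate and the object) are stored only once, which is why repetitive RDF compresses so well in this representation.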

This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial (the New Wizard does the hard work in seconds), though setting up a Bioclipse development environment is less so. But then started the dependency hacking. The Jena version it depends on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open a HDT file. I first had the API method to read it into memory, but that's not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL-ing support was easy; I cannot use the old SPARQL-ing code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which is written by me in the first place, doi:10.1186/2041-1480-2-s1-s6, interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1GB of heap space, for example). However, the index file can be shared (downloaded), and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as an example data set:
However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not all compounds. If that works in less than 0.1 seconds, then this may give a nice user experience.

But before I reach that, it needs a bit more hacking:
  1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
  2. ensure I can query over more than one HDT file
And probably a bit more.

Wikidata and WikiPathways
After the coffee break I joined the Wikidata people and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I personally rather use a solution that services all needs. If only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

That is, would it not be nice if people could just pull the links between biological entities and WikiPathways from Wikidata, using one of the many APIs it has (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of their prime interest - they did not like the idea of links to the Crystallography Open Database much yet, with the argument that it is a one-to-many relation - though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:

Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can together get this right.

Chemistry Development Kit
There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. And I visited NextMove Software on Friday, where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation!), as did John during his PhD in Chris Steinbeck's group at the EBI. During the conference I also spoke with others about the CDK, and I am following up on these conversations.

Tuesday, November 24, 2015

Databasing nanomaterials: substance APIs

Cell uptake of gold nanoparticles
in human cells. Source. CC-BY 4.0
Nanomaterials are quite interesting from a science perspective: first, they are materials and not so well-defined as such. They can best be described as a distribution of similar nanoparticles. That is, unlike small compounds, which we commonly describe as pure materials, nanomaterials have a size distribution, surface differences, etc. But akin to the QSAR paradigm, because they are similar enough, we can expect similar interaction effects, and thus treat them as the same. A nanomaterial is basically a large collection of similar nanoparticles.

Until they start interacting, of course. Cell membrane penetration is studied at the single-nanoparticle level, and they make interesting pictures of that (see top left). Or when we do computation: then too, we typically study a single material. On the other hand, many nanosafety studies work with the materials at a certain dosage. They study cell death, transcriptional changes, etc., when the material is brought into contact with some biosample.

The synthesis is equally interesting. Because of the nature of many manufacturing processes (and the literature synthesizing new materials is enormous), it is typically not well understood what the nanomaterial or even the nanoparticle looks like. This is overcome by studying the bulk properties and reporting some physicochemical properties, like the size distribution, with methods like DLS and TEM. The field just lacks the equivalent of what NMR is for (small) (organic) compounds.
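To make the "distribution, not a single value" point concrete, here is a hypothetical toy sketch: summarizing a set of measured particle diameters (the numbers below are invented, not from any real TEM experiment) by a number-average size and the relative width of the distribution:

```python
# Hypothetical particle-sizing summary: a nanomaterial is reported as
# a distribution of particle diameters, not a single size. The
# diameters (in nm) are invented example data.
from statistics import mean, stdev

diameters = [18.2, 20.1, 19.5, 22.3, 17.8, 21.0, 19.9, 20.6]

avg = mean(diameters)
sd = stdev(diameters)
cv = sd / avg  # coefficient of variation: relative width of the distribution

print(f"number-average diameter: {avg:.1f} nm")
print(f"spread (CV): {cv:.2%}")
```

Two materials with the same average diameter but very different spreads are, for safety purposes, not the same material, which is exactly why databases need to capture the distribution.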

Now, try capturing this in a unified database. That's exactly what eNanoMapper is doing, and with a modern approach. It's a database project, not a website project. We develop APIs and test all aspects of the database extensively using test data. Of course, using the API we can easily create websites (there currently are JavaScript and R client libraries), and we have done so. It's great to be working with so many great domain specialists who get things done!

There is a lot to write and discuss about this, but I will end now by just pointing you to our recent paper outlining much of the cheminformatics of this new nanosafety database solution.

Of course, we study in our group the nanosafety and nanoresponse (think nanomedicine) at a systems biology level. So, here's the obligatory screenshot of the work of one of our interns (Stan van Roij). Not fully integrated with the database yet, though.

Jeliazkova, N., Chomenidis, C., Doganis, P., Fadeel, B., Grafström, R., Hardy, B., Hastings, J., Hegi, M., Jeliazkov, V., Kochev, N., Kohonen, P., Munteanu, C. R., Sarimveis, H., Smeets, B., Sopasakis, P., Tsiliki, G., Vorgrimmler, D., Willighagen, E., Jul. 2015. The eNanoMapper database for nanomaterial safety information. Beilstein Journal of Nanotechnology 6, 1609-1634.

Sunday, November 22, 2015

Twitter at conferences

I have been happily tweeting the BioMedBridges meeting in Hinxton last week using the #lifesciencedata hashtag, along with more than 100 others, though only a small subset was really active. A lot has been published about using Twitter at conferences, like the recent paper by Ekins et al. (doi:10.1371/journal.pcbi.1003789).

The backchannel discussions only get better when more and more people join, and when complementary information is passed around. For example, I tend to tweet links to papers that appear on slides, chemical/protein structure mentioned, etc. I have also started tweeting ORCID identifiers of the speakers if I can find them, in addition to adding them to a Lanyrd page.

Like at most meetings, people ask me about this tweeting. Why I do it? Doesn't it distract you from the presentation? I understand these questions.

First, I started recording my notes of meetings electronically during my PhD thesis, because I needed to write a summary of each meeting for my funder. So, when Twitter came along, and after I had already built up some experience blogging summaries of meetings, I realized that I might as well tweet my notes. And since I was looking up DOIs of papers anyway, the step was not big. The effect, however, was huge. People started replying, some at the conference itself, some not. This resulted in a lot of meetings with people at the conference. Tweetups do not regularly happen anymore, but it's a great first line for people, "hey, aren't you doing all that blogging", and before you know it, you are talking science.

Second, no, it does not significantly distract me from the talk. First, like listening to a radio while studying, it keeps me focused. Yes, I very much think this differs from person to person, and I am not implying that it generally is not distracting. But it keeps me busy, which is very useful during some talks, when people in the audience otherwise start reading email. If I look up details (papers, project websites, etc) from the talk, I doubt I am more distracted than some others.

Third: what about keeping up? Yes, that's a hard one, and I was beaten in coverage speed by others during this meeting. That was new to me, but I liked it. Sadly, some of the most active people left the meeting after the first day. So, I was wondering how I could speed up my tweeting, or, alternatively, how it could take me less time, so that I can read more of the other tweets. Obvious candidates for automation are the tweets with additional information, like links to papers.

So, I started working on some R code to help me tweet faster and, using the great collection of rOpenSci packages, I have come up with the first two helper methods. In both examples, I am using an #example hashtag.

Tweeting papers
This makes use of the rcrossref package to fetch the name of the first author and title of the paper.

Tweeting speakers
Or perhaps, tweeting people. This bit of code makes use of the rorcid package.

Of course, you are more interested in the code than in the screenshots, so here it is (public domain; I may make a package out of this):


library(twitteR)
library(rorcid)
library(rcrossref)

setup_twitter_oauth("YOURINFO", "YOURINFO")

tweetAuthor = function(orcid=NULL, hashtag=NULL) {
  person = as.orcid(orcid)
  firstName = person[[1]]$"orcid-bio"$`personal-details`$`given-names`$value
  surname = person[[1]]$"orcid-bio"$`personal-details`$`family-name`$value
  orcidURL = person[[1]]$"orcid-identifier"$uri
  tweet(
    paste(firstName, " ", surname, " orcid:",
          orcid, " ", orcidURL, " #", hashtag, sep="")
  )
}

tweetPaper = function(doi=NULL, hashtag=NULL) {
  info = cr_cn(dois=doi, format="citeproc-json")
  tweet(
    paste(
      info$author[[1]]$family, " et al. \"",
      substr(info$title, 0, 60), "...\" ",
      "https://doi.org/", info$DOI, " #", hashtag, sep=""
    )
  )
}

Getting your twitteR to work (the authentication, that is) may be the hardest part. I do plan to add further methods like: tweetCompound(), tweetProtein(), tweetGene(), etc...

Got access to literature?

Got access to literature? Only yesterday I discovered that resolving some Nature Publishing Group DOIs does not necessarily lead to useful information. High-quality metadata about literature is critical for the future of science. Elsevier just showed how creative publishers can be in interpreting laws and licenses (doi:10.1038/527413f).

So, it may be interesting to regularly check your machine readable Open Access metadata. ImpactStory helps here with their Open Access Badge. New to me was what Daniel pointed me to: dissemin (@disseminOA). Just pass your ORCID and you end up with a nice overview of what the world knows about the open/closed status of your output.

I would not say my report is flawless, but that nicely shows how important it is to get this flow of metadata right! For example, there are some data sets and abstracts detected as publications; to be fair, I think this is to a good extent due to my inability to annotate them properly in my ORCID profile.

WikiPathways: capturing the full diversity of pathway knowledge

Figure from the new NAR paper.
Biology is a complex matter. The biological matter indeed involves many different chemicals in very many temporospatial forms: small compounds may be present in different charge states (proteins too, of course), tautomers, etc. Proteins may exhibit isoforms, various post-translational modifications, etc. Genes show structures we are only now starting to see: the complex structures in the nucleus have been invisible to mankind until some time ago. Likewise, the biological processes, encoded as pathways, cover an equal amount of complexity.

WikiPathways is a community run pathway database, similar to others like KEGG, Reactome, and many others. One striking difference is the community approach of WikiPathways: anyone can work on or extend the content of the database. This makes WikiPathways exciting to me: it encodes very different bits of biological knowledge, and a key reason why I joined Chris Evelo's team almost four years ago. Importantly, this community is supported by a lively and reasonably sized (>10 people and growing) curation team, primarily located at Maastricht University and the Gladstone Institutes.

The newest paper in NAR (doi:10.1093/nar/gkv1024) outlines some recent developments and the growth of the database. There is still so much to do, and given the current speed at which we learn new biological patterns, this will not get less soon.

Want to help? Sign up, enlist your ORCID! Need ideas for what you can do? Why not take a recent paper you published (or read), take a new biological insight, look up an appropriate pathway, and add that paper. If a paper you published presents a novel pathway or an important new biological insight, why not convert the figure from that paper into a machine-readable pathway?

Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Oct. 2015. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research.

RRegrs: exploring the space of possible regression models

Machine learning is a field of science that focusses on mathematically describing patterns in data. Chemometrics does this for chemical data. Examples are (nano)QSAR, where structural information is related to biological activity. During my PhD studies I looked at the interaction between statistics and machine learning and how you computationally (numerically) represent the question. The right combination is not obvious, and it has become common to try various modelling methods; support vector machines (SVM/SVR) and more recently neural networks (deep learning) have become popular. A simpler model, however, has its benefits too and is frequently not significantly worse than more complex models. That said, exploring all machine learning methods manually takes a lot of time, as each comes with its own parameters which need varying.
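To make the "a simple model is often good enough" point concrete, here is a small self-contained sketch (plain Python, not RRegrs) that fits an ordinary least-squares line to near-linear toy data; when a two-parameter model already explains nearly all the variance, a more flexible model has little room to improve:

```python
# Closed-form ordinary least squares on invented, near-linear toy
# data (roughly y = 2x + 1). The point: the simple model already
# reaches R^2 close to 1, so complex models cannot add much here.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 13.2, 14.9]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r2:.3f}")
```

RRegrs automates exactly this kind of fit-and-score loop across many model families and reports the comparison statistics for you.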

Georgia Tsiliki (NTUA partner in eNanoMapper), Cristian Munteanu (former postdoc in our group), and others developed RRegrs, an R package to explore the various models and automatically calculate a number of statistics allowing to compare them (doi:10.1186/s13321-015-0094-2). That said, following my thesis, you must never blindly rely on performance statistics, but the output of RRegrs may help you explore the full set of models.

Tsiliki, G., Munteanu, C. R., Seoane, J. A., Fernandez-Lozano, C., Sarimveis, H., Willighagen, E. L., Sep. 2015. RRegrs: an r package for computer-aided model selection with multiple regression models. Journal of Cheminformatics 7 (1), 46.

Saturday, October 31, 2015

So, now you have SMILES that are faulty... visualize them?

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I blogged previously already about using AMBIT for SMILES depiction (which uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May released a CDK-only tool, called CDK Depict. The download section offers a jar file and a war for easy deployment in a Tomcat environment. But for the impatient, there is also this online host where you can give it a try (it may go offline at some point?).

Just copy/paste your shortlist there, and visually see what is wrong with them :) Big HT to John for doing all these awesome things!

How to test SMILES strings in Supplementary Information

Source. License: CC-BY 2.0.
When you stumble upon a nice paper describing a new predictive or explanatory model for a property or a class of compounds that has your interest, the first thing you do is test the training data. For example, validating SMILES (or OpenSMILES) strings in such data files is now easy with the many Open Source tools that can parse SMILES strings: the Chemistry Toolkit Rosetta provides many pointers for parsing SMILES strings. I previously blogged about a CDK/Groovy approach.

Cheminformatics toolkits need to understand what the input is, in order to correctly calculate descriptors. So, let's start there. It does not matter so much which toolkit you use and I will use the Chemistry Development Kit (doi:10.1021/ci025584y) here to illustrate the approach.

Let's assume we have a tab-separated values file, with the compound identifier in the first column and the SMILES in the second column. That can easily be parsed in Groovy. For each SMILES we parse it and determine the CDK atom types. For validation of the supplementary information we only want to report the fails, but let's first show all atom types:

import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
import org.openscience.cdk.atomtype.CDKAtomTypeMatcher;

parser = new SmilesParser(
  SilentChemObjectBuilder.getInstance()
)
matcher = CDKAtomTypeMatcher.getInstance(
  SilentChemObjectBuilder.getInstance()
)

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") { // header line
    mol = parser.parseSmiles(smiles)
    println "$id -> $smiles";

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        println "  no CDK atom type"
      } else {
        println "  atom type: " + type.atomTypeName
      }
    }
  }
}

This gives output like:

mo1 -> COC
  atom type: C.sp3
  atom type: O.sp3
  atom type: C.sp3

If we rather only report the errors, we make some small modifications and do something like:

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)
    errors = 0
    report = ""

    // check CDK atom types
    types = matcher.findMatchingAtomTypes(mol);
    types.each { type ->
      if (type == null) {
        errors += 1;
        report += "  no CDK atom type\n"
      }
    }

    // report
    if (errors > 0) {
      println "$id -> $smiles";
      print report;
    }
  }
}

Alternatively, you can use the InChI library to do such checking. And here too, we will use the CDK and the CDK-InChI integration (doi:10.1186/1758-2946-5-14).

import org.openscience.cdk.inchi.InChIGeneratorFactory;
import net.sf.jniinchi.INCHI_RET;

factory = InChIGeneratorFactory.getInstance();

new File("suppinfo.tsv").eachLine { line ->
  fields = line.split(/\t/)
  id = fields[0]
  smiles = fields[1]
  if (smiles != "SMILES") {
    mol = parser.parseSmiles(smiles)

    // check InChI warnings
    generator = factory.getInChIGenerator(mol);
    if (generator.returnStatus != INCHI_RET.OKAY) {
      println "$id -> $smiles";
      println generator.message;
    }
  }
}

The advantage of doing this, is that it will also give warnings about stereochemistry, like:

mol2 -> BrC(I)(F)Cl
  Omitted undefined stereo

I hope this gives you some ideas on what to do with content in supplementary information of QSAR papers. Of course, this works just as well for MDL molfiles. What kind of validation do you normally do?
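If you have no cheminformatics toolkit at hand, even a crude pre-filter catches some broken strings before they reach a real parser. Here is a minimal, library-free sketch in Python (the function name and allowed character set are my own assumptions, not part of any toolkit): it only tests for balanced brackets and parentheses and obviously illegal characters, so a real validation still needs a proper parser like the CDK one used above.

```python
import re

# Characters that commonly occur in SMILES; anything else is suspicious.
ALLOWED = re.compile(r'^[A-Za-z0-9@+\-\[\]()=#$%/\\.:*]+$')

def looks_like_smiles(smiles):
    """Crude sanity check: character set plus balanced (), []."""
    if not smiles or not ALLOWED.match(smiles):
        return False
    open_count = {'(': 0, '[': 0}
    closer = {')': '(', ']': '['}
    for ch in smiles:
        if ch in open_count:
            open_count[ch] += 1
        elif ch in closer:
            open_count[closer[ch]] -= 1
            if open_count[closer[ch]] < 0:
                return False  # a closing bracket before its opening one
    return open_count['('] == 0 and open_count['['] == 0

print(looks_like_smiles("BrC(I)(F)Cl"))  # a parseable SMILES
print(looks_like_smiles("C(C"))          # unbalanced parenthesis
```

Of course, this accepts plenty of nonsense ("CCCC(((...)))" style errors aside), which is exactly why the CDK and InChI checks above are the real test.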

Sunday, September 27, 2015

Coding an OWL ontology in HTML5 and RDFa

There are many fancy tools to edit ontologies. I like simple editors, like nano. And like any hacker, I can hack OWL ontologies in nano. The existence of all those fancy tools may suggest OWL was never meant to be edited in a plain text editor; I am not sure that is really true. Anyway, HTML5 and RDFa will do fine, and here is a brief write-up. This post will not cover the basics of RDFa and does assume you already know how triples work. If not, read this RDFa primer first.

The BridgeDb DataSource Ontology
This example uses the BridgeDb DataSource Ontology, created by BridgeDb developers from Manchester University (Christian, Stian, and Alasdair). The ontology describes data sources of identifiers, a technology outlined in the BridgeDb paper by Martijn (see below), as well as terms from the Open PHACTS Dataset Descriptions for the Open Pharmacological Space by Alasdair et al.

I needed to put this online for Open PHACTS (BTW, the project won a big award!) because our previous solution did not work well enough anymore. You may want to see the HTML of the result first. You may also want to verify it really is HTML: here is the HTML5 validation report. Also, you may be interested in what the ontology looks like in RDF: here is the extracted RDF for the ontology. Now follow the HTML+RDFa snippets. First, the ontology details (actually, I have it split up):

<div about=""
  <h1>The <span property="rdfs:label">BridgeDb DataSource Ontology</span>
    (version <span property="owl:versionInfo">2.1.0</span>)</h1>
    This page describes the BridgeDb ontology. Make sure to visit our
    <a property="rdfs:seeAlso" href="">homepage</a> too!
<p about="">
  The OWL ontology can be extracted
  <a property="owl:versionIRI"
  The Open PHACTS specification on
  <a property="rdf:seeAlso"
  >Dataset Descriptions</a> is also useful.

This is the last time I show the color coding, but for a first time it is useful. In red are basically the predicates: @about indicates a new resource is started, @typeof defines the rdf:type, and @property indicates all other predicates. The blue and green blobs are literals and object resources, respectively. If you work this out, you get this OWL code (more or less):

bridgedb: a owl:Ontology;
  rdfs:label "BridgeDb DataSource Ontology"@en;
  rdfs:seeAlso <>;
  owl:versionInfo "2.1.0"@en .

An OWL class
Defining OWL classes uses the same approach: define the resource it is @about, define the @typeof, and give it properties. BTW, note that I added an @id so that ontology terms can be looked up using the HTML # functionality. For example:

<div id="DataSource"
  <h3 property="rdfs:label">Data Source</h3>
  <p property="dc:description">A resource that defines
    identifiers for some biological entity, like a gene,
    protein, or metabolite.</p>

An OWL object property
Defining an OWL object property is pretty much the same, but note that we can arbitrarily add additional content, making use of <span>, <div>, and <p> elements. The following example also defines the rdfs:domain and rdfs:range:

<div id="aboutOrganism"
  <h3 property="rdfs:label">About Organism</h3>
  <p><span property="dc:description">Organism for all entities
    with identifiers from this datasource.</span>
    This property has
    <a property="rdfs:domain"
    as domain and
    <a property="rdfs:range"
    as range.</p>

So, now anyone can host an OWL ontology with dereferenceable terms; to remove confusion, I have used the full URLs of the terms in @about attributes.
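To make the "triples hiding in markup" idea concrete, here is a toy extractor in Python using only the standard library. It is emphatically not a conforming RDFa processor (no @about chaining, no prefix resolution, no @typeof); the class and its behavior are my own sketch, just to show how each @property pairs up with its text content or @href:

```python
from html.parser import HTMLParser

class ToyRDFa(HTMLParser):
    """Pairs each @property attribute with its text content or @href."""
    def __init__(self):
        super().__init__()
        self.triples = []
        self._open = []  # one entry per open tag: None or [prop, href, text]

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "property" in a:
            self._open.append([a["property"], a.get("href"), ""])
        else:
            self._open.append(None)

    def handle_data(self, data):
        # text belongs to every enclosing element that carries @property
        for entry in self._open:
            if entry is not None:
                entry[2] += data

    def handle_endtag(self, tag):
        if self._open:
            entry = self._open.pop()
            if entry is not None:
                prop, href, text = entry
                self.triples.append((prop, href if href else text.strip()))

snippet = ('<h1>The <span property="rdfs:label">BridgeDb DataSource '
           'Ontology</span> (version <span property="owl:versionInfo">'
           '2.1.0</span>)</h1>')
p = ToyRDFa()
p.feed(snippet)
print(p.triples)
```

For real use, pick a proper RDFa distiller; but this is roughly what such a tool does when it turns the page above into the Turtle shown earlier.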

 Van Iersel, M. P., Pico, A. R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Conklin, B. R., Evelo, C. T., Jan. 2010. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11 (1), 5+.

Saturday, September 19, 2015

#Altmetrics on CiteULike entries in R

I wanted to know when a set of publications I was aggregating on CiteULike was published: the number of publications per year, for example. I did a quick Google but could not find an R package that acts as a client to the CiteULike API, and because I wanted to play with JSON in R anyway, I created a citeuliker package. Because I'm a liker of CiteULike (see these posts). Well, to me that makes sense.

citeuliker uses jsonlite, plyr, and curl (and testthat for testing). The first converts the JSON returned by the API to a R data structure. The package unfolds the "published" field, so that I can more easily plot things by year. I use this code for that:
    data[,"year"] <- laply(data[,"published"], function(x) {
      if (length(x) < 1) return(NA) else return(x[1])
    })
The laply() method comes from the plyr package. For example, if I want to see when the publications were published that I collected in my CiteULike library, I type:
That then looks like the plot in the top-right of this post. And, yes, I have a publication from 1777 in my library :) See the reference at the bottom of this page.

Getting all the DOIs from my library is trivial too now:
    data <- citeuliker::getData(user="egonw")
    dois <- as.vector(na.omit(data[,"doi"]))
I guess the as.vector() to remove attributes can be done more efficiently; suggestions welcome.
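For readers who do not speak R, the same unfolding can be sketched in a few lines of Python. The record layout here is a mock-up based on the description above (a "published" list holding year, month, day, possibly incomplete or empty); the field names are assumptions, not the actual CiteULike API schema:

```python
import json
from collections import Counter

# Mocked-up CiteULike-style records (layout assumed for illustration).
records = json.loads('''[
  {"doi": "10.1000/x1", "published": [1777]},
  {"doi": "10.1000/x2", "published": [2010, 1, 15]},
  {"title": "no DOI, no date", "published": []}
]''')

def publication_year(record):
    """First element of 'published' is the year; None plays the role of NA."""
    published = record.get("published", [])
    return published[0] if published else None

per_year = Counter(y for y in (publication_year(r) for r in records)
                   if y is not None)
dois = [r["doi"] for r in records if "doi" in r]
print(dict(per_year), dois)
```

The Counter here is the analogue of R's table() on the unfolded year column, and the dois list mirrors the na.omit() step.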

Now, this makes it really easy to aggregate #altmetrics, because the rOpenSci people provide the rAltmetric package, and I can simply do (continuing from the above):
    library(rAltmetric)
    acuna <- altmetrics(doi=dois[6])
    acuna_data <- altmetric_data(acuna)
    acuna_data <- altmetric_data(acuna);

And then I get something like this:

Following the tutorial, I can easily get #altmetrics for all my DOIs, and plot a histogram of my Altmetric scores (make sure you have the plyr library loaded):
    raw_metrics <- lapply(dois, function(x) altmetrics(doi = x))
    metric_data <- ldply(raw_metrics, altmetric_data)
    hist(metric_data$score, main="Altmetric scores", xlab="score")
That gives me the following distribution:

The percentile statistics are also useful to me. After all, there is a clear pressure to have impact with your research, and getting your research known is a first step there. That's why we submit abstracts for orals and posters too: advertisement. Anyway, there is enough to be said about how useful #altmetrics are, and my main interest is in using them to see what people say about a paper, but I don't have time now to do anything with that (it's about time for dinner and Dr. Who).

But, as a last plot, and happy that my online presence is useful for something, here is a plot of the percentiles of my papers within the journal they were published in and within the full corpus:
      xlab="pct all", ylab="pct journal"
This is the result:

This figure shows that my social campaign puts many of my publications in the top 10. That's a start. Of course, these numbers do not map one-to-one to citations, which many value more, even though citation counts also do not reflect true impact well. Sadly, scientists commonly ignore that the citation count also includes cito:disagreesWith and cito:citesAsAuthority.

Anyways... I think I need other R packages for getting citation counts from Google Scholar, Web of Science, and Scopus.

Scheele, C. W., 1777. Chemische Abhandlung von der Luft und dem Feuer.
Mietchen, D., Others, M., Anonymous, Hagedorn, G., Jan. 2015. Enabling open science: Wikidata for research.

Sunday, August 30, 2015

Pimped website: HTML5, still with RDFa, restructuring and a slidebar!

My son did some HTML, CSS, JavaScript, and jQuery courses at Codecademy recently. Good for me: he pimped my personal website:

Of course, he used GitHub and pull requests (he had been using git for a few years already). His work:

  • fixed the columns to properly resize
  • added a section with my latest tweets
  • added menus for easier navigating the information
  • made sections fold and unfold (most are now folded by default)
  • added a slide bar, which I use to highlight some recent output
Myself, I upgraded the website to HTML5. It used to be XHTML, but it seems XHTML+RDFa is not really established yet; or, at least, there is no good validator. So, it's now HTML5+RDFa (validation report; currently one bug). Furthermore, I updated the content and gave the first few collaborators ORCID ids, which are now linked as owl:sameAs in the RDF to the foaf:Person (RDF triples extracted from this page).

Linking papers to databases to papers: PubMed Commons and Ferret

I argued earlier this year (doi:10.5281/zenodo.17892) in the Journal of Brief Ideas that measuring reuse of data and/or results in databases is a good measure of impact of that research. Who knows, it may even beat the citation count, which does not measure quality or correctness of data (e.g. you may cite a paper because you disagree with its content; I have long been, and still am, advocating the Citation Typing Ontology).

But making the link between databases and papers does not only benefit measuring reuse; it is also just critical for doing research. Without clear links, finding answers is hard. I experience that myself frequently, and so do others, like Christopher Southan, and it puzzles me that so few people worry about this. Of course, databases do a good part of the linking, but unless they expose an API (still rare, but upcoming), it is hard to use these links. PubMed Commons can be used to link to (machine readable) versions of data in a paper. See, for example, these four comments by me.

Better is when the database provides an API. And that is what Ferret uses. I have no idea where this project is going; it does not seem to be Open Source, and I am not entirely sure how they implemented the history, but the idea is interesting. Not novel, as UtopiaDocs does a similar thing. The difference is that Ferret is not a PDF reader, but works directly in your Chrome browser. That makes it more powerful, but also more scary, which is why it is critical they send a clear message about any involvement of Ferret servers, or whether everything is done locally (otherwise they can forget about (pharma) company uptake, and they'd have a hard time restoring trust). That said, their privacy policy document is already quite informative!

Last week, I asked them about their tool and if it was hard to add databases, as that is one thing Ferret does: if you open it up for a paper, it will show the databases that cite that paper (and thus likely have information or data from that paper, e.g. supplementary information). Here's an example:

This screenshot shows the results for a nanotoxicity paper, and we see it picked up "titanium oxide" (accurately picking up actual nanomaterials or nanoparticles is an unsolved text mining issue). We get some impact statistics, but if you read my blog and my brief idea about capturing reuse, I think they got "impact" wrong. Anyway, they do have a knowledge graph section, which has the paper-database links, and Ferret found this paper cited in UniProt.

Thus, I asked them whether it would be hard to add new databases to that section, and I mentioned Open PHACTS and WikiPathways. In fact, within hours they told me they had found the WikiPathways SPARQL endpoint that Andra started, which they find easier to use than the WikiPathways web services :)  They asked me for a webpage to point users to, and while I was thinking about that, they found another WikiPathways trick I did not know about: you can browse for WP2371 OR WP2059. Tina then replied that, given a PubMed ID, there was an even nicer way: just browse for all pathways with a particular PubMed ID.

Well, a bit later, they released Ferret 0.4.2 with WikiPathways support. The below screenshot shows the output for a paper (doi:10.2174/1389200214666131118234138) by Rianne (who did internships in our group, and now does her PhD in toxicology):

The Ferret infobar shows seventeen WikiPathways pathways that are linked to this paper, which happens to be the collection Rianne made during her internship leading to this paper and uploaded to WikiPathways some months ago. Earlier this year we sat down with her, Freddie, and Linda to make them more machine readable. This is what this list looks like in the browse functionality:

Ferret version 0.4.2 did not work for me, but they fixed the issue within a day, and the above screenshot was made with version 0.4.3. So, besides being a bunch of good hackers, they also seem to listen to their customers. So, what databases do you feel they should add? Leave a comment here, or tweet them at @getferret (please cc me).

Willighagen, E., Capturing reuse in altmetrics. J. Brief Ideas. May 2015. URL
Fijten, R. R. R., Jennen, D. G. J., van Delft, J. H., Dec. 2013. Pathways for ligand activated nuclear receptors to unravel the genomic responses induced by hepatotoxicants. Current Drug Metabolism, 1022-1028.

Journal of Brief Ideas: an excellent idea!

Journals, in the past, published what researchers wanted to talk about. That is what dissemination is about, of course. Like everything, over time the process has become more restricted and more bureaucratic. All for quality, of course. To accommodate and formalize the diversity in scientific communication, many journals have different article types: Letters to the Editor, Brief Communications, etc. Posting a brief idea, however, is for many journals not of enough interest.

Hence, a niche for the Journal of Brief Ideas. It's a project in beta, and may never find sustainability, but it is worth a try:

I can see why this may work:
  • you teamed up with ZENODO to provide DOIs
  • you log in with your ORCID
  • it is Open Access (CC-BY)
  • it fills the niche for ideas you will never test yourself, which would otherwise never see the light of day (so, this journal will contribute to more efficient scholarly communication)
I can also see why it may not work:
  • it is too easy to post an idea, leading to too much noise
  • it will not be indexed and will therefore not fulfill a key requirement for many scientists (WoS, etc)
  • you cannot add references like with papers
I can also see some features I would love to see:
  • bookmarking buttons for CiteULike, Mendeley, etc
  • #altmetrics output on this site
  • provide #altmetrics from this site (view statistics, etc)
  • integrate with peer review sites (for post-publication peer review)
  • allow annotation of entities in papers (like PDB, gene, protein codes, metabolite identifiers, etc; and whatever else for other scholarly domains)
Things I am not sure about:
  • allow a single ToC-like graphics (as they will give papers more coverage and more impact)
Anyway, what it needs now is momentum. It needs a business model, even if the turnover can be kept low because of good choices of technology. I am looking forward to seeing where the team is going, and how the community will pick up this idea. (For example, even though I know that some ideas are tweeted, I haven't found a donut for one of the idea DOIs yet.)

For my readers: please give it a try. You know you have that idea you would like to get some feedback on, but you know you will not have funding for it, and it does not really match your general research plans. It would be a shame to let that idea rot on the shelf. Get it out, get cited!

I tried it too; see below my brief idea as found on ZENODO (where they automatically get deposited). My experiences are a bit mixed. I like the idea, but it also takes some getting used to. The number of words is limited, and I really find it awkward not to cite prior art, the things I built on. The above points reflect a good deal of my reservations.

Friday, August 21, 2015

Internet-aided serendipity in science (was: How the Internet can help chemists with serendipity)

The ACS Central Science RSS feed in Feedly.
Finding new or useful knowledge to solve your scientific problem or question is key to research. It also is what struck me as a university student (mid-nineties) as so badly organized. In fact, technologically there was no issue, so why were scientists not using these technologies? This question is still relevant, and readers of this blog know this is a toy research area to me; I have previously experimented with a lot of technologies to see how they can support research and, well, basically, serendipity. Hence, internet-aided serendipity.

This happened to be the topic of an article by Prof. Bertozzi (@CarolynBertozzi), editor-in-chief of the gold Open Access ACS Central Science: How the Internet can help chemists with serendipity, part of the journal's website. I left a comment, which is currently awaiting moderation, but to keep the discussion on Twitter going, here is what I left (the comment on the article may turn out to have lost the formatting still present here):
    Dear Prof Bertozzi,

    the browsing of TOCs is not a lost art, and neither has the Internet solved everything. While I fully agree that Twitter and other social media have filled a niche in finding interesting literature, it is basically a kind of majority vote and does not really find you the papers relevant to your research. This extends, of course, to #altmetrics, which capture the attention on social media and allow creating TOCs on the fly, as do (good) paper bookmarking services like CiteULike. Similarly, people developed tools to find science in blog posts, like the no longer existing service that was continued/forked as Chemical blogspace (but consider that this code has not been updated in the past 2-3 years). So, creating cross-journal TOCs is a daily habit for many of us still. (BTW, will ACS Central Science fully adopt #altmetrics, as a data provider as well as by showing #altmetrics on the website?)

    Returning to the single-journal TOCs. Here, RSS feeds have proven critical, and I was happy to find an RSS feed for ACS Central Science. It is good to see that the journal's RSS feed for the ASAP papers contains for each paper the title, authors, the TOC image, and the DOI (it could also include the abstract and the ORCIDs of the authors). Better still, it could adopt CMLRSS and include InChIs, MDL molfiles, or SMILES of the chemical compounds discussed in each paper (see this ACS JCIM paper). With proper adoption of CMLRSS, chemists could define substructures and be alerted when papers are published containing chemicals with that substructure (and it does not have to stop there, as cheminformatically it is trivial to extend this to chemical reactions, or any other chemistry). After all, we don't want to miss the chemistry that sparks our inspiration!

    I personally keep track of a number of journals via RSS feeds, which I aggregate in Feedly, which filled the gap after Google Reader was closed down. Feedly does not support CMLRSS (unfortunately, but I have other tools for that) and there are a few alternatives.

    So, I hope the ACS Central Science journal will pick up your challenge and continue to support modern (well, CMLRSS was published in 2004) technologies that support your past workflows! For example, make the link to the ACS Central Science RSS feed more prominent, and write an editorial about how to use it with, for example, Feedly.

    Maastricht University
    The Netherlands
Of course, there is a lot more. It should not surprise you that the adoption of PDF and ReadCube is killing internet-aided serendipity, whereas HTML+RDF, microformats, etc., would in fact enable serendipity. Chemistry publishers do not particularly have a track record in enabling the kind of serendipity Prof. Bertozzi is looking for. The good thing is that, as editor-in-chief of an ACS journal, she can restore this serendipity, and I kindly invite her to the Blue Obelisk community to discuss how all the technologies developed in the past 15 years can help chemists. Because we have plenty of ideas. (And where is that website again aggregating chemistry journal RSS feeds...?)
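The cross-journal TOC habit described in the comment above needs nothing more than an RSS parser. Here is a sketch using only the Python standard library; the feed XML is inlined and made up for illustration (in practice you would fetch each journal's feed URL, and a CMLRSS feed would carry extra chemistry elements per item):

```python
import xml.etree.ElementTree as ET

# A fake minimal RSS feed; titles and links are invented for the example.
feed = '''<rss version="2.0"><channel>
  <title>Some Journal</title>
  <item><title>Paper A</title><link>https://doi.example/a</link></item>
  <item><title>Paper B</title><link>https://doi.example/b</link></item>
</channel></rss>'''

root = ET.fromstring(feed)
journal = root.findtext("channel/title")
# one TOC entry per item: (journal, article title, link)
toc = [(journal, item.findtext("title"), item.findtext("link"))
       for item in root.findall("channel/item")]
for j, t, l in toc:
    print(f"{j}: {t} <{l}>")
```

Aggregating several feeds is then just a matter of looping over feed URLs and concatenating the entries, which is essentially what Feedly does for me.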

Or, just browse the posts in this blog, where I have frequently written about innovation by publishers (in general; some do better than others).

Update: Other perspectives