Sunday, December 13, 2015

SWAT4LS in Cambridge

Wordle of the #swat4ls tweets.
Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meeting in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

I have mixed feelings about missing half of the workshops on Monday for a visit to one of our Open PHACTS partners, but do not regret that meeting at all; I just wish I could have done both. During the visit we spoke particularly about WikiPathways and our collaboration in this area.

The Monday morning workshops were cool. First, Evan Bolton and Gang Fu gave an overview of their PubChemRDF work. I have been involved in that in the past, and I greatly enjoyed seeing the progress they have made, and a rich overview of the 250GB of data they make available on their FTP site (sorry, the rights info has not gotten any clearer over the years, but it is generally considered "open"). The RDF now covers, for example, the biosystems module too, so that I can query PubChem for all compounds in WikiPathways (and compare that against internal efforts).

The second workshop I attended was by Andra and others about Wikidata. The room, about 50 people, all started editing Wikidata, in exchange for a chocolate letter:

The editing was about the prevalence of two diseases. Both topics continued during the hackathon, see below. Slides of this presentation are online. But I missed the DisGeNET workshop, unfortunately :(

The conference itself (in the new part of Clare College, even the conference dinner) started on the second day, and all presentations are backed by a paper, linked from the program. Not having attended a semantic web conference in the past two-ish years, it was nice to see the progress in the field. Some papers I found interesting:
But the rest is well worth checking out too! I was able to get Webulous going with some help (I had not been paying enough attention to the GUI) for eNanoMapper:

A Google Spreadsheet where I restricted the content of a set of cells to only subclasses of the "nanomaterial" class in the eNanoMapper ontology (see doi:10.1186/s13326-015-0005-5).
The conference ended with a panel discussion, and despite the efforts of me and the other panel members (Frank Gibson – Royal Society of Chemistry, Harold Solbrig – Mayo Clinic, Jun Zhao – University of Oxford), it took long before the conference audience really started joining in. Partly this was because the conference organization asked the community for questions, and the questions clearly did not resonate with the audience. It was not until we started discussing publishing that it became more lively. My point there was that I believe the semantic web applications and tools are not really a rate-limiting factor anymore, and if we really want to make a difference, we really must start changing the publishing industry. This has been said by me and others for many years already, but the pace at which things change is too low. Someone mentioned a chicken-and-egg situation, but I really believe it is all just a choice we make, with an easy solution: pick up a knife, kill the chicken, and have a nice dinner. It is annoying to see all the great efforts at this conference, with much of it limited because our writing style makes nice stories and yields few machine-readable facts.

The hackathon was held at the EBI in Hinxton (south/elixir building) and during the meeting I had a hard time deciding what to hack on: there just were too many interesting technologies to work on, but I ended up working on PubChem/HDT (long) and Wikidata (short). The timings are based on the amount of help I needed to bootstrap things and how much I can figure out at home (which is a lot for Wikidata).

HDT (header, dictionary, triple) is a not-so-new-but-under-the-radar technology for storing triples in binary form in a file-based store. The specification outlines this binary format as well as the index. That means that you can share triple data compressed and indexed. That opens up new possibilities. One thing I am interested in is using this approach for sharing link sets (doi:10.1007/978-3-319-11964-9_7) for BridgeDb, our identifier mapping platform. But there is much more, of course: share life science databases on your laptop.
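To give a feel for the "dictionary" and "triple" parts, here is a toy sketch in Python. It is not the real HDT binary format (which is considerably more elaborate), and the `ex:` terms and function names are made up for the illustration. It only shows the general idea: store each distinct term once, rewrite the triples as sorted tuples of integer IDs, and use that ordering to look up one subject without scanning everything.

```python
from bisect import bisect_left
from typing import List, Tuple

Triple = Tuple[str, str, str]
IdTriple = Tuple[int, int, int]


def encode(triples: List[Triple]) -> Tuple[List[str], List[IdTriple]]:
    """Build a sorted term dictionary and rewrite triples as sorted ID tuples."""
    # Dictionary: every distinct term is stored once; its position is its ID.
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms)}
    # Triples: compact integer tuples, sorted by subject, predicate, object.
    encoded = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    return terms, encoded


def triples_for_subject(
    terms: List[str], encoded: List[IdTriple], subject: str
) -> List[Triple]:
    """Return all triples with the given subject, using binary search only."""
    sid = bisect_left(terms, subject)
    if sid == len(terms) or terms[sid] != subject:
        return []  # term does not occur in the data at all
    # All triples for this subject form one contiguous run in the sorted list.
    start = bisect_left(encoded, (sid,))
    end = bisect_left(encoded, (sid + 1,))
    return [(terms[s], terms[p], terms[o]) for s, p, o in encoded[start:end]]
```

The point of the sorted layout is that the "index" is mostly just the ordering itself, which is why such a file can be shared as is and queried without first loading it into a triple store.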

This hack was triggered by a beer with Evan Bolton and Arto Bendiken. Yes, there is a Java library, hdt-java, and for me the easiest way to work out how to use a Java API is to write a Bioclipse plugin. Writing the plugin is trivial, because the New Wizard does the hard work in seconds, though setting up a Bioclipse development environment is less so. But then started the dependency hacking. The Jena version it depended on is incompatible with the version in Bioclipse right now, but that is not a big deal for Eclipse, and the outcome is that we have both versions on the classpath :) That, however, did require me to introduce a new plugin, net.bioclipse.rdf.core with the IRDFStore, something I wanted to do for a long time, because that is also needed if one wants to use Sesame/OpenRDF instead of Jena.

So, after lunch I was done with the code cleanup, and I got to the HDT manager again. Soon, I could open an HDT file. I first used the API method that reads it into memory, but that's not what I wanted, because I want to open large HDT files. Because it uses Jena, it conveniently provides a Jena Model object, so adding SPARQL-ing support was easy; I cannot use the old SPARQL-ing code, because then I would start mixing Jena versions, but since all is Open Source, I just copied/pasted the code (which was written by me in the first place, doi:10.1186/2041-1480-2-s1-s6; interestingly, work that originates from my previous SWAT4LS talk :). Then, I could do this:
It is file based, which is different from a full triple store server. So, questions arise about performance. Creating an index takes time and memory (1GB of heap space, for example). However, the index file can be shared (downloaded) and then an HDT file "opens" in a second in Bioclipse. Of course, the opening does not do anything special, like loading into memory, and should be compared to connecting to a relational database. The querying is what takes the time. Here are some numbers for the Wiktionary data that the RDFHDT team provides as an example data set:
However, I am not entirely sure what to compare this against. I will have to experiment with, for example, ChEMBL-RDF (maybe update the Uppsala version, see doi:10.1186/1758-2946-5-23). The advantage would be that ChEMBL data could easily be distributed along with Bioclipse to service the decision support features, because the typical query asks for data for a particular compound, not all compounds. If that works within less than 0.1 seconds, then this may give a nice user experience.

But before I reach that, it needs a bit more hacking:
  1. take the approach I took with BridgeDb mapping databases for sharing HDT files (which has the advantage that you get a decent automatic updating system, etc)
  2. ensure I can query over more than one HDT file
And probably a bit more.

Wikidata and WikiPathways
After the coffee break I joined the Wikidata people, and sat down to learn about the bots. However, Andra wanted to finish something else first, where I could help out. Considering I will probably manage to hack up a bot anyway, we worked on the following. Multiple databases about genes, proteins, and metabolites like to link these biological entities to pathways in WikiPathways (doi:10.1093/nar/gkv1024). Of course, we love to collaborate with all the projects that integrate WikiPathways into their systems, but I personally would rather use a solution that services all needs. If only because then people can do this integration without needing our time. Of course, this is an idea we pitched about a year ago in the Enabling Open Science: WikiData for Research proposal (doi:10.5281/zenodo.13906).

That is, would it not be nice if people could just pull the links between the biological entities and WikiPathways from Wikidata, using one of the many APIs they have (SPARQL, REST), supporting multiple formats (XML, JSON, RDF)? I think so, as you might have guessed. So does Andra, and he asked me if I could start the discussions in the Wikidata community, which I happily did. I'm not sure about the outcome, because having links like these is not of prime interest to them: so far they did not like the idea of links to the Crystallography Open Database much either, with the argument that it is a one-to-many relation, though this is exactly what the PDB identifier is too, and that is accepted. So, it's a matter of notability again. But this is what the current proposal looks like:

Let's see how the discussion unfolds. Please feel free to chime in and show your support, comments, questions, or opposition, so that we can together get this right.
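To make the "just pull the links" idea a bit more concrete, here is a minimal sketch in Python of what such a request could look like. Everything specific in it is an assumption: the property was still being proposed, so `P0000` is only a placeholder ID, the function names are invented, and I am assuming the public Wikidata SPARQL endpoint at https://query.wikidata.org/sparql. The sketch only builds the query and the request URL; it does not send anything.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint of the public Wikidata query service.
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def pathway_link_query(property_id: str, limit: int = 10) -> str:
    """Build a SPARQL query listing items that carry the given property."""
    # Only accept IDs of the form P<digits>, so nothing odd ends up in the query.
    if not re.fullmatch(r"P\d+", property_id):
        raise ValueError("expected a Wikidata property ID like 'P123'")
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?item ?pathway WHERE {\n"
        f"  ?item wdt:{property_id} ?pathway .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def query_url(property_id: str, limit: int = 10) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    params = {"query": pathway_link_query(property_id, limit), "format": "json"}
    return WIKIDATA_SPARQL + "?" + urlencode(params)
```

Calling `query_url("P0000")` with the real property ID, once one exists, would give a URL that any HTTP client can fetch, which is the whole point: nobody would need our time to do the integration.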

Chemistry Development Kit
There is undoubtedly a lot more, but I have been summarizing the meeting for about three hours now, getting notes together etc. A last thing I want to mention now is the CDK. Cheminformatics is, after all, a critical feature of life science data, and I spoke with a few people about the CDK. And I visited NextMove Software on Friday where John May works nowadays, who did a lot of work on the CDK recently (we also spoke about WikiPathways and eNanoMapper). NextMove is doing great stuff (thanks for the invitation), and so did John during his PhD in Chris Steinbeck's group at the EBI. But during the conference I also spoke with others about the CDK and will be following up on these conversations.