Sunday, March 22, 2015

"What You're Doing Is Rather Desperate"

Self-portrait by Gustave Courbet.
Source: Wikimedia Commons.
One blog that I have been (occasionally but repeatedly) reading for a long time is What You're Doing Is Rather Desperate by Neil Saunders. HT to WoW!ter for pointing me to this nice post where Saunders shows how to calculate the number of entries in PubMed marked as retracted (in two separate ways). The spoiler (you should check his R scripts anyway!): about 3900 retractions.
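For a flavour of how such a count can be obtained (this is not Saunders' R code, just a minimal JavaScript sketch against NCBI's standard E-utilities esearch endpoint; PubMed does have a "Retracted Publication" publication type, but do check the E-utilities documentation before relying on this):

```javascript
// Build an NCBI E-utilities esearch URL that counts PubMed entries
// tagged with the "Retracted Publication" publication type.
function retractionCountUrl() {
  var base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi";
  var params = [
    "db=pubmed",
    "term=" + encodeURIComponent('"retracted publication"[pt]'),
    "rettype=count",
    "retmode=json"
  ];
  return base + "?" + params.join("&");
}
```

Fetching that URL returns a JSON document whose count field is the number of retracted entries.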

I have been quite interested in this idea of retractions and have had recent discussions with Chris (Christopher: mostly offline this time, and even some over a beer :) about whether retractions are good or bad (BTW, check his inaugural speech on YouTube). He points out that retractions are not always made for the right reasons and, probably worse, have unintended consequences. An example he gave is a senior researcher with two positions: in one lab someone misbehaved, and that institution found no misconduct by this senior researcher; his other affiliation, however, disagreed and fired him.

Five years ago I would have said that any senior researcher on a paper should still understand things in detail, and that if misconduct is found, the senior authors are to blame too. I still believe this is the case, that is what your co-researchers are for, but feeling the pressure of publishing enough and simply not having enough time, I do realize I cannot personally reproduce all results of my post-docs and collaborators. But we all know that publishing has made a wrong turn and, yes, I am trying to make it return to a better default.

Why I am interested in retractions
But that is not why I wanted to blog and discuss Saunders' post. Instead, I want to explain why I am interested in retractions and in another blog you should check out, Retraction Watch. In fact, I am very much looking forward to their database! However, this is not because of the blame game. Not at all.

Instead, I am interested in noise in knowledge. Obviously, because it greatly affects my chemblaics research. In particular, I like to reduce noise or at the very least take appropriate measures when doing statistics, as we have plenty of means to deal with noise (like cancelling it out). Now, are retractions then an appropriate means to find incorrect, incomplete, or just outright false knowledge? No. But there is nothing better.

There are better approaches: I have long advocated, and still advocate, the Citation Typing Ontology (CiTO), though I have to admit I am not up to date with David Shotton's work. CiTO allows one to annotate whether a citing paper disagrees or agrees with the cited paper, and perhaps even uses its knowledge. It can also mark a citation as merely being included because the cited paper has some authority (expect many of those to Nature and Science papers).

But we have a long way to go before using CiTO becomes a reality. If interested, please check out the CiteULike support and Shotton's OpenCitations.
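To give an impression of what such typed citations look like in practice, here is a minimal sketch that emits a CiTO annotation as an N-Triples line; the DOIs are placeholders, and the exact choice of property should of course be checked against the CiTO specification:

```javascript
// The CiTO namespace; properties such as agreesWith, disagreesWith,
// and citesAsAuthority type the relation between citing and cited work.
var CITO = "http://purl.org/spar/cito/";

// Emit one N-Triples statement typing a citation between two DOIs.
function citoTriple(citingDoi, property, citedDoi) {
  return "<http://dx.doi.org/" + citingDoi + "> " +
         "<" + CITO + property + "> " +
         "<http://dx.doi.org/" + citedDoi + "> .";
}

// Hypothetical example: paper A disputes the findings of paper B.
var triple = citoTriple("10.1000/a", "disagreesWith", "10.1000/b");
```

Statements like these, aggregated over many papers, are exactly what would let us trace which reported facts the community still stands behind.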

What does this mean for databases?
Research depends on data: some you measure, some you get from the literature and, increasingly, from databases. The latter, for example, to compare your own results with other findings. It is indeed helpful that databases provide these two functions:
  1. provide means to find similar experiments
  2. provide a gold standard of true knowledge
These databases take two different approaches: the first presents the data as reported, or even better as raw data (as it came from the machine, unprocessed, though increasingly the machines themselves already do processing to the best of their knowledge); the second filters out true facts, possibly normalizing data along the way, e.g. by correcting obvious typing and drawing errors.

Indeed, databases can combine these features, just like PubChem and ChemSpider do for small compounds. PubChem takes the explicit approach of providing both the raw input from sources (the substances, with SIDs) and the normalized result (the compounds, with CIDs).
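That substance-to-compound mapping is even exposed programmatically; a minimal sketch using PubChem's PUG REST interface (check the PUG REST documentation for the authoritative paths):

```javascript
// Build a PubChem PUG REST URL that maps a substance (SID, the raw
// depositor record) to its normalized compound(s) (CID).
function sidToCidUrl(sid) {
  return "https://pubchem.ncbi.nlm.nih.gov/rest/pug/substance/sid/" +
         sid + "/cids/JSON";
}
```

Fetching the resulting URL returns JSON listing the CID(s) the depositor record was normalized to.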

But what if the original paper turns out to be wrong? There are (at least) two questions:
  1. can we detect when a paper turns out wrong?
  2. can we propagate this knowledge into databases?
The first clearly reflects my interest in CiTO and retractions. We must develop means to filter out all the reported facts that turn out to be incorrect. And can we efficiently keep our thousands of databases clean (there are many valid approaches!)? Do retractions matter here? Yes, because research in so-called higher impact journals also sees more retractions (whatever the reason for that correlation is); see this post by Bjoern Brembs.

Where we must be heading
What the community needs to develop in the next few years are approaches to propagate knowledge about what is correct and what is incorrect. That is, high-impact knowledge must enter databases quickly, e.g. the exercise myokine irisin, but so must the fact that it was recently shown that it very likely doesn't exist, or at least that the original paper most likely measured something else (doi:10.1038/srep08889). Now, this is clearly a high-profile "retraction" of facts that few of us will have missed. Right? And that's where the problem is: the tail of disproven knowledge is very long, and we cannot rely on such facts propagating quickly if we do not use tools to help us. This is one reason why my attention turned to semantic technologies, with which contradictions can be found more easily.
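As a toy illustration of how typed statements make contradictions machine-findable, here is a sketch that scans citation statements for claims that are both supported and disputed (the data structure is invented for this example, not any existing tool's format):

```javascript
// Given statements of the form {paper, relation, target}, report the
// targets that are both agreed with and disagreed with by some paper --
// exactly the candidates that deserve a closer look.
function findContradictions(statements) {
  var supported = {}, disputed = {};
  statements.forEach(function (s) {
    if (s.relation === "agreesWith") supported[s.target] = true;
    if (s.relation === "disagreesWith") disputed[s.target] = true;
  });
  return Object.keys(supported).filter(function (t) {
    return disputed[t];
  });
}
```

Run over a large corpus of CiTO-style annotations, a filter like this would surface the long tail of disproven knowledge automatically.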

But I am getting rather desperate about all the badly annotated knowledge in databases, and I am also getting desperate about being able to make a change. The research I'm doing may turn out rather desperate.

Monday, March 16, 2015

Ambit.js: JavaScript client library for the eNanoMapper API technical preview

eNanoMapper has passed its first year, and an interesting year it has been! I enjoyed the collaboration with all partners very much, and also the project's Open character. Just check our GitHub repository or our Jenkins build server.

Just this month, the NanoWiki data on Figshare was released, and I just got around to uploading ambit.js to GitHub. This library is still in development and should likewise be considered a technical preview. This JavaScript client library, inspired by Ian Dunlop's ops.js for Open PHACTS, allows visualization of data in the repository in arbitrary web pages, using jQuery and d3.js.

The visualization on the right shows the distribution of nanomaterial types in the preview server (based on AMBIT and the OpenTox API), which contains various data sets (others by IDEA and NTUA), including the above-mentioned NanoWiki knowledge base that I started in Prof. Fadeel's group at Karolinska Institutet. This set makes up about three quarters of all data, effectively everything except the orange 'nanoparticle' slice and the blue section due north. You can see I collected mostly data for metal oxides and some carbon nanotubes (though I did not digitize a lot of biological data for the latter).
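The pie chart itself is drawn with d3.js; the counting behind it boils down to aggregating the material types into label/count pairs, along these lines (the field name is invented for this sketch, not ambit.js's actual data model):

```javascript
// Count how often each nanomaterial type occurs, producing the
// {label, count} pairs a d3.js pie layout takes as input.
function countTypes(materials) {
  var counts = {};
  materials.forEach(function (m) {
    counts[m.type] = (counts[m.type] || 0) + 1;
  });
  return Object.keys(counts).map(function (type) {
    return { label: type, count: counts[type] };
  });
}
```

The resulting array feeds straight into a d3 pie layout keyed on the count field.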

But pie charts only work on pi days, so let's quickly look at another application: summarizing the size distribution:

Or what about the zeta potentials? Yes, the suggestion to make this a scatter plot against the pH at which the potential was measured is already under consideration :)

Do you want to learn more? Get in contact!

Friday, March 13, 2015

Google Code closing down: semanticchemistry and chemojava

Google Code is closing down. I had a few projects running there and participated in semanticchemistry, which hosts the CHEMINF ontology (doi:10.1371/journal.pone.0025513), and also in ChemoJava, an extension of the CDK with GPL-licensed bits.

Fortunately, they have an exporter which automates the migration of a project to GitHub, and this is now in progress. Because the exporter is seeing a heavy load, they warn about export times of up to twelve hours! The "0% complete" is promising, however :)

For the semanticchemistry project, I have asked the other people involved where we want to host it, as GitHub is just one of the options. A copy does not hurt anyway, but the one I am currently making may very well not become the new project location.

PubMed Commons
When you migrate your own projects and have published work referring to a Google Code project page, please also leave a comment on PubMed Commons pointing readers to the new project page!

Tuesday, February 24, 2015

Migrating to CDK 1.5: SSSR and two other ring sets

I arrived a bit early at the year 1 eNanoMapper meeting so that I could sit down with Nina Jeliazkova to work on migrating AMBIT2 to CDK 1.5. One of the big new things is the detection of ring systems, which John gave a major overhaul. Part of that is that the SSSRFinder is now retired to the legacy module. I haven't even gone into all the new goodies of this myself, but one of the things we needed to look at is how to replace the use of SSSRFinder for finding the SSSR set. So, I checked whether I understood the replacement code by taking the SSSRFinderTest and replacing the code in those tests with the new code, and all seemed to work as I expected.

So, here are a few examples of code snippets you need to replace. Previously, your code would have something like:

IRingSet ringSet = new SSSRFinder(someAtomContainer).findSSSR();

This code is replaced with the following (the Cycles class lives in the org.openscience.cdk.graph package):

IRingSet ringSet = Cycles.sssr(someAtomContainer).toRingSet();

Similarly, you could have had something like:

IRingSet ringSetEssential =
  new SSSRFinder(buckyball).findEssentialRings();
IRingSet ringSetRelevant =
  new SSSRFinder(buckyball).findRelevantRings();

This is changed to:

IRingSet ringSetEssential = Cycles.essential(buckyball).toRingSet();
IRingSet ringSetRelevant = Cycles.relevant(buckyball).toRingSet();

Sunday, January 11, 2015

Programming in the Life Sciences #21: 2014 Screenshots #1

December saw the end of this year's PRA3006 course (aka #mcspils). Time to blog some screenshots of the student projects. Like last year, the aim is to use the Open PHACTS API to collect data with ops.js, which should then be visualized in an HTML page, preferably with d3.js. This year, all projects reached that goal.

ACE inhibitors
The first team (Mischa-Alexander and Hamza) focused on the ACE inhibitors (type: "drug class") and pathway WP554 from WikiPathways. They use a tree structure to list inhibitors along with their activity:

The source code for this project is available under an MIT license.

The second team (Catherine and Moritz) looked at compounds hitting diabetes mellitus targets. They take advantage of the new disease API methods: first they ask for all targets for the disease, and then query for all compounds. Mind you, the compounds are not filtered by activity, so not every interaction shown reflects a real target.

This project too is available under the MIT license.

The third project (Nadia and Loic) also goes from disease to targets; they looked at tuberculosis.

Asynchronous calls
If you know ops.js, d3.js, and JavaScript a bit, you know that these projects are not trivial. The remote web service calls are made in an asynchronous manner: each call comes with a callback function that gets called when the server returns an answer, at some future point in time. Therefore, if you want to visualize, for example, compounds with activities against targets for a particular disease, you need two web service calls, with the second made in the callback function of the first. Now, try to globally collect the data from that with JavaScript and HTML, and make sure the visualization is triggered only when all information has been collected! But even without that, the students need to convert the returned web service answer into a format that d3.js can handle. In short: quite a challenge for six practical days!
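The pattern the students had to master can be sketched as follows, with the two Open PHACTS calls mocked out (the real calls go through ops.js; the function names here are invented for this sketch):

```javascript
// Mocked web service calls: each invokes its callback asynchronously,
// just like the real Open PHACTS calls made via ops.js.
function getTargets(disease, cb) {
  setTimeout(function () { cb(["target1", "target2"]); }, 0);
}
function getCompounds(target, cb) {
  setTimeout(function () { cb([target + "-cmp1", target + "-cmp2"]); }, 0);
}

// Chain the calls: fetch the targets first, then the compounds per
// target inside the first callback, and only hand the collected data
// to the visualization once every pending callback has reported back.
function collectCompounds(disease, visualize) {
  getTargets(disease, function (targets) {
    var all = [], pending = targets.length;
    targets.forEach(function (target) {
      getCompounds(target, function (compounds) {
        all = all.concat(compounds);
        if (--pending === 0) visualize(all); // all answers are in
      });
    });
  });
}
```

The countdown counter is the key trick: without it, the visualization would fire before the slower callbacks have delivered their data.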