Saturday, June 29, 2013

Breaking the power law in database access? #WikiPathways

Data access is one of those things that interests me: how can we improve the findability of data? The current state is that findability is effectively absent. Sure, some large databases are available, and I will get to those shortly, because there are problems there too. But there are also vast amounts of small data that we cannot access. FigShare and PLOS are doing something about this, but much more remains inaccessible at this moment. And then we haven't even touched on negative data, which is critical to data analysis (as well as to efficiency).

So, back to the large databases. WikiPathways is one. Not even close to being large enough (you can help, possibly during a curation jamboree, and the next one is just around the corner), but that's another story. But even within large databases, some data is found far more often than other data. For example:

That is, some pathways are explored much more often than others, and the pattern suggests something like a power law. And that is bad. Pathways accessed more often will get more attention in the future, which causes a bias. One bias I fear is that the more popular pathways will be better annotated. That in itself is not bad, but it has major implications for the data analysis that follows (and I am currently curating the metabolite annotation of pathways of interest to Open PHACTS). After all, the better annotated a pathway is, the higher the chance it will show up in pathway enrichment analyses. That is bad.
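Whether view counts really decay like a power law is easy to eyeball: a power law is a straight line in log-log space. Below is a minimal sketch in Python; the view counts are made up for illustration, as the real numbers would have to come from the WikiPathways access logs.

```python
import math

# Hypothetical monthly view counts per pathway, most-viewed first;
# real numbers would come from the WikiPathways access logs.
views = [90000, 30000, 15000, 9000, 6000, 4000, 3000, 2200, 1700, 1400]
ranks = range(1, len(views) + 1)

# A power law views ~ rank^(-a) is a straight line in log-log space,
# so fit log(views) against log(rank) by ordinary least squares.
xs = [math.log(r) for r in ranks]
ys = [math.log(v) for v in views]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

# A strongly negative fitted exponent suggests power-law-like decay.
print(f"fitted exponent: {slope:.2f}")
```

A proper analysis would use more data points and a goodness-of-fit test, but even this quick fit shows how steeply attention drops off along the tail.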

Thus, my question is basically how we can break this pattern. How can we break this power law?

Top X lists
One culprit that encourages this power law is that some pathways are "featured" in top X lists. WikiPathways has a "Most Viewed" list, which is an obvious problem. The "Most Edited" list is probably strongly correlated with it. BTW, it would be great to have access statistics available as RDF, so that we can easily analyse the link between these power laws and quality: to test the hypothesis that more edited pathways are indeed more accurate, and whether there really is a correlation between access and quality (think "given enough eyeballs, all bugs are shallow"). Or maybe it is just the pathways exposed in Wikipedia that top the ranking?
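Once such statistics are available, testing the "Most Viewed" versus "Most Edited" correlation is a one-liner's worth of work. A sketch in plain Python, with entirely made-up pathway identifiers and counts standing in for the real access statistics:

```python
# Hypothetical per-pathway (views, edits) pairs; pathway IDs and all
# numbers are invented, standing in for real WikiPathways statistics.
stats = {
    "WP254": (90000, 41), "WP428": (30000, 35), "WP111": (15000, 12),
    "WP534": (9000, 18),  "WP78":  (6000, 9),   "WP690": (4000, 3),
    "WP2355": (3000, 5),  "WP1531": (2200, 2),  "WP2371": (1700, 4),
    "WP2361": (1400, 1),
}

def ranks(values):
    # Rank from 1 (largest) downwards; ties are ignored in this sketch.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

views, edits = zip(*stats.values())
view_ranks, edit_ranks = ranks(views), ranks(edits)

# Spearman's rho via the rank-difference formula (assumes no ties).
n = len(view_ranks)
d2 = sum((a - b) ** 2 for a, b in zip(view_ranks, edit_ranks))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(f"Spearman rho between view rank and edit rank: {rho:.2f}")
```

With RDF access statistics, the `stats` dictionary would simply be the result of a query instead of hand-typed numbers.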

So, how can we make the long tail more accessible? The portals may be doing a good job, and the new Plant Portal will certainly help with making those pathways more accessible (doi:10.1186/1939-8433-6-14). But again, only for the pathways that have been shortlisted for that portal.

Meta Databases
Making pathways accessible via the biological entities they contain is another candidate. This is one of the things the NCBI BioSystems Database (doi:10.1093/nar/gkp858) enables:

But since this database already includes WikiPathways, and we still see the power law... Anyway, from a metabolite perspective this is nice: WikiPathways pathways are visible from PubChem (along with pathways from other resources available via BioSystems):

Other options?
What other options exist to break this power law in data content access? How can we more effectively expose the long tail? Your ideas are most welcome!

Tuesday, June 18, 2013

Minting RDF from CSV files with Bioclipse #MIILS2013

The slides below are part of the introduction to the hands-on session for the #MIILS2013 course this afternoon. I had the participants look at creating RDF the hard way: using a Bioclipse script (with this Bioclipse-OpenTox version). And they had to follow the Open PHACTS RDF Guidelines, the VoID specification, etc.
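The course exercise used Bioclipse scripts, but the core idea of minting RDF from a CSV file can be shown in a few lines of any language. A minimal Python sketch, with a hypothetical two-column input and an invented `ex:` example namespace (only `dc:title` is a real vocabulary term here):

```python
import csv
import io

# Hypothetical input: a two-column CSV mapping compound names to SMILES.
# The actual course material used Bioclipse scripts, not Python.
raw = """name,smiles
methane,C
ethanol,CCO
"""

# Mint one resource per row under a made-up example namespace,
# emitting Turtle by simple string formatting.
prefix = ("@prefix ex: <http://example.org/compound/> .\n"
          "@prefix dc: <http://purl.org/dc/elements/1.1/> .\n")
triples = []
for i, row in enumerate(csv.DictReader(io.StringIO(raw))):
    s = f"ex:{i}"
    triples.append(f'{s} dc:title "{row["name"]}" .')
    triples.append(f'{s} ex:smiles "{row["smiles"]}" .')

turtle = prefix + "\n".join(triples) + "\n"
print(turtle)
```

A real workflow would use an RDF library to handle escaping and datatypes, and would follow the Open PHACTS RDF Guidelines for the actual predicates; this sketch only shows the CSV-to-triples mechanics.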

Friday, June 14, 2013

The Death of the Single Pass Peer Review

One key aspect of peer review is to ensure that the paper is scientifically sound. The publishing system therefore asks peers to review the paper and test that. Part of this is to see whether relevant literature is cited, backing up key findings around the hypothesis addressed in the paper. After all, the research must be novel.

However, there is so much literature nowadays in so many different journals, a phenomenon initiated by the large publishing industries, that it has become hard to keep track of things. No worries, these same industries have come up with tools to handle that. A second problem is that research has become so interdisciplinary that even limiting yourself to a few journals whose papers you can actually read easily is hard.

Consequently, it is easy to miss even key papers. Let him who reads all relevant, important literature cast the first tweet. Thus, the peer reviewers jump in and make you aware if you missed something critical.

Now, these peers actually have the same issue. Worse, because it is hard to find good reviewers, we settle for post-docs and peers who are not specialists in the topic of the paper. Of course, we'll find plenty of reasons why this is good. But the bottom line is that even peer review fails to address all critical points in the manuscript. And, as a result, we can all point to literature in the glossies where important aspects have been neglected (let him who never found a "high impact" paper without flaws cast the first post), as well as to the recent increase in the number of retractions.

Therefore, what the publishing community needs is to admit that the current approach is no longer sufficient. It worked well for some 40-50 years (we did not need it before then), but it no longer scales with the output. I am not implying that peer review is bad, but the current single pass peer review implementation is.

Instead, I call on all publishers to step away from the current implementation, and adopt a multistep peer review process. Possible approaches include:

  1. only accept papers that have appeared in a pre-print server
  2. implement a two-pass system, with a quick first review checking whether key literature has been discussed, covering just the introduction
  3. open peer review
I know that some of these solutions are being experimented with. All I ask is for publishers to strongly support this, and to demand that editors and authors do the same.

Why? Because your journal quality will actually improve.