Saturday, June 29, 2013

Breaking the power law in database access? #WikiPathways

Data access is one of those things that interests me: how can we improve the way data is found? The current state is that it is effectively absent. Sure, some large databases are available, and I will get to that shortly, because there are problems there too. But there are also vast amounts of small data that we cannot access. FigShare and PLOS are doing something about this, but there is so much more that is inaccessible at this moment. And then we haven't even touched on negative data, which is critical to data analysis (as well as efficiency).

So, back to the large databases. WikiPathways is one. Not even close to being large enough (you can help, possibly during a curation jamboree, and the next one is just around the corner), but that's another story. But even for large databases, some data is found more often than other data. For example:

That is, some pathways are explored much more often than others, and the pattern suggests something like a power law. And that is bad. More accessed pathways will get more attention in the future. This causes a bias. One such bias I fear is that the more popular pathways will be better annotated. That in itself is not bad, but it has major implications for the data analysis that follows (and I am now curating the metabolite annotation of pathways of interest to Open PHACTS). After all, the better annotated a pathway is, the higher the chance it will show up in pathway enrichment analyses. That is bad.
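
To make "something like a power law" a bit more concrete: if views fall off as a power of the popularity rank, then log(views) against log(rank) is roughly a straight line. Below is a minimal sketch of that check in plain Python. The function name and the example counts are made up for illustration and are not WikiPathways statistics, and a least-squares fit on log-log axes is only a rough indication, not a proper statistical test.

```python
import math


def fit_rank_frequency_slope(view_counts):
    """Return the least-squares slope of log(count) against log(rank).

    A slope near a constant negative value over the whole range is
    what a power-law-like rank/frequency pattern would look like.
    """
    # Rank 1 is the most viewed item; zero counts cannot be log-transformed.
    counts = sorted((c for c in view_counts if c > 0), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two positive counts")

    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical view counts that fall off as 1/rank (illustration only).
example_counts = [1000 / rank for rank in range(1, 51)]
slope = fit_rank_frequency_slope(example_counts)
print(f"log-log slope: {slope:.3f}")
```

With real access statistics one would feed in the per-pathway view counts and look at both the slope and how well a straight line actually describes the data.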

Thus, my question is basically how we can break this pattern. How can we break this power law?

Top X lists
One culprit that encourages this power law is that some pathways are "featured", the top X list. WikiPathways has a "Most Viewed" list, which is an obvious problem. The "Most Edited" list is probably highly correlated with it. BTW, it would be great to have access statistics available as RDF, so that we can easily analyse the link between these power laws and quality: to test the hypothesis that more edited pathways are indeed more accurate, and whether there indeed is a correlation between access and quality (think "given enough eyeballs, all bugs are shallow"). Or maybe it is just the pathways exposed in Wikipedia that top the ranking?
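
If such statistics were available, a first test of "Most Viewed correlates with Most Edited" could be a rank correlation. The sketch below computes Spearman's rank correlation with the standard library only. The pathway names and numbers are placeholders I made up, not real WikiPathways data, and how the counts would be retrieved is left open.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Placeholder statistics: name -> (views, edits). Not real data.
stats = {
    "pathway_a": (5200, 140),
    "pathway_b": (3100, 95),
    "pathway_c": (800, 40),
    "pathway_d": (450, 55),
    "pathway_e": (120, 10),
}
views = [v for v, _ in stats.values()]
edits = [e for _, e in stats.values()]
rho = spearman(views, edits)
print(f"Spearman correlation between views and edits: {rho:.2f}")
```

A rank correlation is used here because the counts are so skewed; it only asks whether the ordering of pathways by views matches the ordering by edits. It says nothing yet about quality, which would need its own measure.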

So, how can we make the long tail more accessible? The portals may be doing a good job, and the new Plant Portal will certainly help with making those pathways more accessible (doi:10.1186/1939-8433-6-14). But again, that only covers the pathways that have been shortlisted for that portal.

Meta Databases
Making pathways accessible via the biological entities is another candidate. This is one of the things the NCBI BioSystems Database (doi:10.1093/nar/gkp858) enables:

But since this database already contains WikiPathways, and we still see the power law... anyway, from a metabolite perspective this is nice, and WikiPathways are visible from PubChem (along with pathways from other resources available via BioSystems):

Other options?
What other options exist to break this power law in data content access? How can we more effectively expose the long tail? Your ideas are most welcome!