Saturday, July 02, 2011

The KEGG subscription model

KEGG's primary funding ran out, and they decided to go for a subscription model, as you will likely have picked up by now. KEGG has been used heavily by many people, probably in large part because it used to be freely available. But KEGG is not Open Data, and this will slowly be realized by lots of biologists and bioinformaticians who will now have to pay anywhere from 2000 to 5000 dollars.

The rationale is simple: (1) funding ran out; (2) curation is expensive; (3) money is needed for continued evolution of the data. The next year will be very interesting for a number of reasons, some of which are like seeing the GPL being taken to court.

First of all, it is of utmost importance that the subscription supports the future development of KEGG, not the hosting. In fact, many have made a copy of relevant bits from the FTP site before it closed down. This data may not be shared, but it is being shared nevertheless. That takes us to my second observation: KEGG data is all around. Many sites are already redistributing the data (and have been for some years), such as Bio2RDF and Chem2Bio2RDF, which provide that data via a SPARQL endpoint or otherwise. In fact, there are still dozens of places where you can download the KEGG data freely (as in free beer!).

Closely related to this is that multiple independent academic groups are using KEGG data, and have set up new metabolism-related websites, including the Human Metabolic Atlas, BioMeta, and many, many more. On top of that, there are many alternative databases which provide the same kind of information, and which will attract the lurking bioinformatician who does not have 2000 dollars to run a quick pathway enrichment test. That is, KEGG is in a market with a lot of competition. Still, the KEGG brand is strong, and could be enough for a vendor lock-in effect. (Group leaders may say "WTF, why did you not use KEGG instead of this beta-brand database? Nature will never accept that!!". Of course, you could attempt starting a discussion about data quality, validation, BioMeta, ... good luck with that :) )

What can KEGG do about protecting their IP? Well, as they never formally gave permission to redistribute the data, they might go after competing efforts which have used KEGG data. Will KEGG? I do not know; I hope not, because the bioinformatics community will probably object, driving people away from KEGG instead. Accept the situation then? I do not think that will work either, because lurking is just a sad fact of life science informatics: people take easily, but contributing back takes an effort.

What I hope will happen, and that is probably what KEGG is anticipating, is that all those derived databases will in fact take a license, though I have to say 5000 dollars is not much then, nor have I read anything saying that such a license would allow these derived databases to redistribute the data.

My personal preference is an Open Data approach, where KEGG will work together with the other databases. However, political forces may be inhibiting this. How large is the chance that the Human Metabolic Atlas will drop their brand and join a KEGG consortium? How large is the chance that existing efforts will agree on a license?

Another thing that might happen is that KEGG will slowly disappear from the scene. Maybe people will realize that Open Data is in fact an important way to simplify international collaborations. Maybe Open projects like WikiPathways will now be preferred. Maybe we will see an Open Data KEGG commons, with branded web interfaces around this. The time is right. Open Access is booming, and Open Data is up next, and high on the list too. The question is how soon the biologists and bioinformaticians will follow. Open Source, after all, is mostly liked by these groups because of the free beer, not because of its free speech character.