Tuesday, December 21, 2010

Commercial or Proprietary?

Khanna and Ranganathan wrote up a review paper on molecular similarity (doi:10.1002/ddr.20404). I have not fully read it yet, but my eye fell on Table 1, which lists a number of programs, both open source and proprietary, that can be used to calculate QSAR descriptors. However, the table features a column Availability, which has two options: Public and Commercial. They qualify Bioclipse, CDK, and RDKit as public, and Dragon, MOE, CODESSA and others as commercial. Effectively, it seems they are classifying the programs as open source versus commercial, though I am not entirely sure what they mean by public.

The authors and referees would not be the first to make this common mistake: failing to distinguish between two orthogonal axes, free-versus-commercial and open source-versus-proprietary. To clarify these axes, I created this diagram (CC-BY, SVG source available upon request):

It is very important to realize that Open Source software can be commercial. For example, you can get commercial support for Bioclipse and CDK from GenettaSoft. It is also really important to realize that free software (public?) is not necessarily Open Source (or vice versa). E-Dragon is an example here: you can use it freely, but the source code is proprietary. Some years after open source cheminformatics took off, commercial providers started to offer free-for-academic-use packages, which fit into this category too.
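For those who prefer code over diagrams, here is a minimal sketch of the two axes as two independent flags. The `Tool` class, the `quadrant` function and the labels are made up for this illustration. The Dragon entry is an assumption on my side (closed source and paid), based only on the "commercial" label discussed above. The "CDK + paid support" entry is a label for the support example, not a real product name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    open_source: bool  # axis 1: is the source code open?
    gratis: bool       # axis 2: can you use it without paying?


def quadrant(tool: Tool) -> str:
    """Describe a tool's position on both axes, which are set independently."""
    source = "Open Source" if tool.open_source else "proprietary"
    cost = "free of charge" if tool.gratis else "paid"
    return f"{source}, {cost}"


# Illustrative entries only; the flags follow the examples in the text.
tools = [
    Tool("CDK (download)", open_source=True, gratis=True),
    Tool("CDK + paid support", open_source=True, gratis=False),
    Tool("E-Dragon", open_source=False, gratis=True),
    Tool("Dragon", open_source=False, gratis=False),  # assumption, see above
]

for tool in tools:
    print(f"{tool.name}: {quadrant(tool)}")
```

All four combinations occur, which is exactly why a single Public/Commercial column cannot capture both axes.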

Readers of my blog know that I advocate Open Source, not gratis software (see also Re: Why I and you should avoid NC licences), even though you can download many of the Open Source cheminformatics tools I worked on for free. Here it is important to realize that the CDK and Bioclipse are not free: it is just that the tax-payer covered the cost, mostly via academic institutes, along with hobbyists working out-of-office, like I have done for many, many years, and companies who saw mutual benefit. Maybe that is something to consider the next time you are wondering about donating money to an Open Source cheminformatics project, or about paying some respect to the contributors of the software you use.

Off-topic: there is a second inaccuracy in this table. For each program, they list the number of descriptors, but without units. Units, units?? Yes. For example, for the CDK they list ">40" descriptors, while for Dragon "3,224" (it puzzles me why you can count accurately above 3000, but not below 50). But the point here is that the CDK count is really the number of Java classes, each reflecting a descriptor algorithm. One algorithm can calculate more than one descriptor value, and it is those values that are counted for Dragon. The column is comparing apples with oranges. While I have never really counted it, and every CDK user can in fact tune it, the number of calculated CDK descriptor values approaches a thousand. Well, I guess that is ">40" too :(
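To make the units problem concrete, here is a small sketch with made-up data. The algorithm and value names are hypothetical and do not correspond to actual CDK classes or Dragon descriptors. The only point is that counting algorithms and counting calculated values give different numbers for the same toolkit.

```python
# Hypothetical toolkit: each descriptor algorithm (think: one Java class)
# returns one or more descriptor values. All names are invented.
descriptor_algorithms = {
    "AtomCountAlgorithm": ["nAtoms"],
    "AutocorrelationAlgorithm": ["ac1", "ac2", "ac3", "ac4", "ac5"],
    "SurfaceAreaAlgorithm": ["sa_total", "sa_polar", "sa_apolar"],
}

# Unit 1: number of algorithms (how the CDK figure was counted)
n_algorithms = len(descriptor_algorithms)

# Unit 2: number of calculated values (how the Dragon figure was counted)
n_values = sum(len(values) for values in descriptor_algorithms.values())

print(f"algorithms: {n_algorithms}, values: {n_values}")
```

The same toolkit would show up in the table as either 3 or 9 descriptors, depending on which unit is used.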

Khanna, V., & Ranganathan, S. (2010). Molecular similarity and diversity approaches in chemoinformatics. Drug Development Research. DOI: 10.1002/ddr.20404