## Tuesday, December 21, 2010

### Commercial or Proprietary?

Khanna and Ranganathan wrote up a review paper on molecular similarity (doi:10.1002/ddr.20404). I have not fully read it yet, but my eye fell on Table 1, which lists a number of programs that can be used to calculate QSAR descriptors, both open source and proprietary. However, the table features a column Availability which has two options: Public and Commercial. They qualify Bioclipse, CDK, and RDKit as public, and Dragon, MOE, CODESSA and others as commercial. Effectively, this seems to suggest that they classify them as open source versus commercial, though I am not entirely sure what they mean by public.

The authors and referees would not be the first to make this common mistake: failing to distinguish two orthogonal axes, free-versus-commercial and open source-versus-proprietary. To clarify these axes, I created this diagram (CC-BY, SVG source available upon request):

It is very important to realize that Open Source software can be commercial. For example, you can get commercial support for Bioclipse and CDK with GenettaSoft. It is also really important to realize that free software (public?) does not mean it is Open Source (or vice versa). E-Dragon is an example here: you can use it freely, but the source code is proprietary. Some years after open source cheminformatics took off, commercial providers started to provide free-for-academic-use packages, which fit into this category too.
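To make the two axes concrete, here is a minimal sketch in Python. The names `Source`, `Cost`, `Offering` and `quadrant` are invented for illustration, and the placements simply follow the examples given in this post; they describe a particular offering (such as paid support), not a vendor or a whole project.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Source(Enum):
    """Axis 1: can you read, modify and redistribute the code?"""
    OPEN = "open source"
    PROPRIETARY = "proprietary"


class Cost(Enum):
    """Axis 2: do you pay for the offering?"""
    FREE = "free"
    COMMERCIAL = "commercial"


@dataclass(frozen=True)
class Offering:
    name: str
    source: Source
    cost: Cost


# Illustrative placements only, taken from the examples in the post.
OFFERINGS = [
    Offering("CDK (free download)", Source.OPEN, Cost.FREE),
    Offering("Bioclipse/CDK with paid support", Source.OPEN, Cost.COMMERCIAL),
    Offering("E-Dragon", Source.PROPRIETARY, Cost.FREE),
    Offering("Dragon", Source.PROPRIETARY, Cost.COMMERCIAL),
]


def quadrant(source: Source, cost: Cost) -> List[str]:
    """Return the names of all offerings in one quadrant of the grid."""
    return [o.name for o in OFFERINGS if o.source is source and o.cost is cost]


if __name__ == "__main__":
    # Every combination of the two axes is a valid, distinct quadrant.
    for s in Source:
        for c in Cost:
            print(f"{s.value} / {c.value}: {quadrant(s, c)}")
```

The point of keeping two separate enums rather than one "availability" field is that a single Public/Commercial column cannot express a case like E-Dragon, or paid support for an open source tool.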

Readers of my blog know that I advocate Open Source, not gratis software (see also Re: Why I and you should avoid NC licences), even though you can download many of the Open Source cheminformatics tools I worked on for free. Here it is important to realize that the CDK and Bioclipse are not free: it is just that the tax-payer covered the cost via academic institutes mostly, as well as hobbyists working out-of-office, like I have done for many, many years, and companies who saw mutual benefit. Maybe something to consider the next time you are wondering about donating money to an Open Source cheminformatics project, and pay some respect to the project contributors of the software you use.

Off-topic: there is a second inaccuracy in this table. For each software, they list the number of descriptors, but without units. Units, units?? Yes. For example, for the CDK they list ">40" descriptors, while for Dragon "3,224" (it puzzles me why you can count accurately above 3000, but not below 50). But the point here is that the CDK count is really the number of Java classes, reflecting descriptor algorithms. One algorithm can calculate more than one descriptor value, and those values are what is counted for Dragon. The column is comparing apples with oranges. While I have never really counted it, and every CDK user can in fact tune it, the number of calculated CDK descriptor values approaches a thousand. Well, I guess that is ">40" too :(
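The difference between counting algorithms and counting values can be shown with a small Python sketch. This is not the CDK API: the three descriptor functions and the atom-list representation are hypothetical, chosen only to show that one algorithm may yield several values.

```python
from typing import Callable, Dict, List

Atoms = List[str]  # a molecule reduced to a list of element symbols (assumption)


def atom_counts(atoms: Atoms) -> Dict[str, float]:
    """One algorithm, three values: counts of C, N and O atoms."""
    return {f"n{el}": float(atoms.count(el)) for el in ("C", "N", "O")}


def heavy_atom_count(atoms: Atoms) -> Dict[str, float]:
    """One algorithm, one value: number of non-hydrogen atoms."""
    return {"nHeavy": float(sum(1 for a in atoms if a != "H"))}


def element_fractions(atoms: Atoms) -> Dict[str, float]:
    """One algorithm, four values: fraction of C, H, N and O atoms."""
    total = len(atoms) or 1
    return {f"frac{el}": atoms.count(el) / total for el in ("C", "H", "N", "O")}


# Hypothetical registry of descriptor algorithms.
ALGORITHMS: List[Callable[[Atoms], Dict[str, float]]] = [
    atom_counts,
    heavy_atom_count,
    element_fractions,
]


def count_algorithms() -> int:
    """The unit behind a class count: number of descriptor algorithms."""
    return len(ALGORITHMS)


def count_values(atoms: Atoms) -> int:
    """The unit behind a value count: number of calculated descriptor values."""
    return sum(len(algorithm(atoms)) for algorithm in ALGORITHMS)


ETHANOL: Atoms = ["C", "C", "O"] + ["H"] * 6

if __name__ == "__main__":
    print("algorithms:", count_algorithms())   # 3
    print("values:", count_values(ETHANOL))    # 8
```

Both numbers describe the same toy toolkit, so a table that reports one for some packages and the other for the rest is not comparing like with like.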

Khanna, V., & Ranganathan, S. (2010). Molecular similarity and diversity approaches in chemoinformatics. Drug Development Research. DOI: 10.1002/ddr.20404

1. Hi Egon,

This is a very good point. It's been quite some time since I used MOE, but I think a large part of the code was Open Source. If you have a license you can see and adapt the code. There used to be a way to share code with other users as well. This should be enough to call it Open Source, right?

2. Hi Peter, I am not actively reading licenses of non-free offerings. I know that EULAs and licenses are becoming more flexible (like free-for-academic-use), and this might very well involve some rights to share changes. But I find it hard to believe that MOE decided to use an OSI-approved Open Source license?

3. As the Open Source definition begins, "Open source doesn't just mean access to the source code." It must also be free to modify and distribute (among other things).

4. Hi

MOE is built using the SVL programming language. The SVL compiler is closed, and you need a license key to use it.

Most of the SVL functions that you use from day to day are defined in SVL code themselves. However, some SVL functions are defined in the binary file, and the source for these is closed.

All the descriptors in MOE are written in SVL code so you can view, change and distribute them.

You can also write code in SVL and then sell it. You can only use this SVL code with a MOE license, which is where we make our money. You can't modify SVL code that comes from the release and sell that, so the MOE license agreement is not OSI-approved.

MOE and SVL are 100% commercial - it was not developed in academia.

Cheers

Andrew Henry (CCG support)

5. This comment has been removed by the author.

6. You combined two different meanings of "commercial" when you say that Bioclipse is "commercial / open source" because it has commercial support.

"Commercial" usually means that you pay money to get access to the software. It's part of commerce. For example, I could sell a copy of MegaDescriptor 4.3 to you, under the GPL. That would be commercial open source, but that possibility isn't included in your grid.

If you provide support then you are providing a service. That's a different category than getting the source. If someone wants to pay me to support OpenBabel or RDKit then I'll do that. Does that make OpenBabel or RDKit also commercial open source?

Also, could you provide a timeline to your history? When did "open source cheminformatics [take] off"? I assume with OELib, which was OpenEye's GPLv2 library which became OpenBabel. OELib definitely had commercial support then.

OpenEye's proprietary replacement toolkit has always had a free-for-research license. That happened in about 2000 so "some years" earlier means 1997 or earlier, but I can't think of what open source package you're talking about from that era which could be said to "take off."

7. @Andrew, the commercial axis refers here to paying for the product. This can indeed mean various things: commercial support, commercial (contract) development, or paying for the CD to get the software.

The purpose of the plot is to show that open source products are not necessarily free (as in free beer), and that proprietary products are not necessarily commercial (sold).

Your example MegaDescriptor would most certainly fit into the diagram: top-right quadrant. The 'usual' meaning of 'commercial' is irrelevant, as it is based on old economic models built around physical things being sold.

8. @Andrew... can you shed some light, perhaps, on how much of Dalke Scientific is based on support for proprietary products versus support for open source products? A time line showing that ratio would be really interesting too!

9. I researched this some and found that my statement on what "commercial software" means is at odds with the FSF definition. I will defer to them and include support agreements as part of being commercial, as you also do.

I do think that monetary exchange to get access to the software itself is different than support. I've often pointed out a problem with saying that research software must necessarily be free-as-in-freedom software so that people can access and review the source: I can still require the first person to pay $10,000 to get the GPL code, and that hinders effective software review. I also disagree that this is an outdated model, as it's very close to the custom software I develop. Still, this is a different topic than this thread.

My reply was too long for blogger.com so I've broken it up into parts.

10. As for my income, I haven't broken it down but I think about 10% comes from developing non-chemistry open source software for general release, about 5% comes from chemistry-released open source. About 25% of my income comes from training (which uses a mix of open and proprietary software), and the rest from developing in-house software "for-hire" where my clients get whatever I write. These rarely see the light of day.

As for the timeline: this is the second year I've managed to find someone willing to pay me for something which directly supports free chemistry software. The first was 10 years ago and it was for a PyDaylight extension to better support the proprietary Daylight toolkit.

Financially I regard the free software I develop as advertising for my skills as a custom software developer, consultant, and trainer, and as a way to improve my overall skill set.

11. (Apologies if there's a duplicate. Blogger.com didn't seem to like my first attempts to post.)

I was not that clear in my first response so I'll try again. You wrote: "Some years after open source cheminformatics took off, commercial providers started to provide free-for-academic-use packages, which fits into this category too."

OpenEye distributed OELib, which was a GPLv3 cheminformatics package, with commercial support, starting in 1998 or 1999 at the latest. That's when I first saw the code but it may have been a pre-release. Some of the history is at http://demo.eyesopen.com/support/misc/WhyNotOELib.html . In 2002 they released a proprietary toolkit called OEChem and stopped OELib development. This included a no-cost license for academics. OELib forked off and became OpenBabel.

So when you write that "some years after open source cheminformatics took off", do you mean that OELib is the point where open source cheminformatics took off? If not, then when did it take off and what was the code base? If it was later than 1999 then your timing is off, because free-for-academic-use packages started no later than 2002.

I suspect others had free academic licenses as well since my memory is that was pretty common in the 1990s, but I have no evidence of that. The only toolkits I can recall at the moment from that era are ChemSymphony (50% academic discount according to archive.org), Daylight (no license information available but I vaguely recall some sites had a no-cost license) and MDLi (web site not archived). What records do you have which say when no-cost-to-academic licenses started?

12. @Andrew: "I do think that monetary exchange to get access to the software itself is different than support."

If I was not clear on this, I think that these are different models too.

"I also disagree that this is an outdated model"

I think proprietary business models have their place; I would not call them outdated. I do think they are not good for progress in science. We have repeatedly spoken of Open Source and code review, but the latter is most certainly not the only reason for Open Source in Science.

Regarding OpenEye's OELib: I think this has been an important step in the take off of open source cheminformatics. It has given rise to two existing projects indeed: JOElib and OpenBabel.

"OpenEye distributed OELib, which was a GPLv3 cheminformatics package, with commercial support, starting in 1998 or 1999 at the latest."

That means at about the same time as, or after, Jmol and JChemPaint came into existence. My first encounter with open source 'cheminformatics' was Sun's XYZApp, which I changed at the time to support PDB files: http://java.sun.com/applets/jdk/1.3/demo/applets/MoleculeViewer/XYZApp.java

This applet goes back to at least 1997. This is surely not much of cheminformatics yet, but for me it was the first encounter with it anyway. It was part of at least Java 1.1.6: http://stuff.mit.edu/afs/athena.mit.edu/software/java/java_v1.1.6/distrib/i386_rh9/demo/MoleculeViewer/

It is a shame that I do not have many backups from before 1998, and I was certainly not using version control systems back then :(

OpenEye was ahead of the industry in making OELib GPLv2. They gave up after 4-5 years. It took the CDK about ten years, and even now I have to fight hard to get it maintained properly.

Where open source cheminformatics is taking off is that pharma industries are picking up open source cheminformatics software (and have been for many years now), but are also starting to openly contribute back. Open Source cheminformatics has not reached the level of offerings from proprietary vendors, but it has something to offer that those vendors cannot: open source and lower prices. This is a trade off, but the interesting thing about open source is that the functionality just keeps building up.

13. @Andrew, it is also worth pointing out that I refer to products in this diagram, not vendors. Moreover, even products are not well defined, as you can surely download Bioclipse for free too. Several more traditional commercial vendors have started releasing Open Source cheminformatics software, including Schrödinger and Molecular Networks. But that's the point of my blog post anyway.

14. (Oops, I meant to write "GPLv2" and not "GPLv3." Thanks for the correction!)

While you started with JChemPaint and XYZApp, I would say there are better reference points for this discussion. We looked at JChemPaint in about 1998. Its SMILES support was limited to the organic subset and some of the data structure implementations were such that O(n) operations with the Daylight toolkit were O(n**2) with JChemPaint. While we could have fixed all of those, we were not a Java shop and we wanted to work on HTS data mining, not base cheminformatics implementations which we could license from OpenEye.

I bring this up not to besmirch JChemPaint but to emphasize that you can't compare its start date to the date when OpenEye had a commercial offering of OELib. Otherwise you would have to pull back to Babel's first incarnation in about 1991 or so. If you dig up the old Babel documentation it's open source and has a call for contribution and a published mechanism for adding new I/O routines.

While XYZApp is important for you, RasMol is a better general example from the structure world. It was released to the public domain in about 1992, and the author, Roger Sayle, was hired by Glaxo to continue to develop it and release it in the public domain. People could, and did, include it in their own projects and the proprietary Chime is a descendant of that effort. And there was a lawsuit because of that branch.

I bring these up because open source and pharma have at least a two decade history, so the trade off you talk about can't be the only dynamics at play. I presume these are factors Matt Stahl of OpenEye discussed in "Open-source software: not quite endsville" (Drug Discovery Today, v10, pp219-222 Feb 2005, doi:10.1016/S1359-6446(04)03364-1 ) but I haven't read it for 5 years and it's behind a $30 pay wall.

15. Rasmol is indeed an open source project that goes back further still. We had it installed on our university network when I started working on Sun's MoleculeViewer code. I do not remember whether I knew at the time if it was open source or not.

Neither do I claim that JChemPaint or Jmol were competitive. Proprietary cheminformatics offerings have a much longer history, and of course, have a very significant head start.

I am sorry that my comment about the 'take off' of open source cheminformatics has caused so much grief. It's not important. It was certainly not the point of the blog post. Really, I do not care who did what open source cheminformatics project, or who started first.

Fact is, open source cheminformatics has taken off, and is here to stay. Notice while OEChem Dr. Who (per PMR's terminology) at some point was OpenEye, it has been taken over by others. The proprietary equivalent does exist too, and is typically shaped as buy-outs.

Adoption of open source cheminformatics has many perspectives embedded in trade offs. Everyone makes a different one.

I always read Stahl's paper as rather supportive of Open Source. He observed problems with projects around open source, but he does not actually show that open source is the problem (though others do read it that way). Instead, he observes that the business model around open source is different. If you consider the current company-around-proprietary-code as the only viable solution, then models around open source at the time did not do well.

The world has changed, and e.g. KNIME is doing very well around open source, and as indicated, companies around proprietary products see themselves move to businesses around open source too (Schrödinger, Molecular Networks).

I can only guess what trade off they made, but they made one and decided open source for those projects was the way to go.

That is quite a change from 10 years ago.

It is also acknowledged that the original cheminformatics as developed 50 years ago, had little to do with open source or proprietary. Some people say people did not really care anyway, because getting your code running on a different machine was a problem in itself.

This caused some to believe one needs a commercial entity to bring the product to the market.

Surely, the internet, Open standards, etc, have changed all this. Open Babel, CDK, Rasmol, and many more, are installable with a click on a button, making the support needed for installing cheminformatics software minimal.

There are very many sides to all this. I really don't care about the exact definition of 'taken off', and am happy with mine. I would find it very interesting to write up the history of open source cheminformatics, though I am not sure I can ever find funding to do this.

Here, an open community approach may be the only viable solution.