Tuesday, June 30, 2009

Making patches; Attribution; Copyright and License.

I have discussed this in the past on mailing lists, but realized yesterday that I need to strengthen the message a bit more. Just to remove confusion. The below is extracted from an email I sent this morning to the cdk-user mailing list, but I'm sure you can apply this to any other open source project. (Disclaimer: I have not studied international law, and the below cannot be used as legal advice. Like you would have! Hahahaha! Let it be pointers :)

1. What is that copyright/license header in that .java source file?

This header looks something along the lines of:

/* Copyright (C) 2000-2007 Christoph Steinbeck
* 2001-2007,2009 Egon Willighagen
* Contact:
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
* as published by the Free Software Foundation; either version 2.1
* of the License, or (at your option) any later version.
* All we ask is that proper credit is given for our work, which includes
* - but is not limited to - adding the above copyright notice to the beginning
* of your source code files, and to any copyright notice that you may distribute
* with programs based on this work.
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*/

This header has two major pieces: 1. who has the copyright on this file; 2. what is the license that makes it Open Source.

This is crucial information, and the CDK has a bad history in keeping track of, primarily, the first. Many source files, actually, still list the Chemistry Development Kit Project as copyright owner, which is a false statement, as the CDK Project is not a legal entity and therefore, in many countries, not allowed to own copyright. Moreover, none of the contributors ever signed a legal paper to re-assign copyright to this project anyway, like we do with many of our ACS papers.

2. But doesn't the Git/SVN/CVS history have this copyright owner information?

It is true that the Git, SVN and CVS histories of the CDK source code contain a lot of information on this. However, this does not help, because this information is lost when we distribute our source code. And when others distribute our source code (e.g. Debian and Ubuntu), they have no means of keeping track of this.

Therefore, we must properly annotate our source files with this information.

3. Why is there a contact email?

Because the CDK project should still be the single point of entry regarding the source code.

4. What's with all those years listed in those copyright ownership lines?

Seriously, I have no clue, but every serious project does it, so there must be some legal reason. Indeed, it sounds logical to only list the years when you actually made changes. In the above hypothetical header, I made changes between 2001 and this year, but not in 2008. It might have to do with proper establishment of the end of the copyright period; see below.
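To illustrate the notation only: the year list in the hypothetical header above (2001-2007,2009) is simply the set of years with changes, with consecutive years collapsed into ranges. Below is a minimal Java sketch of that collapsing. The class and method names are made up for this post and are not part of the CDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of the CDK. */
final class CopyrightYears {

    /** Collapses a sorted set of years into comma-separated ranges. */
    static String format(SortedSet<Integer> years) {
        List<String> parts = new ArrayList<>();
        Integer start = null; // first year of the current run
        Integer prev = null;  // last year seen so far
        for (int year : years) {
            if (start == null) {
                start = year;
            } else if (year != prev + 1) {
                // gap found: close the current run and start a new one
                parts.add(range(start, prev));
                start = year;
            }
            prev = year;
        }
        if (start != null) {
            parts.add(range(start, prev));
        }
        return String.join(",", parts);
    }

    private static String range(int first, int last) {
        return first == last ? String.valueOf(first) : first + "-" + last;
    }

    public static void main(String[] args) {
        SortedSet<Integer> years = new TreeSet<>();
        for (int y = 2001; y <= 2007; y++) {
            years.add(y);
        }
        years.add(2009); // no changes in 2008
        System.out.println(format(years)); // prints 2001-2007,2009
    }
}
```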

5. When must I add my name to the copyright lines?

When you have made a non-trivial contribution to the source code; make sure you do this for each such contribution. By adding your name, you make clear that:

a. you are the original author of that contribution (and not someone else)
b. you release the software under the given license

This is the information (re)distributors need in order to know whether they are working within the boundaries of the law.

6. Should I complain about people not adding this information in their patches when reviewing those contributions?

Yes, you should.

7. What about all those files that still list Copyright (C) The CDK Project?

File bug reports. For each file, we need to read the commit history, extract the authors of all non-trivial contributions and when those contributions were made, and update the copyright lines.
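Finding the affected files could be partly automated. The following is a minimal sketch, assuming the header uses "Copyright (C)" lines like the example above. The class name, the regular expression, and the two placeholder spellings it looks for are my assumptions and not an existing CDK tool. It only flags the files; working out the real authors from the commit history remains manual work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical checker for illustration; not an existing CDK tool. */
final class CopyrightHeaderCheck {

    // "Copyright (C)", an optional list of years, then the holder's name.
    private static final Pattern COPYRIGHT_LINE =
        Pattern.compile("Copyright \\(C\\)\\s+([0-9][0-9,\\s-]*)?(.+)");

    /** Returns the holders in the header that name the project instead of a person. */
    static List<String> placeholderHolders(String header) {
        List<String> found = new ArrayList<>();
        for (String line : header.split("\\R")) {
            Matcher m = COPYRIGHT_LINE.matcher(line);
            if (!m.find()) {
                continue; // not a copyright line
            }
            String holder = m.group(2).trim();
            String lower = holder.toLowerCase();
            // Assumed spellings of the placeholder owner.
            if (lower.contains("cdk project")
                    || lower.contains("chemistry development kit")) {
                found.add(holder);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String header = "/* Copyright (C) 2003-2007 The CDK Project\n"
                      + " * Contact:\n"
                      + " */";
        System.out.println(placeholderHolders(header)); // prints [The CDK Project]
    }
}
```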

8. Must the header always list the LGPL as license?

No. The LGPL is our license choice, but if you used code written by someone else under another (compatible) license, that original license applies, and that is the license you need to provide in the header.

Additionally, do not forget to list the original copyright owners.

9. Can I rewrite GPL C++/C code as LGPL for the CDK?

Not entirely related to the above, but relevant. I once asked the FSF about this, and rewriting a piece of code in another language is *not* a clean room implementation and does, therefore, not erase original copyright ownership nor license applicability. Therefore, we cannot base CDK implementations on, e.g., GPL-licensed C++ code, such as in OpenBabel.

10. How can I use GPL code in the CDK?

You cannot. All code depending on GPL code must be GPL too. There is the ChemoJava project to hold GPL-licensed CDK-based code, which has a number of classes that use the CDK library and depend on GPL libraries.

Alternatively, you can try to convince the original authors to relicense. A good recent example here is the UFF force field implementation of OpenBabel, which was relicensed (or dual licensed) as LGPL and is now also available from Jmol. Incidentally, this is a reason why it is important to have those copyright lines include the email address, so that in the future people can contact the original authors of code. In a far future, this is even used to decide when copyright no longer applies, because the original authors are dead by then.

(Thanx to Noel for replying to the mailing list!)

JChemPaint hack session at Uppsala

Arvid and I had a meeting on the ControllerHub refactoring, to make it modular to the bone. Actually, it is the IChemModelRelay that needs refactoring. This is what we wrote down:

In this picture you can see our priorities (1,2 for Arvid, and A,B for me).

1 The above mentioned relay will get refactored so that each method becomes a separate class, which at the same time will provide the undo/redo functionality. We might even have undo at a scripting level :)

2 Second item on Arvid's list is extending the mouse relay to handle key modifiers, an unfortunate omission in that design. This is needed to support the Ctrl- and Apple's Command-based selection approach.

A To ease the footprint reduction of JChemPaint, we are going to split the current render module into render and renderextra. The second may see future split ups, and both may see name changes, but the first go will split them according to requiring an IAtomContainer data model or an IChemModel data model.

This will help us clean up dependencies, by forcing us to have the core functionality not pull in, for example, reaction functionality. Additionally, isotope related rendering will go into renderextra, thereby removing the dependency on the IsotopeFactory and the associated 800kB-sized isotope.xml data file.

This will not immediately help the applet class partitioning and indexing, but it will help us to keep a sane overview of all the stuff we have around.

B The goal is to merge this work on JChemPaint back into the CDK library, so that we again have a CDK version with a fully working editing environment, as did CDK 1.0. However, that requires the code to be stable, which includes full unit testing, no PMD violations, and complete JavaDoc. However, as I wrote this morning to the cdk-jchempaint mailing list:

    But, there is a lot of clean up to do. I counted 497 missing test
    methods; 326 PMD violations, and saw a lot of missing JavaDoc. This
    means, that the current patch is pretty messed up indeed, and we are a
    long way away from seeing a merge with CDK master :(
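
To make item 1 above a bit more concrete: turning each relay method into its own class is essentially the command pattern, where every edit knows how to apply and how to undo itself. The sketch below is only an illustration of that idea. All names (Edit, AddAtomEdit, EditHistory) are hypothetical and the data model is a toy list of element symbols; this is not the actual IChemModelRelay or JChemPaint code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One editing operation; hypothetical interface, not the JChemPaint API. */
interface Edit {
    void apply();
    void undo();
}

/** Stand-in for what used to be a single relay method, such as "add atom". */
final class AddAtomEdit implements Edit {
    private final List<String> atoms; // toy data model: element symbols only
    private final String symbol;

    AddAtomEdit(List<String> atoms, String symbol) {
        this.atoms = atoms;
        this.symbol = symbol;
    }

    @Override public void apply() { atoms.add(symbol); }

    // Only valid when edits are undone in reverse order, as EditHistory does:
    // the atom appended by apply() is then the last element.
    @Override public void undo() { atoms.remove(atoms.size() - 1); }
}

/** Keeps applied edits on a stack so they can be undone and redone. */
final class EditHistory {
    private final Deque<Edit> undoStack = new ArrayDeque<>();
    private final Deque<Edit> redoStack = new ArrayDeque<>();

    void execute(Edit edit) {
        edit.apply();
        undoStack.push(edit);
        redoStack.clear(); // a fresh edit invalidates previously undone ones
    }

    boolean undo() {
        if (undoStack.isEmpty()) return false;
        Edit edit = undoStack.pop();
        edit.undo();
        redoStack.push(edit);
        return true;
    }

    boolean redo() {
        if (redoStack.isEmpty()) return false;
        Edit edit = redoStack.pop();
        edit.apply();
        undoStack.push(edit);
        return true;
    }

    public static void main(String[] args) {
        List<String> atoms = new ArrayList<>();
        EditHistory history = new EditHistory();
        history.execute(new AddAtomEdit(atoms, "C"));
        history.execute(new AddAtomEdit(atoms, "N"));
        history.undo();
        System.out.println(atoms); // prints [C]
        history.redo();
        System.out.println(atoms); // prints [C, N]
    }
}
```

A scripting layer could, in principle, create the same Edit objects and push them through the same history, which is how undo at a scripting level would come almost for free.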

Friday, June 26, 2009

Michel Dumontier at Uppsala University

Michel visits our group this week and gave a very exciting talk yesterday on the role of ontologies in drug discovery. This being ongoing research in our group too, the talk was well received by the audience (which was not too large, because after mid-summer, Uppsala has holiday). For the first time, I microblogged a talk on my twitter account (using the #dumontieratuppsala tag). I have not got an XSLT ready to convert the relevant items into a nice HTML snippet for embedding in this blog, but will try to do that later. Meanwhile, I also made a few bookmarks here and there, which are available from Delicious.

The rest of the day, we talked about various ontology, bio- and cheminformatics related stuff. We looked at SADI, Bioclipse (and my RDF extension, see these JavaScripts), Bio2RDF, and Virtuoso.

Sunday, June 21, 2009

The Dr Who's of Life Sciences

Peter recently wrote up a model of how several Blue Obelisk (please contribute to the page!) projects changed in history: The Doctor Who Model of Open Source. This was later picked up by Glyn and then by Slashdot (second time Peter got that fame; that's one of the advantages of working at a well-known institute, instead of something like Uppsala University. Besides Bioclipse, GROMACS and the CDK, MySQL AB actually has a headquarters here.) Thanx to Chris who pointed me to the Slashdot coverage.

Now, the several blogs and the Slashdot item contain interesting discussions on whether the 'Dr Who' model is the best model of how open source projects can evolve. Fact is, at least, that the model does not describe a new phenomenon; Peter merely describes how the Blue Obelisk deals with the limited resources we have in cheminformatics, and that the succession of project leaders serves both the interest of the scientists (who are generally not paid for development or, $Foo forbid, maintenance of scientific data analysis methods) and the project itself. This makes life science open source different from most pure-IT projects: open source academic software is always something on the side.

So, when Miguel Howard turned to Jmol, he had seemingly unlimited resources to work on Jmol and he had great ideas and made them work: Miguel is the father of the now so popular Jmol applet with scripting functionality. It did mean the end of the integration with the CDK I worked on, as planned by the original Jmol author Dan Gezelter, Christoph and me in 2000: the CDK data model was too slow (it is amazing how fast Jmol is, without using accelerated graphics! See this Nature Precedings paper: DOI:10.1038/npre.2007.50.1). My attention was better spent on the CDK.

Now, if the need arises, and the current Jmol head Bob loses interest or time, I'll be available to take over again. That is less likely to happen for an older Dr. Who actor. Several Slashdot commenters also pointed out that the model matches the 'drummer-in-a-band' model. I guess, or lead-singer... This moved the discussion to what the model exactly models. Peter writes:
    "Instead the Blue Obelisk community seems to have evolved a “Doctor Who” model. You’ll recall that every few years something fatal happens to the Doctor and you think he is going to die and there will never be another series. Then he regenerates. The new Doctor has a different personality, a different philosophy (though always on the side of good). It is never clear how long any Doctor will remain unregenerated or who will come after him. And this is a common theme in the Blue Obelisk."

This brings me back to the earlier observation I wrote down: science is different, and Peter is right when he says "you think he is going to die and there will never be another series". This thought is justified for many open source science projects; in Glyn's blog there is a remark about the lack of data, but I think that if someone were to count the number of dead open source science projects, the outcome would be that the fear is highly justified.

This is likely also the power of the Blue Obelisk: it creates a lively and rewarding community of like-minded people, forming an eco-system where the individual projects can flourish. Maybe someone can come up with a good metaphor for the Blue Obelisk, matching the Dr Who model? BBC comes to mind: is the BBC an eco-system where small TV series can survive?

Anyways, my father used to watch Dr Who, and being compared to Dr Who is much more rewarding than being compared to a drummer in some band.

Thursday, June 18, 2009


The Uppsala and EBI CDK-teams have been working hard on finishing the rewrite of JChemPaint I started with Niels earlier. While the EBI-team focused on the applet (and Swing application), the Uppsala team, obviously, focused on the SWT side, for integration into Bioclipse. The new JChemPaint is reaching a useful state, and below is a quick screenshot of something Arvid has been working on:

It shows a periodic table which allows you to drag any element type onto the JChemPaint drawing area. It is using regular drag and drop functionality, allowing you to create any arbitrary pseudo atom too. This also paves the way for a template system, allowing you to drag-n-drop fragments onto an active JChemPaint editor.

Wednesday, June 17, 2009

No, PDFs really do suck!

A typical blog post by Peter MR, The ICE-man: Scholary HTML not PDF, made (again) the point that PDF is to data what a hamburger is to a cow, in reply to a blog post by Peter SF, Scholarly HTML.

This led to a discussion on FriendFeed. A couple of misconceptions:

"But how are we going to cite without paaaaaaaaaaaage nuuuuuuuuuuumbers?"
We don't. Many online-only journals can do without; there is DOI. And if that is not enough, the legal business has means of identifying paragraphs, etc, which should provide us with all the methods we could possibly need in science.

Typesetting of PDFs, in most journals, is superior than HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes.
Ummm... this is supposed to be Science, not a California Glossy. It seems that pretty looks are causing a major body count in the States. Otherwise, HTML+CSS can likely beat any pretty looks of PDF, or at least match them.

As I seem to be the only physicist/mathematician who comments on these sort of things, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time.
HTML+MathML is well established, and default Firefox browsers have no problem showing mathematical equations. The Blue Obelisk QSAR descriptor ontology has been using such a setup for years. If you use TeX to author your equations, you can convert them to HTML too.

We can mine the data from the PDF text.
Theoretically, yes. Practically, it is money down the drain. PDF is particularly nasty here, as it breaks words at the end of a line, and can even make words consist of unlinked series of characters positioned at (x,y). PDF, however, can contain a lot of metadata, but that is merely a hack and an unneeded workaround. Worse, it is hardly used for chemistry. PDF can contain PNG images which can contain CML; the tools are there, but not used, and there are more efficient technologies anyway.

I, for one, agree with Peter on PDF: it really sucks as a scientific communication medium.