Monday, May 18, 2009

Open Data: license, rights, aggregation, clean interfaces?

A recent post by Cameron on his visit last week with Nico, Peter and Jim discussed Open Data licensing. This led to an interesting discussion on these matters, and to questions from me on why people care so much about only public domain data (or data licensed with PDDL or CC0).

Open licensing for data has not matured as much as it has for software, and international law seems to be more confusing about the issues. I guess that is because data aggregation has been around since long before the computer era. The PDDL and CC0 both try to overcome this fuzziness. But there is another issue we need to keep in mind. A lot of useful Data was aggregated and made Open before these licenses came about, and uses, for example, the GNU FDL, as the NMRShiftDB does.


Right now, there are two Open Data camps, much like the BSD-vs-GPL wars in Open Source: one that believes in waiving any rights on the Data, indicating that facts are free; and another that believes that data must be protected so that it is not eaten by big companies and lost to the community (e.g. the WolframAlpha arrangements are suspect).

Of course, the two camps are not that far apart, and both believe Open is important. Interestingly, there are some noteworthy differences with the Open Source wars. I see parallels between the two, which reveal an important difference: Open Source has algorithms (uncopyrightable) and implementations (copyrightable); Open Data has Data (uncopyrightable) and aggregation (copyrightable). Open Source talks mostly about the implementation, not the algorithm; it's Open Source, not Open Algorithms after all. In cheminformatics it is often even the case that the algorithms are not specified at all and that there truly is only source.

However, the name Open Data does not make this distinction. Data is fairly cheap, and acquisition can be automated and computerized. Aggregation, on the other hand, requires human involvement: curation, thinking about data models, etc. This is where the added value is. Compare an assigned NMR spectrum with the raw data returned from the spectrometer.

It is this added value that people want to protect, not the data itself. I think.


One important argument that tends to show up when people argue for PDDL and CC0 is that it makes data aggregation easier. This is most certainly true: if you can do whatever you like with a blob of data, that also means you can aggregate it with any other blob of data. However, copyleft licenses, like the GNU FDL, require the aggregation to have a compatible license too. It is the license incompatibilities that make this impossible. Or ... ?

Open Source has matured to such a point that it is fairly clear what the intended behaviour is regarding derivatives. An aggregation of software (typically referred to as a distribution) is only a derivative under certain conditions. This makes it possible to run proprietary software on top of GNU/Linux, which uses the GNU GPL but does not require software running on top of it to be GPL too. Unless... unless no clear, well-defined interface has been used, indicating a strong dependency. Now, surely, these things have not been confirmed to match actual law in court, but the intentions are clear.

Clean Data Interfaces?

Now, if we translate this to Open Data, would there be an equivalent of a clean interface? Can we build a data distribution with data of various licenses? I think we can! I am not a lawyer, so please consider this an invitation to discuss these matters...

Let's start simple... if I put a GNU FDL image in this blog, by linking to it with an open, free, clean HTML interface (<img src=""/>), would that make my blog GNU FDL too? I don't think so. Surely, I would need to list the copyright owner, and would actually be required to include the GNU FDL in my blog too, but I hope linking to the license text would suffice. (Let's skip fair use at this moment, and assume the use goes beyond fair use.) Question: am I not using a clean interface, and would this not keep the image's license from infecting my blog?

A more difficult example: consider my website, which surely aggregates facts, including data from the NMRShiftDB and DBPedia. I am using unique identifiers here, the NMRShiftDB compound ID and the DBPedia URL, which surely is GNU FDL, and use these to make a <owl:sameAs> statement. Again, please do not consider fair use, which this certainly is. But let's say I put some more DBPedia and NMRShiftDB data in this aggregation. The GNU FDL data on it would be separate RDF blocks, with proper dc:license and dc:author annotation. But each block would be part of a larger aggregation. The clean interface here is the Resource Description Framework.
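To make the idea of separate, annotated blocks a bit more concrete, here is a minimal Python sketch. It is not real RDF tooling: plain tuples stand in for triples, and every URI, block name, author string and the CC0 license on my own linking statements are hypothetical placeholders chosen only for illustration. The point is just that each block keeps its own dc:license and dc:author, so the aggregation never loses track of which license covers which statements.

```python
from dataclasses import dataclass

# All identifiers below are hypothetical placeholders, not real URIs.
OWL_SAME_AS = "owl:sameAs"


@dataclass(frozen=True)
class Block:
    """One RDF-like block: triples plus the annotations that cover them."""
    name: str
    license: str    # stands in for dc:license
    author: str     # stands in for dc:author
    triples: tuple  # (subject, predicate, object) tuples


def aggregate(blocks):
    """Collect blocks side by side, keyed by name, without merging them."""
    return {block.name: block for block in blocks}


def licenses_in(aggregation):
    """List every license that occurs somewhere in the aggregation."""
    return sorted({block.license for block in aggregation.values()})


def triples_under(aggregation, license_name):
    """Return only the triples from blocks carrying the given license."""
    return [
        triple
        for block in aggregation.values()
        if block.license == license_name
        for triple in block.triples
    ]


nmrshiftdb_block = Block(
    name="nmrshiftdb",
    license="GNU FDL",
    author="NMRShiftDB contributors",
    triples=(("ex:compound/1", "ex:hasSpectrum", "ex:spectrum/1"),),
)
dbpedia_block = Block(
    name="dbpedia",
    license="GNU FDL",
    author="DBPedia contributors",
    triples=(("dbp:Example", "rdfs:label", "Example"),),
)
# Assumption: my own linking statements are under CC0.
link_block = Block(
    name="links",
    license="CC0",
    author="blog author",
    triples=(("ex:compound/1", OWL_SAME_AS, "dbp:Example"),),
)

aggregation = aggregate([nmrshiftdb_block, dbpedia_block, link_block])
fdl_triples = triples_under(aggregation, "GNU FDL")
```

Whether keeping the blocks apart like this is legally enough is exactly the open question; the sketch only shows that it is technically easy to do.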

This second case does not only affect my website. Bio2RDF, for example, is in the same situation: it aggregates and distributes DBPedia's GNU FDL data (e.g. hexinanose). Does that make the whole of the Bio2RDF database GNU FDL? They too use RDF as a clean interface.

Call for Discussion

Despite what one of the two camps would like to see, the mere fact that making data aggregations adds value will keep copyleft licenses around. Instead of trying to convince everyone of the virtues of PDDL- and CC0-like licenses, we should think about to what extent it really matters.

I can do my data analysis with data sources of various licenses. I can search and retrieve data from various sources with various licenses. What obstacles are really there that keep us from doing science? Do the data interfaces we have now not provide enough technical means to address the license incompatibilities? They have in Open Source, so why would that not apply to Open Data too?


  1. Hi Egon,

    Have you seen the announcement about Wikipedia being dual licensed (soon) between GFDL and the CC-BY-SA license?

    About your point about redistributing data and the licenses on it: we segment the overall Bio2RDF dataset into downloads and endpoints based on the datasources, so you can avoid getting information from one database by not querying that endpoint. It becomes a little complicated when there are multiple references pulled in, but no one seems to think that linking to a piece of information (such as a DBpedia URI) makes your content come under the DBpedia license, as far as I have heard anyway.

    I am not sure what license you would apply to things like the Amazon S3 combined Bio2RDF virtuoso backup set that Openlink provide, since it completely loses the distinction between the different datasources that make up the Bio2RDF information base. The licenses that each of them is provided under would then have to be considered.

  2. Hi Peter,

    Yes, I have seen the WP dual-licensing, and made my vote earlier.

    Linking is indeed a fairly safe operation... it gets fuzzier when you use identifiers native to a particular database... e.g. using the CAS registry number for linking is not allowed on a large scale without written approval.

    I think it is largely undefined what aggregation is, but intuitively, having one point of access shows aggregation. So, GNU FDL data aggregated with CC-BY data into one database sounds like an interesting...

    I think putting the data in different databases with different SPARQL entry points defines a very clear interface, and will not aggregate the data in the above sense. Bio2RDF seems to do this pretty well indeed.