Friday, December 23, 2016

Facts, Data and Open Data


Source, CCZero.
I was recently asked about my experiences with data sharing, and in particular the legal aspects of it. Because whether we like it or not (I think "we" generally do not like it and I see many scholars ignore it), society has an impact on scholarly research. In particular, copyright and intellectual property (IP) laws make research increasingly expensive. I wrote up the following aspects related to that discussion. I am not a lawyer, and these laws are different in each country (think about facts, governmental output, etc). Your mileage may vary.

#1 Don't give away your copyright to any single other party

Scholars are used to this. For a very long time we would freely give our research IP to publishers. By selling that IP, publishers would fund the knowledge dissemination (often with huge profits). But institutes have started thinking about this, and are backtracking on it. Bottom line: do not give away your copyright.

This matters because once you give it away, you lose all control over the data. You will no longer be able to give your data to others, because it is no longer yours. Also, you can never repurpose the data, because it is no longer yours. Instead, give others the rights to work with the data, by waiving copyright or by giving people a suitable license (see the next point).

#2 The three pillars of Open: the rights to (re)use, modify, and redistribute

Really, these three points are critical: together they give anyone the rights to work with the data.

(Re)use is clear.

The right to modify is critical because it is needed for changing the format in which the data is shared (e.g. create ISATab-Nano) but also for data curation!

Redistribute is the right that anyone needs to make your data available to others. In fact, all those EULAs (end-user license agreements) that all of us sign when creating an online account give Google, Facebook, etc, etc the right to reshare (some of) the data you share with them. Clearly, without this right, ECHA, eNanoMapper, CEINT, etc cannot reshare the data with others.

#3 Copyright

Copyright law around data is very complex. For example, there are huge differences between the law in European countries and in the USA. The latter, for example, has the concept of "public domain" that many European countries do not have (though we still happily use that term here too). In Europe, databases have database rights. Facts are excluded, but I have yet to find a clear statement of what a "fact" is. But a collection of facts is the outcome of a creative process (like any EC FP7 or H2020 project) and hence has copyright.

For starting projects, the consortium agreement (CA) defines how this is dealt with. And just as you can give the copyright of a research paper to a publisher, a CA can define that all partners of a project have shared IP. That ensures they can all use it, but it also means it becomes really hard to share it outside the consortium. Instead, my recommendation is to keep the IP with the data creator, and make it available within the consortium with a license. Or just waive the copyright. Copyright with one legal department can already be complicated, and if you have multiple legal departments discussing IP, it certainly does not become easier.

Of course, consensus among all partners is best. I also stress that laws are just tools. Any partner can give others more rights without problems. They cannot hide behind laws. Ideally, the writing of each project proposal starts with a formal consensus on how data will be made available. Solve that before you get the money. But I will write more about that later during these holidays.

#4 Licenses and waivers

The open source community realized these issues decades ago. First with source code, leading to Open Source Initiative (OSI)-approved licenses, providing the aforementioned rights. For source code, there are also so-called waivers. The difference is that a license gives you specific rights, while a waiver waives any rights that any law (from any jurisdiction) might automatically give the creator. For the three "pillars" the outcome is the same: you will have those three rights. In case of a waiver, you just get any other right you can think of too, whereas a license is limited to those rights specified in the license.

Now, these ideas developed in the open source community found their way to the "Open Access" (OA, for documents) community and the "Open Data" community in the last 10 years. Some lobbying forces managed to clutter the definition of Open Access, which is why the community talks about green OA and gold OA. The first is not really Open and does not give you all three rights. Gold Open Access does. You cannot reshare a green OA article.

For data there are basically two options:

  • licenses: Creative Commons (CC) license
  • waiver: CCZero (not a licence)

For the first option, the licenses, the CC licenses come in various flavors, and this is implemented with "clauses". For example, there is an "attribution" clause. This creates the CC-BY license as you know from gold Open Access journals. This clause gives you the three rights, but also requires you to cite where you got the data.

A second CC clause is the ND (No Derivative) clause, which defines that no one can make derived products. Effectively, it removes one of the three rights. It exists with the idea that some things are not meant to change. Think for example about the JRCNMxxxx codes for nanomaterials. No one should be changing them, because it would defeat the purpose of the definition of those codes.

A third CC clause is the NC (Non-Commercial) clause. This clause specifies that you can only use the data for non-commercial purposes. Some publishers use it in their implementation of "Open Access", and it basically says that only some people get the three basic rights. Now, who "some" is, is not clearly defined. Not legally, not practically. No one really knows when something is commercial and when not. Some legal experts have argued that some American universities are commercial enterprises (source needed). In Europe, SMEs are clearly commercial entities.

A final CC clause is the SA (Share Alike) clause, which requires that people redistributing your data also make it available under the same license. In the open source community this is referred to as "copylefting", and it has upsides and downsides.
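To make the clause logic above a bit more concrete, here is a minimal Python sketch that maps a combination of CC clauses (or the CCZero waiver) onto the three rights from point #2. The names `Terms`, `terms_for`, and `is_open` are made up for this illustration, and the rules are only my simplified reading of the clauses as described in this post. It is not legal advice and it is not an official Creative Commons tool.

```python
from dataclasses import dataclass

# Clause codes as described in the text above (hypothetical helper, not a CC API).
KNOWN_CLAUSES = {"BY", "ND", "NC", "SA"}


@dataclass(frozen=True)
class Terms:
    reuse: bool
    modify: bool
    redistribute: bool
    conditions: tuple  # extra obligations or restrictions attached to the rights


def terms_for(clauses=(), waiver=False):
    """Summarize the three rights for a set of CC clauses, or for a CCZero waiver."""
    if waiver:
        # A waiver gives up the rights the law would grant, so nothing is held back.
        return Terms(True, True, True, ())
    unknown = set(clauses) - KNOWN_CLAUSES
    if unknown:
        raise ValueError(f"unknown clause(s): {sorted(unknown)}")
    conditions = []
    if "BY" in clauses:
        conditions.append("cite the source")
    if "NC" in clauses:
        conditions.append("non-commercial use only")
    if "SA" in clauses:
        conditions.append("redistribute under the same license")
    return Terms(
        reuse=True,
        modify="ND" not in clauses,  # ND removes the right to make derived products
        redistribute=True,
        conditions=tuple(conditions),
    )


def is_open(terms):
    """Assumption for this sketch: Open means all three rights, for everyone."""
    all_three = terms.reuse and terms.modify and terms.redistribute
    return all_three and "non-commercial use only" not in terms.conditions


if __name__ == "__main__":
    for label, terms in [
        ("CC-BY", terms_for(["BY"])),
        ("CC-BY-ND", terms_for(["BY", "ND"])),
        ("CC-BY-NC", terms_for(["BY", "NC"])),
        ("CCZero", terms_for(waiver=True)),
    ]:
        print(label, "open" if is_open(terms) else "not open", terms.conditions)
```

In this reading, CC-BY and CCZero both keep all three pillars intact, while ND and NC each take something away: ND for everyone, NC for whoever counts as "commercial".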

I stress that in the case of licenses, no IP is reassigned and the producers of the data remain the owners of the IP.

At a recent NanoSafety Cluster meeting I gave a presentation about these matters and the slides are available here.