Friday, February 28, 2014

Ignoring spammy publishers

We have all been there: a publisher that just ignores your requests to be removed from the informative emails but keeps sending you invitations for conferences and new fancy journals. Hard to imagine, but just imagine you are no longer interested in sharing your valuable expertise with this publisher... doable? If you have Gmail, it is.

Search for all emails from the domain you would like to ignore. I pick a random example domain for no particular reason: from:(*@omicsonline.*). Then, select the drop-down icon at the right side of the search box.

In the dialog that appears, select "Delete it" and "Also apply filter to...".
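If you want a feel for which sender addresses a wildcard pattern like the one above would cover, here is a minimal sketch. It uses Python's shell-style wildcard matching as a stand-in; that is an assumption for illustration, and Gmail's own matching rules may well differ. The sender addresses are hypothetical.

```python
from fnmatch import fnmatchcase

# The wildcard part of the search used above.
PATTERN = "*@omicsonline.*"


def would_match(sender: str) -> bool:
    """Return True if the sender address fits the shell-style wildcard pattern."""
    # Lowercase first, because email domains are case-insensitive.
    return fnmatchcase(sender.lower(), PATTERN)


# Hypothetical sender addresses, for illustration only.
for sender in ["editor@omicsonline.org", "news@omicsonline.com", "colleague@example.org"]:
    print(sender, would_match(sender))
```

The trailing wildcard is what catches the same publisher mailing from several top-level domains.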

Monday, February 24, 2014

Journal rankings and Expectations

Publishing to me is getting the message out. I want to contribute to a better world, via my work. I do this by making methods more precise and by reducing error. You probably spotted that theme in my publications. The impact of my work is reflected by how often people reuse my work (extend, use, ...). A flawed reflection of that is the number of citations of my publications; better would be the number of projects that use my work. For example, tools that use the CDK (which has many more authors!), like Bioclipse, AMBIT, LICCS (CDK in Excel), KNIME, and many more. Actually, the number of times these projects get cited indirectly also contributes to the impact of my work.

"We" don't count that. Instead, "we" tend to focus on the Journal Impact Factor (JIF). There is plenty of material around that discusses how flawed that is in assessment, particularly for assessing scientists. But I rather explain now why I still think about it. Because it is part of how I, our research group, our research institute, and our university is assessed. I do not have to agree with the overhyped role of the JIF, but I cannot go around it either. Well, actually, I can to some extend.

At the moment, I have identified the following methods by which our work is being assessed, e.g. by national funding agencies and organizations that want to ensure good science in The Netherlands.
  1. papers in journals with a JIF >= 5
  2. papers in journals that rank in the Journal Citation TOP1, TOP10, and TOP25
I have seen worse. I have attended universities where the JIF is directly involved in the calculation. I have also seen better, and have attended universities that do actually count the number of times my papers are cited, and I am still proud to have five papers in the top 5% most cited papers.

Now, when using these guidelines practically, you find some interesting facts. For example, only very few journals are TOP1. In fact, Science is not; it is (JIF >= 5 && TOP10). Not so many journals have a JIF >= 10, but >= 5 is not that uncommon. The biggest problem here is that this cutoff is field specific: >= 5 is trivial in biology, but hard in chemistry.

Combine that, and you get a situation where many Open Access journals (and OA matters to me) are JIF >= 5 && TOP10, putting them, for the rankings, on par with Science.

Now, it gets better. Taking the above assessment rules into account, have a look at PLoS Biology. It has a JIF >= 5 (>= 10 even) and is TOP1! Thus, it outranks Science. Of course, there is also the implicit rule that Nature and Science go above all, but maybe the assessors should start to think about what they really want to care about, and what they really should be expecting from scholars.
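To make that comparison explicit, here is a small sketch of the two criteria as a sort key. How the assessors actually combine the criteria is not something I know; checking JIF >= 5 first and the citation tier second is my assumption, and the `Journal` class and the unnamed OA journal are hypothetical. Only the Science and PLoS Biology classifications come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Citation tiers from the list above, best first.
TIER_RANK = {"TOP1": 0, "TOP10": 1, "TOP25": 2}


@dataclass(frozen=True)
class Journal:
    name: str
    jif_at_least_5: bool   # criterion 1
    tier: Optional[str]    # criterion 2; None if outside the TOP25


def assessment_key(journal: Journal) -> Tuple[int, int]:
    """Smaller tuples rank higher (assumed order: JIF threshold, then tier)."""
    jif_part = 0 if journal.jif_at_least_5 else 1
    tier_part = TIER_RANK.get(journal.tier, len(TIER_RANK))
    return (jif_part, tier_part)


science = Journal("Science", True, "TOP10")
plos_biology = Journal("PLoS Biology", True, "TOP1")
oa_example = Journal("Some OA journal (hypothetical)", True, "TOP10")

# Print the journals from highest to lowest ranked under these rules.
for journal in sorted([science, oa_example, plos_biology], key=assessment_key):
    print(journal.name, assessment_key(journal))
```

Under these rules the hypothetical OA journal gets exactly the same key as Science, and PLoS Biology sorts ahead of both.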

Now, go read my latest preprint instead of wasting your time on rankings.

Friday, February 21, 2014

Slow publishing innovation: SMILES in ACS journals

Elsevier is not the only publisher with a large innovation inertia. In fact, I think many large organizations have it, particularly if there are too many interdependencies, causing lines that are too long. Greg Landrum made me aware that one American Chemical Society journal is now going to encourage (not require) machine readable forms of chemical structures to be included in their flagship. The reasoning by Gilson et al. is balanced. It is also 15 years too late. This question was relevant at the end of the last century. The technologies were already more advanced than what will now be adopted. 15 years!!! Seriously, that's close to the time it takes to bring a new drug to the market!

Look at what they suggest and think about it. Include SMILES strings for structures in the paper. I very much welcome this, of course, even though I am not a big fan of SMILES at all. They could have said something about OpenSMILES too, which is more precise. They do say something about the InChI and InChIKey, but not that the SMILES string can more precisely reflect the drawing. I wonder why they don't go for a format that can actually capture the image, like CML or an MDL molfile. Then again, a SMILES copy/pastes so nicely. Talking about slow innovation. There is zero technical reason you could not copy/paste an MDL molfile into a spreadsheet (and you can with many tools, in fact...)

Now, I still have tons of questions. What tool will be used to validate the correctness and absence of ambiguity before the publication? Will the SMILES strings be validated at all? And at what level? Will it have to be compatible with particular tools? Does it have to be compatible with OpenSMILES? Under what license will these SMILES be available (can we data mine DOI-SMILES links and openly share them)? What was the reasoning for finally adopting this? Will the journal also accept submissions where both SMILES and other formats are provided? Will they accept or deny SMARTS strings (e.g. for Markush structures)?

All in all, I second the others, and am happy to see this step. I do hope they do not stop here and wait again 15 years for another step. In fact, they ask for input. That is doubly promising!

Gilson MK, Georg G, & Wang S (2014). Digital Chemistry in the Journal of Medicinal Chemistry. Journal of Medicinal Chemistry. PMID: 24521446

Saturday, February 15, 2014

Elsevier's new text mining initiative is a step sideways

Elsevier's new ideas on text mining are getting a lot of attention now. Sadly, they get it wrong, again. On the bright side, all other publishers, which are expected to follow this year, can learn from this mistake.

Because if done right, the publishers can even help forward science, despite crippling progress. That sounds harsh, and surely they have done a lot of good for science. In fact, we would not be where we are now without the publishers. But things have changed. With the internet anyone can be a publisher. We see this with blogs. And, unlike what some misinformed people think, this is independent from peer review. Publishers were important because they provided a channel to disseminate knowledge. But paper publishing is no longer the most efficient way. In fact, in terms of value, paper has been overtaken for some years now.

And we need more added value. Not the shipping of the knowledge, but keeping up is the issue. And there too, publishing is inefficient: human language is nice for sharing ideas and concepts, but it fails at disseminating raw facts: measured data. Anyone who has tried creating a data set to find patterns knows this: extracting the information is a lot of effort, mostly caused by the broken paper publishing model. This is most apparent in some research domains where data repositories exist, but sadly this applies to a small minority of data types.

Now, text mining seems in that sense the wrong question: why try to recover knowledge that should have gone into repositories in the first place? I agree. However, we cannot just throw away all the knowledge kept in these papers, and certainly not as long as people keep insisting on seeing only papers as scientific success. We are slowly seeing this improve, but only very slowly. Things that were apparent to me as a student 20 years ago are the things that scholars are still struggling with today. Depressing indeed, but it does help you grow a good sense of patience.

And now, Elsevier wants to make a step forward, wants to be leading in science dissemination again. And they come up with an intermediate solution between actual knowledge dissemination and profit: they come up with a license model, increasing their monopoly on knowledge and trying to lure the scientist into a non-commercial license. From a money-making perspective this is what society expects from them. For someone who likes to see societal problems solved, this is disappointing. They had a great opportunity to lead the field.

Now, is it all bad? Not at all. It's a step, but not the step I would have liked to see. It will be a success, because the CC-BY-NC data that will come out of it will be part of the web of knowledge. No one will care about the NC part, except all those SMEs in Europe that work on products to help society, which will find it much harder to collaborate with other companies, because they cannot share the knowledge they created from analyzing the literature (does Elsevier want a monopoly in this analysis?).

Nor will many in the academic community complain. Surely, those that have worried about this will. But the scholars at universities do not care about NC licenses. After all, universities are not commercial. Asking a student to pay 30 thousand euro for a year is surely not commercial. That is the consensus. But I note that this consensus has not been tried in court, and I am looking forward to the day it will happen. Elsevier will likely not challenge this, and silently accept this situation. Just like Microsoft never made a big deal out of people copying office versions of their operating system for use at home: you do not bite the hand that feeds you (too hard). You rather go after others. It will not be scholars Elsevier will enforce the NC on, and it will not be large companies either: if any, it will be the SMEs. Support them, and do not agree with the license.

Well, it was a nice opportunity for Elsevier. I only see my choice to sign The Cost of Knowledge reaffirmed.

The choice of the NC clause is totally useless in any context of dissemination. I call for Elsevier to at least add this option, if they are serious about improving: text mining is provided to subscribers, via a decent API, adhering to:
  1. Facts extracted from literature are licensed CCZero and attribution is paid (facts are copyright free in most parts of the world)
  2. Output can contain "snippets" of the original text under international "fair use" concepts, and licensed as CC-BY
Any scientist is expected to attribute the source of information in the first place, and it is kind of sad Elsevier is on such a bad footing with their audience that they feel this must be enforced via a contract, but that is not a problem. I also see no reason to deviate from international law about "fair use"; I do understand this is probably an ill-defined concept, but 200 characters seems pretty limited to me, as facts can be spread over sentences longer than this.

I know that many will disagree on the CCZero license, and many will feel awkward about giving away data. It has value, right? It's your property, right? I am not going to argue against that. But personally I do not understand how it aligns with the idea of scientific dissemination. Holding back knowledge as part of making knowledge available? How exactly does that make sense? Importantly, just like with software, Open is not the same as Without-Cost! Hosting and sharing Open Data also costs money (particularly, if it is 1 TB of data). Those are different concepts.

However, I also stress that the scholars have a great responsibility here: I call for all Elsevier journal editorial boards to not accept this deal either. In fact, all editorial boards have great say in this: it's them who make a journal valuable. I also call on all scholars to be aware of the consequences of selling away your copyright. That is a choice in the current era. There are plenty of means to disseminate your science *without* (much) cost, and APC is a flawed argument.

The current step by Elsevier, after all the effort from many, is not a step forward, it's a step sideways. Elsevier, I know you can do better. Are you willing?

I am willing, and have been supporting science by making data available as CCZero. However, I am also happy if others are not ready for this, or have other reasons not to. It is not always under their control. For example, I have heard stories where data has been used by politicians as small change to get industry to test their products for safety. I also accept that getting funding as a scholar is hard work, often not paid for, and that it is hard to give away your only security of a future career. Then again, we all know what data is valuable, has already given its value, or is of no use to you anymore. And in this latter case I ask you to consider making data available: data of no use to you anymore, but that could be valuable to others. Make it available, and get cited, and get value out of it that you would not have received when it sat on some hard disk, where it will probably be lost in five years.

I also fully understand this is my opinion. Thus, not all data I make available is CCZero: I fully respect copyright and licenses from others; in fact, I often feel I do so much more than scientists who object to Open licenses, who just take data as their own as they please. That is why I often insist on clear copyright and license information. Because if it is missing, default (local) law applies.

If you want to read more analysis, please refer to the following posts:
  1. Elsevier opens its papers to text-mining
  2. #elsevier’s TDM Terms (TaC): Can they force us to copyright data? (2)
  3. Nature’s recent “news” article on Text and Data Mining was unacceptable [redacted]; I ask them to renounce licensing.
  4. "Dear Peter,", Richard van Noorden
  5. Reply to Richard van Noorden

Saturday, February 08, 2014

How Open Access are you?

Open Science is not just the right thing for science, it is also just fun. It enables collaborations, quick solutions by working together with experts, and actually solving questions you had not planned on solving. Part of Open Science is about making it as easy as possible for others to extend your work. Does someone have to buy a full subscription to access your research, or spend 50 euro for your single paper? If you ask the author, you likely get a reprint, though some publishers are not happy about that. Furthermore, you may just need that paper now, and the author has a well-deserved holiday.

Interestingly, 100 years ago, paper publishing was the quickest and most efficient way to let the world know about your discoveries. Well, that was 100 years ago. Traditional publishing has fallen behind, and is by now the slowest way to disseminate your discoveries. About anything is faster now. Hell, commonly giving the presentation is faster (which paper has not seen a conference contribution about that topic earlier?).

And then the sharing of the discovery. Can someone share your finding locally in a presentation? Can it be used in slides for a course? Again, you can ask and buy such permission with Closed Access publications. But it is just inefficient. This slowdown costs a lot of money that could be spent on new research instead.

So, when someone comes up with a way to make publishing more efficient, I welcome that. And to set a good example, I now try to publish Open Access myself as much as possible (and sometimes that goes wrong.) And, I also welcome incentives, even if small, to promote OA publishing. And that's what ImpactStory did with a nice, small award banner. And I'm happy that enough of my papers are available to get the gold award, though I have to cheat a bit with a few green access papers.

Also welcome is the improved ACS AuthorChoice program, which allows you to make past papers available under a CC-BY license. This allows me to fix mistakes from the past. More on that in a while...