Tuesday, August 28, 2012

Errors in Web of Science: anecdotal data

Because I needed to update my CV and publication list, I double-checked my own list against my online profiles at CiteULike, Mendeley, Google Scholar, and ResearcherID (derived from Web of Science). I have control over the former two, and they are significantly more accurate. Peter Murray-Rust tweeted this week that Web of Science only covers some 3%; I do not know about that, but in the life sciences it is more. Still, it is far enough from 100% that, for me too, not all my research output is captured by Web of Science.

Still, we all know that Web of Science is crucial to our future, because too many institutes use data derived from that database as part of their evaluation of their employees.

In that respect, I would love an equivalent of Retraction Watch: JIF-use Watch, blogging anecdotal stories about universities, funding agencies, and the like that misuse the journal impact factor in research(er) evaluations.

Anyway, back to Web of Science. I blogged recently about how it matches up to Google Scholar, and found a near-linear relation between the citation counts, at least partly explained by Google Scholar covering more of the literature and being more up-to-date. In this post, I will focus on another aspect: error. Google Scholar captures a certain amount of non-literature, and it is still hard to filter that out (though I trust the Scholar team to come up with a solution for that).

But one should not think that Web of Science is free of error. Despite the long delays, I doubt they actually have people doing the extraction. Why do I think that? Because they miss obvious citation links. For example, I just filed correction requests for two papers that cite the Bioclipse 2 paper but were not linked to the entry for that paper in their database.

Of course, it is easy to make mistakes in the lists of references we put in papers. Indeed, since we commonly have to enter the information manually, or with the help of tools, mistakes creep in. For example, in our CHEMINF paper we miscited a ChEBI paper which appeared on the NAR website in 2009, but got printed in 2010 and got that year as its final citation information (if only publishers would just accept a list of DOIs):

And, as you can see, six other papers make that mistake. Counting ours, that is seven unlinked citations, which means that the citation count for the paper should be 60 rather than 53. That is quite a deviation indeed. And a quick browse through my recent papers shows that the problem of unlinked citations is not uncommon.
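To put a number on that deviation, here is a minimal Python sketch. It assumes the seven unlinked citations are the six other papers plus our own miscite; the function name `undercount` is only illustrative.

```python
def undercount(linked: int, unlinked: int) -> tuple[int, float]:
    """Return the corrected citation count and the fraction of citations missed."""
    corrected = linked + unlinked
    return corrected, unlinked / corrected


# Assumption: 53 linked citations, 7 unlinked (six other papers plus our own).
corrected, missed = undercount(linked=53, unlinked=7)
print(f"corrected count: {corrected}, missed: {missed:.1%}")
# prints "corrected count: 60, missed: 11.7%"
```

In other words, under that assumption more than one in ten citations to this single paper goes uncounted.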

For example, when I look at our recent Oscar4 paper, I find a variety of problems. Besides a miscite of Paula's paper, there are 15 more references where mistakes are made, such as taking the cited resource's title as the source title, a resource incorrectly described as published in 1914, etc.

The best I can do is send in correction suggestions. Unfortunately, they are not always accepted, and the CHEMINF paper still occurs twice in the database. I must be clear that this is a very limited study, but I doubt that these issues are really specific to my situation. However, this creates a new aspect of "gaming": those researchers who spend more time on fixing Web of Science will get higher bibliometric scores. Same for journals, BTW.

If only academic institutes would treat Thomson's work the way countries treat credit ratings like triple-A: as opinions (strictly speaking, I'd say they are measurements, and as such have measurement errors, and we all report experimental errors, right?). Did you know that the Moody's of this world, who hand out those ratings, themselves insist they are merely opinions?

Wouldn't it be fun if JIFs were supplied with a standard deviation? JChemInf: 5 +/- 3, Nature: 36 +/- 26. Cool, wouldn't it? Same for Web of Science / Google citation counts, such as for the first CDK paper: 198 +/- 13. My H-index: 13 +/- 2. Technically trivial. This is purely a political problem.
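To illustrate how trivial, here is a minimal Python sketch of an H-index with error bounds. The citation counts are made up, and the uniform relative error of 12% on every count is an assumption for illustration only, not a measured value; `h_index_range` is a hypothetical helper, not part of any bibliometrics library.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def h_index_range(citations, relative_error):
    """H-index for the reported counts, plus lower and upper bounds
    when every count is off by the same relative error (an assumption)."""
    low = h_index([c * (1 - relative_error) for c in citations])
    high = h_index([c * (1 + relative_error) for c in citations])
    return low, h_index(citations), high


# Hypothetical citation counts, not taken from any real profile.
counts = [120, 60, 41, 30, 22, 15, 12, 9, 8, 7, 6, 5, 3, 1]
low, reported, high = h_index_range(counts, relative_error=0.12)
print(f"H-index: {reported} (between {low} and {high})")
```

A real error model would differ per paper and per database, but even this crude version shows that a single reported number hides a range.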