Tuesday, August 28, 2012

Errors in Web of Science: anecdotal data

Because I needed to update my CV and publication list, I double-checked my own list against my online profiles at CiteULike, Mendeley, Google Scholar, and ResearcherID (derived from Web of Science). The former two I have control over, and they are significantly more accurate. Peter Murray-Rust tweeted this week that Web of Science only covers some 3% of the literature; I do not know about that figure, but in the life sciences the coverage is higher. It is still far enough from 100%, though, that for me too not all my research output is captured by Web of Science.

Still, we all know that Web of Science is crucial to our future, because too many institutes use data derived from that database when evaluating their employees.

In that respect, I would love an equivalent of Retraction Watch: JIF-use Watch, blogging anecdotal stories about universities, funding agencies, and the like that misuse the journal impact factor in research(er) evaluations.

Anyway, back to Web of Science. I blogged recently about how it matches up to Google Scholar, and found a near-linear relation between the citation counts, at least partly explained by Google Scholar having a higher coverage of the literature and Google Scholar being more up-to-date. In this post, I will focus on another aspect: error. Google Scholar captures a certain amount of non-literature, and so far it is hard to filter that out (though I trust the Scholar team to come up with a solution for that).

But one should not think that Web of Science is free of error. Despite the long delays, I doubt they actually have people doing the extraction. Why do I think that? Because they miss obvious citation links. For example, I just filed correction requests for two papers that cite the Bioclipse 2 paper but were not linked to the entry for that Bioclipse paper in their database.

Of course, it is easy to make mistakes in the lists of references we put in our papers. Indeed, since we commonly have to enter that information manually, or with the help of tools, mistakes creep in. For example, in our CHEMINF paper we miscited a ChEBI paper that appeared on the NAR website in 2009 but was printed in 2010 and got that year in its final citation information (if only publishers would just accept a list of DOIs):

And, as you can see, six other papers make that mistake. That means that the citation count for the paper is not 53 but 60. That is quite a deviation indeed. And a quick browse of my recent papers shows that the problem of unlinked citations is not uncommon.

For example, when I look at our recent Oscar4 paper, I find a variety of problems besides a miscite of Paula's paper: 15 more references with mistakes, such as the cited resource's title being taken as the source title, a resource incorrectly listed as published in 1914, etc.

The best I can do is send in correction suggestions. Unfortunately, they are not always accepted, and the CHEMINF paper still occurs twice in the database. I must be clear that this is a very limited study, but I doubt these issues are really specific to my situation. However, this creates a new aspect of "gaming": researchers who spend more time fixing Web of Science will get higher bibliometric scores. Same for journals, BTW.

If only academic institutes would consider Thomson's work the same way countries treat financial ratings like triple-A: they are opinions (strictly speaking, I'd say they are measurements, and as such have measurement errors, and we all report experimental errors, right?). Did you know that the Moody's of this world that hand out those ratings themselves insist they are merely opinions?

Wouldn't it be fun if JIFs were supplied with a standard deviation? JChemInf: 5 +/- 3, Nature: 36 +/- 26. Cool, wouldn't it be? Same for Web of Science / Google Scholar citation counts, such as for the first CDK paper: 198 +/- 13. My H-index: 13 +/- 2. Technically, this is trivial; the problem is purely political.


  1. These problems are widespread and only doing a lot of cited reference searches will bring all the citing papers together.

    There is a larger problem in other disciplines where authors/editors have been using a "standard" set of journal abbreviations that have turned out to not be standard at all. (e.g. JHEP, JCAP, ApJ, etc.) In these cases WoS is missing vast swathes of citations. A few librarians have been tracking this and Thomson are working on it but there is no timeline for its resolution.

    1. I do not mind error in the database; that is indeed inevitable. More important is that people must not think WoS is the holy grail... Moreover, more transparency by Thomson, e.g. about timelines, would be very helpful there!

    2. Well, I *do* mind error, but error can be fixed. What cannot be fixed with data changes, is the misinterpretation because one does not take into account the error.

  2. My case is worse because my name has three parts. Sometimes my papers are cited with the right name; sometimes the first part of my name is taken as my middle name and only the other two parts are used. As you can imagine, I know of many papers that cite my work with the wrong name and are not included in my citation count. This is true both for Thomson's Web of Knowledge and for Google Scholar (though less frequently in Scholar). I tried to fix the problem, but there is actually no way to do so.


    1. The ORCID initiative will help there. Check out

  3. Even though the IF and H-index are flawed, if a metric describes a mean value it should have a standard deviation associated with it if you really want to compare it with another index. I assume the standard deviation (SD) is shot noise and just use the square root of the index:
    SD(IF) = sqrt(IF)
    The same goes for the H-index.
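The shot-noise estimate in the comment above is trivial to compute. A minimal sketch in Python, assuming a Poisson (shot-noise) error model; the metric values are the illustrative figures from the post, not real data:

```python
import math

def shot_noise_sd(metric):
    """Shot-noise (Poisson) estimate of the standard deviation
    of a count-based metric: SD = sqrt(metric)."""
    if metric < 0:
        raise ValueError("metric must be non-negative")
    return math.sqrt(metric)

# Illustrative figures from the post above:
for label, value in [("JIF, JChemInf", 5),
                     ("citations, first CDK paper", 198),
                     ("H-index", 13)]:
    print(f"{label}: {value} +/- {shot_noise_sd(value):.1f}")
```

Under this model sqrt(198) ≈ 14, close to the +/- 13 the post suggests for the first CDK paper; whether shot noise is the right error model for citation counts is, of course, an assumption.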