Saturday, September 04, 2010

Data duplication at Mendeley

Earlier this year I gave Mendeley a try, having been a happy JabRef user, an unhappy Connotea user (the main problem being that any URI can be bookmarked, not just papers, which makes it very noisy), and a happy CiteULike user (which I still am). But the client did not give me what I needed, and I canceled my account again.

Since then, Mendeley has undergone a transformation, and there is talk about OpenSourcing the client (or not), Open Data, and an Open Standard API. But, importantly, I no longer need the client and can do everything in the browser.

Moreover, Mendeley has momentum and is starting to provide interesting apps around the API. And since being a scientist means playing the publishing game, one simply must add one's papers to these systems, if only to advertise them:

This brings us to problem #1: author identity, which is a general problem and addressed by projects like ORCID. So, besides the page shown above, I have a second page under an entry with just my first name.

But, as the title of the post suggests, Mendeley suffers from a second problem, which was recently brought up by Duncan in his How many unique papers are there in Mendeley? post. Mendeley, apparently, claims 36M papers, but the number of unique papers is much smaller, as Duncan outlines in detail. Mr. Gunn replied that "[d]uplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher" (see this comment), but I do not buy that.

I replied to that claim on the blog and also made a suggestion: this dereplication should really be a crowd-sourcing effort. But I found it impossible to find a place to report duplication, so I had to use the message-to-support form with the uninformative category Other. If I were working at Mendeley, I would make this reporting a key technology behind their dereplication efforts.

Anyway, the duplication goes deep, very deep into the long tail. My papers are fairly well received in general (many of my papers in BMC journals are 'Highly Accessed'; I did request some distinction there, using the StackOverflow gold, silver, bronze system), but they are not comparable to the highly bookmarked papers in Mendeley. I know this is probably not something Mendeley likes to hear, but even so, a majority of my papers show duplicates. A semi-exhaustive scan showed me duplication for the XMPP paper (here and here), the Blue Obelisk paper (here, here, and here; yes, three copies), the CDK-Taverna paper (here and here), the Bioclipse 2 paper (here and here), the userscripts paper (here and here), the CDK I paper (here and here), and the CDK II paper (here and here).

Hopefully, by the time you read this post, at least some of the above links no longer work. In that respect, I would also like to request URIs based on the DOI instead.
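To make the DOI suggestion concrete, here is a minimal Python sketch of how keying entries on a normalized DOI would surface duplicates. It is an illustration only, not how Mendeley works: the record layout (`id`, `doi`), the entry names, and the example DOIs are all hypothetical, and entries without a DOI are simply skipped rather than matched on title or authors.

```python
from collections import defaultdict

# Resolver and scheme prefixes that commonly precede a bare DOI.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Return a bare, lower-cased DOI (DOIs are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
    return doi


def group_duplicates(records):
    """Map each normalized DOI to the entry ids sharing it, keeping only
    DOIs that occur more than once. Entries without a DOI are ignored."""
    groups = defaultdict(list)
    for record in records:
        doi = record.get("doi")
        if doi:
            groups[normalize_doi(doi)].append(record["id"])
    return {doi: ids for doi, ids in groups.items() if len(ids) > 1}


# Hypothetical catalog entries; the ids and DOIs are made up.
records = [
    {"id": "entry-1", "doi": "10.1000/example.1"},
    {"id": "entry-2", "doi": "doi:10.1000/EXAMPLE.1"},
    {"id": "entry-3", "doi": "10.1000/example.2"},
    {"id": "entry-4", "doi": None},
]

duplicates = group_duplicates(records)
print(duplicates)
```

With DOI-based URIs, the two entries grouped here would resolve to the same page by construction. Papers without a DOI would still need another matching strategy, and that is where user-reported duplicates could help.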