
Saturday, September 04, 2010

Data duplication at Mendeley

Earlier this year I gave Mendeley a try, after having been a happy JabRef user, an unhappy Connotea user (the main problem was that any URI can be bookmarked, not just papers, so it is very noisy), and a happy CiteULike user (which I still am). But the client did not bring me what I needed, and I canceled my account again.

Since then, Mendeley has undergone a transformation, and there is talk about OpenSourcing the client (or not), Open Data, and an Open Standard API. But, importantly, I no longer need the client and can do everything in the browser.

Moreover, Mendeley has momentum and is starting to provide interesting apps around the API, such as readermeter.org. And since being a scientist means playing the publishing game, one simply must add one's papers to these systems, if only to advertise them:

This brings us to problem #1: author identity, which is a general problem and addressed by projects like ORCID. So, besides the page shown above, I have a second page under an entry with just my first name.

But, as the title of the post suggests, Mendeley suffers from a second problem, which was recently brought up by Duncan in his How many unique papers are there in Mendeley? post. Mendeley, apparently, claims 36M papers, but the number of unique papers is much smaller, as outlined in detail by Duncan. Mr. Gunn replied that "[d]uplicates are understandably enriched among the popular papers, such as yours, and it’s harder to go from 6 duplicates to 1 canonical document than from 2 to one, because the variability is higher" (see this comment), but I do not buy that.

I replied on the blog to that claim and also made a suggestion: this dereplication should really be a crowd-sourcing event. However, I found it impossible to find a place to report duplication, so I had to use the message-to-support form with the uninformative category 'Other'. If I were working at Mendeley, I would make this reporting a key technology behind their dereplication efforts.

Anyway, the duplication goes deep, very deep into the long tail. My papers are fairly well received in general (many of my papers in BMC journals are 'Highly Accessed'; I did request some distinction there, using the StackOverflow gold, silver, bronze system), but they are not comparable to the highly bookmarked papers in Mendeley. I know this is probably not something Mendeley likes to hear, but even there the duplication is widespread: a majority of my papers show duplicates. A semi-exhaustive scan showed me duplication for the XMPP paper (here and here), the Blue Obelisk paper (here, here, and here; yes, three copies), the CDK-Taverna paper (here and here), the Bioclipse 2 paper (here and here), the userscripts paper (here and here), the CDK I paper (here and here), and the CDK II paper (here and here).

Hopefully, by the time you read this post, at least some of the above links no longer work. In that respect, I would also like to request URIs based on the DOI instead.
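To illustrate why a DOI-based identifier would help, here is a minimal Python sketch of how duplicate records could be grouped: by normalized DOI where one is present, and by a crudely normalized title otherwise. This is not how Mendeley does it (I do not know their method); the record layout (`id`, `doi`, `title` keys) and all function names are assumptions made up for this example.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_doi(doi):
    """Lower-case a DOI and strip common resolver or 'doi:' prefixes."""
    doi = doi.strip().lower()
    return re.sub(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi)


def normalize_title(title):
    """Drop accents, case, and punctuation so trivially different titles match."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def duplicate_key(record):
    """Prefer the DOI as identity; fall back to the title (hypothetical record layout)."""
    doi = record.get("doi")
    if doi:
        return ("doi", normalize_doi(doi))
    return ("title", normalize_title(record.get("title", "")))


def group_duplicates(records):
    """Return only those groups that contain more than one record."""
    groups = defaultdict(list)
    for record in records:
        key = duplicate_key(record)
        if key[1]:  # skip records with neither a DOI nor a usable title
            groups[key].append(record)
    return {key: recs for key, recs in groups.items() if len(recs) > 1}
```

Note the obvious limitation: a copy with a DOI and a copy without one end up under different keys, so they are not matched. That kind of case is exactly where crowd-sourced duplicate reports would be useful.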

8 comments:

  1. I think it would be interesting to look further at the relationship between popularity and duplication, but let's not get caught up in trying to estimate numbers for something that's changing so rapidly.

    We've begun to address the existing dupes, and, as you might have guessed, we are also looking to crowdsource the Dupuis detection. There's a working demo of this already I can show if you run into me.

  2. Very much looking forward to that, particularly now that Mendeley seems to become more Open every day!

  3. now, that's a bit embarrassing, but I'll definitely try to merge alternate spellings as of the next upgrade

  4. These duplications are difficult. Some can be caught, but there should also be good, easy means to have the Social Web remove duplication, both in paper space and in author space.

  5. Egon, glad to see I'm not the only one with duplicates, and didn't realise it was a "known issue" to the extent you describe. BTW identifying authors accurately is even harder than identifying individual papers - see http://pubmed.gov/20072710

  6. @Duncan: that databases need curation is known; the actual errors can differ from one database to another; citation databases suffer from duplication.

    Regarding the author identity, yes, that is harder. The same combination of initial and last name may belong to different authors. But this is where Mendeley's database comes in, which has a 'My Publication' section; I'd say they have all the technical means to address author identity.

    @Mr Gunn: is Mendeley formally involved in the ORCID effort, or going to implement it anyway?

  7. Yes, Mendeley is a participating member in ORCID. We plan to support ORCID in Mendeley as well, although it's a little too early yet to say how it will be implemented.

    (You can tell I wrote my earlier comment from my phone because Android adds the names of people in your contacts to the autocomplete dictionary, hence the autocorrect of "dupe" to "Dupuis", as in science librarian John Dupuis.)
