Thursday, March 24, 2011

Supplementary files, publishing, and standards

The publishing world is slowly changing. A small community has been screaming for more than a decade now (and possibly longer) that data standards in publishing are inadequate. PDF has not helped (fortunately there are replacement initiatives). Even new journals do not do everything right from the start, but at least there is the effort, such as I discussed in these posts:
This week BioMed Central's Iain asked the community how to put their Open Data initiative into practice. There are some good points in the write up, such as:
    Editors and publishers are acutely aware of the limited pool of peer reviewers who are increasingly called upon to help try and ensure the integrity of the published record. The online availability of research data as a supplementary (additional) files has prompted debate about the role of peer review in this non-written material, and indeed the role of journals in publishing this material.
I can very much relate to this problem, and spent about an hour this morning reviewing Additional files to a paper I was reviewing for BioMed Central. And I had quite a few comments, and overall, the section was inadequate.

I tried to reply in the blog, but my comment was marked as SPAM because it had more than 1000 characters (update: this seems to have been manually fixed now, thanx!). So much for constructive comments :) So, here goes:
    Dear Iain,

    thank you for this interesting and important post! I absolutely agree that some standards need to be set. Scientists have been unable to do this, and publishers can distinguish themselves from the competition by doing this right.

    Without going into detail about what 'right' is (I have very strong opinions on that :), what is important for BioMedCentral right now is to put the advantages so closely in front of scientists that they can no longer ignore them, or say 'whatever' (which they do now).

    BMC must therefore demonstrate what this reuse, reproducibility, etc, practically means. So, Goal 0 must be: do something with the 'additional files': process them yourself and 1) associate every single additional file with facts about that file; 2) index them, and create a search engine to search additional files based on their content, *across* all BMC journals; 3) provide alternative download formats, showing what it means to use Open Standards.

    Open Data is not the goal; it's the means to do science better.

    About 1. Every additional file should have a separate web page (or page section), listing not just size, but also the exact format (MS-Excel 2000, rather than 'Excel'... versioning matters!), metadata present in that file (author, creation date, whether it has macros defined, etc), and statistics about that file (number of sheets in the spreadsheet, number of filled cells, etc).

    About 2. It is of utmost importance that we can discover this supplementary information, and it must be easy to search for stuff using free text (e.g. I want to find all additional files across all BMC journals that have 'tryptamine' somewhere in the additional file, even if that information is stored in Excel files *inside* zip files). Current technology, such as Strigi, makes that very easy.

    About 3. As reuse is the key here, the use of Open Standards is important. This could be stressed by showing that files in Open Standard formats can easily be interconverted, such as spreadsheets into CSV or HTML tables. Just offering alternative download formats makes the 'Additional files' more useful, and encourages the authors to ensure that they provide data in the right formats.
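
To make points 2 and 3 a bit more concrete, here is a minimal sketch using only the Python standard library. It is not how BMC or Strigi do it. The function names, the file names, and the example content are all made up for illustration, and it only looks inside plain-text members of a zip file; real Excel files would need a format-specific parser on top of this.

```python
import csv
import html
import io
import zipfile


def find_term_in_zip(zip_bytes, term):
    """Return names of archive members whose text contains `term` (case-insensitive)."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Decode leniently; binary formats such as .xls need a real parser instead.
            text = archive.read(name).decode("utf-8", errors="replace")
            if term.lower() in text.lower():
                hits.append(name)
    return hits


def csv_to_html_table(csv_text):
    """Convert CSV text into a minimal HTML table, escaping the cell contents."""
    lines = ["<table>"]
    for row in csv.reader(io.StringIO(csv_text)):
        cells = "".join(f"<td>{html.escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


def make_example_zip():
    """Build a small in-memory zip standing in for a hypothetical 'Additional file 1'."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("compounds.csv", "name,note\ntryptamine,example row\n")
        archive.writestr("readme.txt", "Nothing relevant here.\n")
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_example_zip()
    print(find_term_in_zip(data, "tryptamine"))
    print(csv_to_html_table("name,note\ntryptamine,example row\n"))
```

Running the search over every additional file of every journal, and storing the hits in an index, would be the first step towards the search engine from point 2. The CSV to HTML conversion is the kind of alternative download format meant in point 3, and it only works this easily because CSV is an open, trivially parseable format.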