Saturday, June 02, 2018

Supplementary files: just an extra, or essential information?

I recently had a discussion about supplementary information (additional files, etc):

The Narrative
Journal articles have evolved in an elaborate dissemination channel focusing on the narrative of the new finding. Some journals focus on recording all the details to reproduce the work, while others focus just on the narrative, overviews, and impact. Sadly, there are no databases that tell you which journal does what.

One particularly interesting example is Nature Methods, a journal dedicated to scientific methods... one would assume the method to be an important part of the article, right? No, think again, as seen in the PDF of this article (doi:10.1038/s41592-018-0009-z):

Supplementary information
An intermediate solution is the supplementary information part of publications. Supplementary information, also called Additional files, are a way for author to provide more detail. In an proper Open Science world, it would not exist: all the details would be available in data repositories, databases, electronic notebooks, source code repositories, etc, etc. The community is moving in that direction, and publishers are slowly picking this up (slowly? Yes, more than 7 years ago I wrote about the same topic): supplementary information remains.

But there are some issues with supplementary information (SI), which I will discuss below. But let me first say that I do not like this idea of supplementary information at all: content is either important to the paper, or it is not. If it is just relevant, then just cite it.

I am not the only one who finds it not convenient that SI is not integral part of a publication.
Problem #1: peer review
A few problems is that it is not integral part of the publications. Those journals that still do print, and there likely lies the original of the concept, do not want to print all the details, because it would make the journal too thick. But when an article is peer-reviewed, content in the main article is better reviewed than the SI. Does this cause problems? Sure! Remember this SI content analysis of Excel content (doi:10.1186/s13059-016-1044-7)?

One of the issues here is that publishers do not generally have computer-assisted peer-review / authoring tools in place; tools that help authors and reviewers get a faster and better overview of the material they are writing and/or reviewing. Excel gene ID/data issues can be prevented, if we only wanted. Same for SMILES strings for chemical compounds, see e.g. How to test SMILES strings in Supplementary Information but there is much older examples of this, e.g. by the Murray-Rust lab (see doi:10.1039/B411033A).

Problem #2: archiving
A similar problems is that (national/university) libraries do not routinely archive the supplementary information. I cannot go to my library and request the SI of an article that I can read via my library. It gets worse, a few years ago I was told (from someone with inside info) a big publisher does not guarantee that SI is archived at all:

I'm sure Elsevier had something in place now (which I assume to be based on Mendeley Data), as competitors have too, using Figshare, such as BioMed Central:

But mind you, Figshare promises archival for a longer period of time, but nothing that comes close to what libraries provide. So, I still think the Dutch universities must handle such data repositories (and databases) as essential component of what to archive long time, just like books.

Problem #3: reuse
A final problem I like to touch upon is very briefly is reuse. In cheminformatics it is quite common to reuse SI from other articles. I assume mostly because the information in that dataset is not available from other resources. With SI increasingly available from Figshare, this may change. However, reuse of SI by databases is rare and also not always easy.

But this needs a totally different story, because there are so many aspects to reuse of SI...