Wednesday, August 08, 2018

Green Open Access: increase your Open Access rate; and why stick with the PDF?

Icon of Unpaywall, a must have
browser extension for the modern
Researchers of my generation (and earlier generations) have articles from the pre-Open Access era. Actually, I have even be tricked into closed access later; with a lot of pressure to publish as much as you can (which some see as a measure of your quality), it's impossible to not make an occasional misstep. But then there is Green Open Access (aka self-archiving), a concept I don't like, but is useful in those situations. One reason why I do not like it, is that there are many shades of green, and, yes, they all hurt: every journal has special rules. Fortunately, the brilliant SHERPA/RoMEO captures this.

Now, the second event that triggered this effort was my recent experience with Markdown (e.g. the eNanoMapper tutorials) and how platform like GitHub/GitLab built systems around it to publish this easily.

Why this matters to me? If I want to have my work have impact, I need people to be able to read my work. Open Access is one route. Of course, they can also email me for a copy the article, but I tend to be busy with getting new grants, supervision, etc. BTW, you can easily calculate your Open Access rate with ImpactStory, something you should try at least once in your life...

Step 1: identify which articles need an green Open Access version
Here, Unpaywall is the right tool, which does a brilliant job at identifying free versions. After all, one of your co-authors may already have self-archived it somewhere. So, yes, I do have a short list, one one of the papers was the second CDK paper (doi:10.2174/138161206777585274). The first CDK article was made CC-BY three years ago, with the ACS AuthorChoice program, but Current Pharmaceutical Design (CPD) does not have that option, as far as I know.

Step 2: check your author rights for green Open Access
The next step is to check SHERPA/RoMEO for your self-archiving rights. This is essential, as this is different for every journal; this is basically business model by obscurity, and without any standardization this is not FAIR in any way. For CDP it reports that I have quite a few rights (more than some bigger journals that still rely on Green to call themselves an "leading open access publisher", but also less than some others):

SHERPA/RoMEO report for CPD.
Many journals do not allow you to self-archive the post-print version. And that sucks, because a preprint is often quite similar, but just not the same deal (which is exactly what closed access publishers want). But being able to post the post-print version is brilliant, because few people actually even kept the last submitted version (again, exactly what closed access publishers want). This report also tells you where you can archive it, and that is not always the same either: it's not uncommon that self-archiving on something like Mendeley or Zotero is not allowed.

Step 3: a post-print version that is not the publisher PDF??
Ah, so you know what version of the article you can archive, and where. But we cannot archive the publisher PDF. So, no downloading of the PDF from the publisher website and putting that online.

Step 4: a custom PDF
Because in this case we are allowed to archive the post-print version, I am allowed to copy/paste the content from the publisher PDF. I can just create a new Word/LibreOffice document with that content, removing the publisher layout and publisher content, and make a new PDF of that. A decent PDF reader allows you to copy/paste large amounts of content in one go, and Linux/Win10 users can use pdfimages to extract the images from the PDF for reuse.

Step 5: why stick with the PDF?
But why would we stick with a PDF? Why not use something more machine readable? Something where that support syntax highlighting, downloading of table content as CSV, etc? And that made me think of my recent experiments with Markdown.

So, I started of with making a Markdown version of the second CDK paper.

In this process, I:

  1. removed hyphenation used to fit words/sentences nicely in PDF columns;
  2. wrapped the code sections for syntax highlighting
  3. recovered the images with pdfimages;
  4. converted the table content to CSV (and used Markdown Tables Generator to create Markdown content) and added "Download as CSV" links to the table captions;
  5. made the URLs clickable; and,
  6. added ORCID icons for the authors (where known).
Preview of the self-archived post-print of the second CDK article.
Step 6: tweet the free Green Open Access link
Of course, if no one knows about your effort, they cannot find your self-archived version. In due time, Google Scholar may pick it up, but I am not sure yet. Maybe (Bio) will help, but that is something I have yet to explore.

It's important to include the DOI URL in that link, so that the self-archived version will be linked to from services like

Next steps: get Unpaywall to know about your self-archived version
This is something I am actively exploring. When I know the steps to achieve this, I will report on that in this blog.