Monday, June 10, 2019

Preprint servers. Why?

Recent preprints from researchers
in the BiGCaT group.
Henry Rzepa recently asked us: ChemRxiv. Why? I strongly recommend reading his ponderings. I agree with a number of them, particularly the following point. To follow the narrative of the joke "how many article versions does it take for knowledge to disseminate?", the answer sometimes seems to be: "at least three, to make enough money off the system".

Now, I tend to believe preprints are a good thing (see also my interview in Preprintservers, doen of niet?, C2W, 2016. PDF): "release soon, release often" has served open science well. In that sense, a preprint can be just that: a form of open notebook science.

However, just as we suffer from data dumps in open source software, we now see exactly the same with (open access) publishing. Is the paper ready to be submitted for peer review? Oh, let's quickly put it on a preprint server. I very much agree with Henry that the last thing we are waiting for is a third version of a published article. This is what worries me a great deal in the "green Open Access" discussion.

But it can be different. For example, people in our BiGCaT group are building up a routine of posting papers just before conferences. The oral presentation then gives a layman's outline of the work, and people who really want to understand what the work is about can read the full paper. Of course, a manuscript alone may not be sufficient for that, so the preprint should itself support open science.

But importantly, a preprint is not a replacement for a proper CC-BY-licensed version of record (VoR). If the consensus is that that is what preprints are about, then I'm no longer a fan.

Tuesday, May 21, 2019

Scholia: an open source platform around open data

Some 2.5 years ago Finn Nielsen started Scholia. I have been blogging about it a few times, and thanks to Finn, Lane Rasberry, and Daniel Mietchen, we were awarded a grant by the Alfred P. Sloan Foundation to continue working on it (grant: G-2019-11458). I'll tweet more about how it fits the infrastructure to support our core research lines, but for now just want to mention that we published the full proposal in RIO Journal.

Oh, just as a teaser and clickbait, here's one of the use cases: dissemination of knowledge of metabolites and chemicals in general (full poster):

Saturday, May 18, 2019

LIPID MAPS: mass spectra and species annotation from Wikidata

Part of the LIPID MAPS classification
scheme in Wikidata (try it).
A bit over a week ago I attended the LIPID MAPS Working Group meeting in Cambridge, as I became a member of Working Group 2: Tools and Technical Committee in autumn. That followed a fruitful effort by Eoin Fahy to make several LIPID MAPS pathways available in WikiPathways (see this Lipids Portal), e.g. the Omega-3/Omega-6 FA synthesis pathway. It was a great pleasure to attend the meeting and meet everyone, and I learned a lot about the internals of the LIPID MAPS project.

I showed them how we contribute to WikiPathways, particularly in the area of lipids. Denise Slenter and I have been working on having more identifier mappings in Wikidata, among which those for lipids. Some results of that work were part of this presentation. One of the nice things about Wikidata is that you can make live Venn diagrams, e.g. compounds in LIPID MAPS for which Wikidata also has a statement about which species it is found in (try it):

SELECT ?lipid ?lipidLabel ?lmid ?species ?speciesLabel
       ?source ?sourceLabel ?doi
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         p:P703 ?speciesStatement .
  ?speciesStatement prov:wasDerivedFrom/pr:P248 ?source ;
                    ps:P703 ?species .
  OPTIONAL { ?source wdt:P356 ?doi }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
}

A second query searches for lipids for which mass spectra are also found in MassBank (try it):

SELECT
  ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) as ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
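
Queries like the two above can also be run from a script instead of the query service web page. What follows is only a minimal sketch in Python using the standard library: it assumes the public Wikidata SPARQL endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results format. The helper names (build_request_url, rows_from_sparql_json, run_query) and the User-Agent string are made up for this illustration, and the label language is set to plain "en" because the [AUTO_LANGUAGE] placeholder is meant for the web interface.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service SPARQL endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The MassBank query from this post, with the label language fixed to "en".
MASSBANK_QUERY = """
SELECT ?lipid ?lipidLabel ?lmid
  (GROUP_CONCAT(DISTINCT ?massbanks) AS ?massbank)
WHERE {
  ?lipid wdt:P2063 ?lmid ;
         wdt:P6689 ?massbanks .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en".
  }
} GROUP BY ?lipid ?lipidLabel ?lmid
"""


def build_request_url(query, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{endpoint}?{params}"


def rows_from_sparql_json(payload):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query, user_agent="lipid-query-sketch/0.1 (example)"):
    """Send the query over the network; this is not called on import."""
    request = urllib.request.Request(
        build_request_url(query), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_sparql_json(json.load(response))
```

Calling run_query(MASSBANK_QUERY) would then give one dictionary per lipid, with the keys lipid, lipidLabel, lmid, and massbank.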


Saturday, May 04, 2019

Wikidata, CompTox Chemistry Dashboard, and the DSSTOX substance identifier

The US EPA recently published a paper about the CompTox Chemistry Dashboard (doi:10.1186/s13321-017-0247-6). Some time ago I worked with Antony Williams and we proposed a matching Wikidata identifier. When it was accepted, I used an InChIKey-DSSTOX identifier mapping data set by Antony (doi:10.6084/M9.FIGSHARE.3578313.V1) to populate Wikidata with links. Over time, as more InChIKeys were found in Wikidata, I would use this script to add additional mappings. That resulted in this growth graph:

Source: Wikidata.
Now, about a week ago Antony informed me he had worked with someone from Wikipedia to have the DSSTOX identifier automatically show up in the ChemBox, which I find awesome. It's cool to see your work on about 38 thousand (!) Wikipedia pages :)
Part of the ChemBox of perfluorooctanoic acid.
(I'm making the assumption that all 38 thousand Wikidata pages for chemicals have Wikipedia equivalents, which may be a false assumption.)
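
The mapping step described above (match InChIKeys already in Wikidata against the InChIKey-DSSTOX data set, then add the missing links) could look roughly like the sketch below. This is not the script mentioned in the post, just an illustration under assumptions: it assumes P3117 is the Wikidata property for the DSSTOX substance identifier, it writes tab-separated QuickStatements-style lines, and the function name and all example identifiers are made up.

```python
# Assumption: P3117 is the Wikidata property for the DSSTOX substance identifier.
DSSTOX_PROPERTY = "P3117"


def missing_dsstox_statements(dsstox_by_inchikey, qid_by_inchikey, already_linked):
    """Return tab-separated QuickStatements-style lines for items lacking a DSSTOX link.

    dsstox_by_inchikey: InChIKey -> DSSTOX identifier (from the mapping data set)
    qid_by_inchikey:    InChIKey -> Wikidata item ID (e.g. from a SPARQL query)
    already_linked:     set of item IDs that already carry the identifier
    """
    lines = []
    for inchikey, qid in sorted(qid_by_inchikey.items()):
        dtxsid = dsstox_by_inchikey.get(inchikey)
        # Skip items without a mapping and items that are already linked.
        if dtxsid is None or qid in already_linked:
            continue
        lines.append(f'{qid}\t{DSSTOX_PROPERTY}\t"{dtxsid}"')
    return lines


# Made-up placeholder data, only to show the shape of the inputs.
example_dsstox = {"KEYONE": "DTXSID0000001", "KEYTWO": "DTXSID0000002"}
example_qids = {"KEYONE": "Q1000001", "KEYTWO": "Q1000002", "KEYTHREE": "Q1000003"}
example_lines = missing_dsstox_statements(example_dsstox, example_qids, {"Q1000002"})
```

Only items that have a mapping and do not yet carry the identifier end up in the output, so rerunning it after more InChIKeys appear in Wikidata only yields the new links.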

Wednesday, April 24, 2019

Open Notebook Science: the version control approach

Jean-Claude Bradley pitched the idea of Open Notebook Science, or Open-notebook science as the proper spelling seems to be. I have used notebooks a lot, but ever since I went digital, the use went down. During my PhD studies I still used them extensively. But in the process I changed my approach, influenced by open source practices.

After all, open source has a long history of version control, where commit messages explain why some change was made. People who have ever looked at my commits know that they tend to be small, and that my messages describe the purpose of each commit.

That is my open notebook. It is essential to record why a certain change was made and what exactly that change was. That is trivial with version control. Mind you, version control is not limited to source code. Using the right approaches, data and writing can easily be tracked with version control too. Just check, for example, my GitHub profile. You will find journal articles being written and data being collected, just as if they were equal research outputs (they are).

Another great example of version control for writing and data is provided by Wikipedia and Wikidata. Now, some changes I found hard to track there: when I asked the SourceMD tool (great work by Magnus Manske) to create items for books, I wanted to see the changes made. The tool did link to the revisions made at some point, but this service integration seems to break down now and then. Then I realized that I could use the EditGroups tool directly (HT to whoever wrote that), and found this specific page for my edits, which includes not just those via SourceMD but also all edits I made via QuickStatements (also by Magnus):

If only I could give a "commit message" with each QuickStatements job I run. Can I?