Saturday, April 05, 2014

Every PhD student must use Git (aka research data management)

Last Thursday and Friday the SURFAcademy Masterclass Research Data Management in Nederland took place, and Chris Evelo and I presented some biology-world use cases. He focused on the larger projects (e.g. ISA-TAB, GSCF, and FAIRPort), while I presented my day-to-day data management. My day-to-day work habit looks more or less like this.

Day 0 is to think about how to do it, but the answer is pretty simple: use a version control system, like Git. It tracks every bit of what you do, allows for easy backups, and makes it easy to continue working on a different machine in case you forget to take your laptop adapter home :)
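As a rough illustration of what "tracks every bit of what you do" can look like in practice, here is a minimal sketch that commits everything in a notebook folder once you are done for the day. It assumes Git is installed and configured with your name and email; the helper names `git` and `snapshot` and the folder name `notebook` are hypothetical, not part of any existing tool.

```python
import subprocess
from datetime import date
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its standard output."""
    result = subprocess.run(
        ["git", *args], cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def snapshot(repo: Path, message: str = "") -> bool:
    """Commit every change in `repo`; return False if nothing changed."""
    repo.mkdir(parents=True, exist_ok=True)
    if not (repo / ".git").exists():
        git(repo, "init")  # first use: turn the folder into a repository
    git(repo, "add", "--all")  # stage new, modified, and deleted files
    if not git(repo, "status", "--porcelain").strip():
        return False  # empty status output means there is nothing to commit
    git(repo, "commit", "-m", message or f"Lab notebook {date.today().isoformat()}")
    return True
```

Calling `snapshot(Path("notebook"))` at the end of the day records the state of the folder. Pushing the repository to a remote afterwards (with `git push`) is what gives you the backup and lets you pick up the work on another machine.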

  • Day 1: keep an electronic lab notebook (e.g. a version control system; read Git from the Bottom Up)
  • Day 2: carefully select data you build on (can you indeed share it with the rest of your arguments in your next paper?)
  • Day 3: do your research and store everything
  • Day 4: integrate data repositories in your data analyses, e.g. with rrdf and knitr
  • Day 5: if you like scientific dissemination, collaboration, and progressing science, share your data in a public repository, like FigShare, Data Dryad, Dutch Dataverse, 3TU.Datacentrum, DANS, etc. (that's a lot of D-D-D-Data...) or in a domain-specific database, like WikiPathways, XMetDb, or DrugMet. Think about data copyright and licenses and, in particular, whatever you choose, be explicit about it and don't let others guess (wrong).
  • Day 6: think ahead about reuse and suitable formats. Consider semantic web and linked data.
  • Day 7: did you get impact? Think DataCite, ImpactStory, and Altmetric (and ORCID and DOI along the way).
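
One way to be explicit about the license (Day 5) is to ship a small machine-readable file with the data itself. The sketch below is only an illustration: the file name `MANIFEST.json`, its fields, and the function `write_manifest` are hypothetical and not a format required by any of the repositories mentioned above. It records the creator, a license identifier, and a SHA-256 checksum for every file.

```python
import hashlib
import json
from pathlib import Path

MANIFEST_NAME = "MANIFEST.json"  # hypothetical file name, chosen for this sketch


def write_manifest(data_dir: Path, license_id: str, creator: str) -> Path:
    """Write a manifest stating the license and a checksum for each data file."""
    files = {}
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST_NAME:
            # Key each checksum by the file's path relative to the data folder.
            relative = path.relative_to(data_dir).as_posix()
            files[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    manifest = {"creator": creator, "license": license_id, "files": files}
    out = data_dir / MANIFEST_NAME
    out.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    return out
```

For example, `write_manifest(Path("data"), "CC0-1.0", "Your Name")` would state that the data is released under CC0, so nobody has to guess. The checksums let others verify that they are reusing exactly the files you shared.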
And here are the slides: