Saturday, April 05, 2014

Every PhD student must use Git (aka research data management)

Last Thursday and Friday the SURFAcademy Masterclass Research Data Management in Nederland took place, and Chris Evelo and I presented some biology-world use cases. He focused more on the larger projects (e.g. ISA-TAB, GSCF, and FAIRPort) while I exposed my day-to-day data management. My day-to-day work habit looks more or less like this.

Day 0 is to think about how to do it, but the answer is pretty simple: use a version control system, like Git. It tracks every bit of what you do, allows for easy backups, and makes it easy to continue working on a different machine in case you forget to take your laptop adapter home :)

  • Day 1: keep an electronic lab notebook (e.g. a version control system; read Git from the Bottom Up)
  • Day 2: carefully select data you build on (can you indeed share it with the rest of your arguments in your next paper?)
  • Day 3: do your research and store everything
  • Day 4: integrate data repositories in your data analyses, e.g. rrdf and knitr
  • Day 5: if you like scientific dissemination, collaboration, and progressing science, share your data in a public repository, like FigShare, Data Dryad, Dutch Dataverse, 3TU.Datacentrum, DANS, etc. (that's a lot of D-D-D-Data...) or in a domain-specific database, like WikiPathways, XMetDb, or DrugMet. Also think about data copyright and licenses and, in particular, whatever you choose, be explicit about it and don't let others guess (wrong).
  • Day 6: think ahead about reuse and suitable formats. Consider semantic web and linked data.
  • Day 7: did you get impact? Think DataCite, ImpactStory, and Altmetric (and ORCID and DOI along the way).
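Days 1 and 3 ("keep a notebook" and "store everything") could be sketched as a tiny end-of-day script. This is only an illustration under assumptions, not the workflow from the slides: the function names `snapshot_commands` and `snapshot` and the commit message format are made up here, and the script assumes that the `git` command is installed and that your user name and email are already configured.

```python
import subprocess
from datetime import date
from pathlib import Path


def snapshot_commands(message):
    """Return the Git commands that record the current state of a notebook.

    Hypothetical helper for illustration: stage everything, then commit.
    """
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
    ]


def snapshot(notebook_dir, message=None):
    """Commit everything in notebook_dir; return False if nothing changed.

    Assumes git is installed and user.name / user.email are configured.
    """
    notebook = Path(notebook_dir)
    # Turn the folder into a repository the first time around.
    if not (notebook / ".git").exists():
        subprocess.run(["git", "init"], cwd=notebook, check=True)
    # 'git status --porcelain' prints nothing when there is nothing to commit.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=notebook, check=True, capture_output=True, text=True,
    )
    if not status.stdout.strip():
        return False
    message = message or f"Lab notebook snapshot {date.today().isoformat()}"
    for command in snapshot_commands(message):
        subprocess.run(command, cwd=notebook, check=True)
    return True
```

Calling `snapshot("/path/to/notebook")` at the end of the day would then store whatever changed; pushing to a remote for backup is left out on purpose.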
And here are the slides:

Tuesday, April 01, 2014

Permission to put Jmol in a paper's Supporting information

A random email correspondence (thanks to the author for asking, giving me the chance to blog the answer!):
    I want readers of a paper I am writing to see several molecules in Jmol. I could instruct them to download Jmol but this is cumbersome as Jmol is 54 Mb. Apparently all that is needed is Jmol.jar, which is only 4.5 Mb. Can I get permission to add a zip file to my paper that contains jmol.jar plus a number of .pdb files?
My answer:
    Dear Jmol user,

    the Open Source license of Jmol defines the permission you ask for.

    In an ordinary world, you would have to ask *all* authors, but one of the virtues of an Open Source license is that you do not need to ask such permission, because the license explicitly provides you with the permission to redistribute the software.

    I cannot make a claim about the PDB files, of which I do not know the source.

    Hoping to have informed you sufficiently,

    with kind regards,

Network of BioThings: the EU hackathon

Very soon an international hackathon will take place: the Network of BioThings, with events in the USA and in Maastricht, The Netherlands. As I am traveling back from the NanoTox 2014 meeting, I will not be able to join in person, sadly, but will try to join online from Eindhoven.

The hackathon includes lunch and pizza, is synchronized between continents, and is aimed at:
  • Hackers and Mentors
  • Biologists, Text Miners and Data Wranglers
  • Ontologists, Terminologists, and Data Linkers
  • Semantic Web novices and experts
  • Systems and Network Biologists
  • Crowdsourcing experts and functional game designers
  • People with skills in Large Text/Data Indexing, Facet Search and Browse, and REST APIs
  • Domain experts to advise on motivating use cases
Registration is open!

Sunday, March 30, 2014

Linked Open Drug Data: three years on

Almost three years ago I collaborated with others in the W3C Health Care and Life Sciences interest group. One of the results of that was a paper in the special issue around the semantic web conference at one of the biannual national ACS meetings (look at this nice RDFa-rich meeting page!). My contribution was around the ChEMBL-RDF, which I recently finally published, though it was already described earlier in an HCLS note.

Anyway, when this paper reached the most viewed paper position in the JChemInf journal, and I tweeted that event, I was asked for an update of the linked data graph (the darker nodes are the twelve the LODD task force worked on). A good question indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STITCH are, but not all; CAS is not at all, and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network, which hosted many data sets (Open or not), is down. Chem2Bio2RDF seems down too.

Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and integrated a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data which exposes quite a few data sets, some originating from the Berlin node.

Of course, DBpedia is still around too. Also important is that more and more databases themselves provide RDF, like UniProt, which has a SPARQL end point in beta, WikiPathways, PubChem, and ChEMBL at the EBI. And more will come, /me thinks.
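For those who have not used a SPARQL end point before: following the SPARQL protocol, a query can be sent as a plain HTTP GET request with the query text in a `query` parameter. Below is a minimal sketch that only builds such a request URL and does not contact any server. The end point address `https://example.org/sparql` is a placeholder and not the address of any of the databases mentioned above, and the function name `sparql_get_url` is made up for this example.

```python
from urllib.parse import urlencode

# Placeholder end point (assumption): replace with the address of a real one.
ENDPOINT = "https://example.org/sparql"

# A generic query that asks for any ten triples in the store.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def sparql_get_url(endpoint, query):
    """Build the GET URL for a SPARQL query (hypothetical helper).

    The query text is URL-encoded and passed in the 'query' parameter.
    """
    return endpoint + "?" + urlencode({"query": query.strip()})


if __name__ == "__main__":
    print(sparql_get_url(ENDPOINT, QUERY))
```

The resulting URL can be opened with any HTTP client; which result formats the server returns depends on the end point.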

I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc), crystal structures (I think COD is working on an RDF version), etc, etc. And, what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?

Friday, March 28, 2014

"Bridging WikiPathways and metabolomics data using the ChEBI ontology"

This week the ChEBI 3rd User Workshop took place, and I presented how WikiPathways is using ChEBI, and how I have been using it in the BridgeDb identifier mapping database for metabolites, and in mapping metabolites to WikiPathways using the ChEBI ontology.