Sunday, March 30, 2014

Linked Open Drug Data: three years on

Almost three years ago I collaborated with others in the W3C Health Care and Life Sciences interest group. One of the results of that was a paper in the special issue around the semantic web conference at one of the bianual, national ACS meeting (look at this nice RDFa-rich meeting page!). My contribution was around the ChEMBL-RDF, which I recently finally published, though it was already described earlier in an HCLS note.

Anyway, when this paper reached the most viewed paper position in the JChemInf journal, and I tweeted that event, I was asked for an update of the linked data graph (the darker nodes are the twelve the LODD task force worked on). A good questions indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STICH are, but not all; CAS is not at all, and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network is down, which hosted many data sets (Open or not). Chem2Bio2RDF seems down too.

Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and integrated a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data which exposes quite a few data sets, some originating from the Berlin node.

Of course, DBPedia is still around too. Also important that more and more data bases themselves provide RDF, like Uniprot which has a SPARQL end point in beta, WikiPathways, PubChem, and ChEMBL at the EBI. And more will come, /me thinks.

I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc), crystal structures (I think COD is working on a RDF version), etc, etc. And, what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?

No comments:

Post a Comment