Saturday, August 31, 2013

The Dutch Dataverse Network: a host for the ChEMBL-RDF v13.5 data, and some thoughts on workflow integration

Last Thursday, there was a UM library network drink. Since I see a library as a place where knowledge is found, and libraries still rarely think of knowledge as something that can live outside books and papers, I was happy to see the library promoting the Dutch Dataverse Network. So I had to try it, and see if it fulfilled my basic needs:

  1. shows under what conditions people can download, modify, and redistribute data;
  2. has a high visibility on the web; and
  3. is open source and developed by @thedataorg.
And it does. So here is the v13.5 data behind the ChEMBL-RDF paper (which you can also query via this SPARQL end point at Uppsala University):


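As an aside on that SPARQL end point: a minimal sketch of what querying it from R could look like, using the SPARQL package. The endpoint URL below is a placeholder, not the real address (the real one is the Uppsala University end point linked above), so the actual call is left commented out:

```r
# a generic query to peek at the ChEMBL-RDF triples; any SPARQL client
# will do, the SPARQL package is just one option
query <- "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

# library(SPARQL)
# SPARQL("http://example.org/sparql", query)$results  # placeholder URL
```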
Now, Chris Evelo wondered about the purpose of this. Can we import the data into analysis platforms? How does it relate to efforts our group is involved in, like dbNP and others?

Well, my primary reasons for doing the above were to test the system and to serve more than 700 MB of RDF triples. With the testing of the system done, the question now is: can we use it in data analysis platforms? I found this R package by Thomas J. Leeper. It has several methods, some of which are shown in the example code below:

# install.packages("dvn") # the dvn package by Thomas J. Leeper
library(dvn)

search = dvSearch(
  dv="https://www.dataverse.nl/dvn/",
  list(authorName="willighagen")
)

This shows us the handle of my one and currently only data set, hdl:10411/10279. We need some further information, such as the formats of the record's metadata:

formats = dvMetadataFormats(
  dv="https://www.dataverse.nl/dvn/",
  search$objectid[1]
)

When we combine this, we can retrieve the metadata:

metadata = dvMetadata(
  dv="https://www.dataverse.nl/dvn/",
  search$objectid[1],
  format.type=formats$formatName[1]
)

Now, the package has a dvExtractFileIds() method to extract the file names. But the metadata for my record is not compatible with it, and you need this code instead (which likely has to do with me not knowing the Dataverse system in enough detail to use it properly):


# this needs the XML package, which the dvn package uses internally
library(XML)

# extract the file names, identifiers, and URIs from the <otherMat>
# elements of the metadata, analogous to dvn's dvExtractFileIds()
extractOtherMatFileIds <- function(xml)
{
    nodes <- xmlChildren(xmlChildren(xmlParse(xml))$codeBook)
    dscrs <- nodes[names(nodes) == "otherMat"]
    d <- data.frame(matrix(nrow = length(dscrs), ncol = 4))
    names(d) <- c("fileName", "fileId", "level", "URI")
    for (i in seq_along(dscrs)) {
        attrs <- xmlAttrs(dscrs[[i]])
        d$fileName[i] <- xmlValue(xmlChildren(dscrs[[i]])$labl)
        d$level[i] <- attrs[names(attrs) == "level"]
        d$URI[i] <- attrs[names(attrs) == "URI"]
        # the file identifier is only available as part of the URI
        d$fileId[i] <- strsplit(d$URI[i], "fileId=")[[1]][2]
    }
    return(d)
}

So, similar to the dvn package help PDF, we can continue with:

files = extractOtherMatFileIds(metadata)
info <- dvDownloadInfo(
  dv="https://www.dataverse.nl/dvn/",
  files$fileId[1]
)


We are now ready to download the data but, apart from a small bug in the package, we run into a wall:

data <- dvDownload(
  dv="https://www.dataverse.nl/dvn/",
  files$fileId[1]
)

We do not get access, and get this error instead:


Error in dvDownload(dv = "https://www.dataverse.nl/dvn/", files$fileId[1]) : 
  Terms of Use apply.
Data cannot be accessed directly...try using URI from dvExtractFileIds(dvMetadata())


This is despite me marking the data as Public. I do not know the reason yet, but it could have to do with my setting the CC-BY-SA license. The terms of use do indeed apply, but that does not mean an anonymous user cannot download the data. The error seems to originate from info returned here:

dvQuery(
  dv="https://www.dataverse.nl/dvn/",
  verb = "downloadInfo",
  query = files$fileId[1]
)

This code is part of the download function, which returns an XML snippet with this part:

<accessRestrictions granted="false">
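In a workflow, one could test this flag before even attempting the download. A minimal sketch, with accessGranted() as my own hypothetical helper (assumption: only the granted attribute matters for this check, so a regular expression suffices instead of a full XML parse):

```r
# check the granted="..." flag in the downloadInfo XML snippet;
# accessGranted() is a hypothetical helper, not part of the dvn package
accessGranted <- function(xmlSnippet) {
  m <- regmatches(xmlSnippet, regexpr('granted="[^"]*"', xmlSnippet))
  length(m) == 1 && m == 'granted="true"'
}

accessGranted('<accessRestrictions granted="false">')  # FALSE for this record
```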

The workaround suggested in the error message is to just use the URI, with R's built-in download.file():

download.file(
  files$URI[1],
  files$fileName[1]
)

However, this does not return the data file, but an HTML page that allows one to accept the terms of use. Of course, we can use the browser option in several of the methods, but any user interaction makes downloading data in a workflow setting unrealistic.
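At the very least, a workflow should detect this situation instead of silently parsing HTML as RDF. A minimal sketch, with looksLikeHtml() as my own hypothetical helper (not part of the dvn package), that peeks at the first line of the downloaded file:

```r
# check whether the "downloaded data" is actually the terms-of-use HTML
# page, by peeking at the first line of the file
looksLikeHtml <- function(path) {
  firstLine <- tolower(paste(readLines(path, n = 1, warn = FALSE), collapse = ""))
  grepl("<!doctype html|<html", firstLine)
}
```

If this returns TRUE for a file that should contain RDF, the workflow can stop with a clear error rather than continue with the click-through page.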

Now, it should be possible to automate that, no? We should be able to instruct the dvn package and the Dataverse Network to always accept Creative Commons licenses, right?