Saturday, September 05, 2009

NMRShiftDB RDF #1: Spectra by InChI

Originally, I wanted to include a SPARQL query in yesterday's blog post showing how to retrieve NMRShiftDB spectra based on an InChIKey, but it failed horribly. I have yet to discover why. This morning I discovered that the problem is specific to that field, and that the same approach with an InChI is no problem:


  1. I included your new resource into Bio2RDF today... I figured out a workaround for the InChIKey issue using FILTER(REGEX(?inchikey, "inchikey")). It turns out that the workaround doesn't work for InChIs, though; those do work the simple way. I am not sure where the bug is, but you could try getting it sorted out on the Virtuoso mailing list.

  2. Ah, that was quick! How did you copy the triples so quickly from my Virtuoso to yours?

    Something else... Are you using rewrites to map the REST-like URLs to queries with full SPARQL queries?

    BTW, I also note that I need to update the foaf:homepage for the molecules and spectra, which are still pointing to a server which was recently removed from the NMRShiftDB network...

    OK, what would be my next step to improve integration into Bio2RDF? I would also like to know how we will keep the data synchronised. There is a lot of information in the original SQL database which I have not yet mapped to RDF. And I want to align with Nico Adams' ChemAxiom, which may require some changes in the graphs...

  3. I didn't copy the triples across. I am using your SPARQL endpoint live (although the triples look different).

    I am using rewrites, via a Java package called urlrewrite, but they do not map one REST URL to one SPARQL query. The rewrite just pushes the REST information into my JSP file, where it is processed.

    Keeping the data synchronised is an issue that we haven't completely figured out. One of the easiest ways would be not to export the RDF to a separate store, but to map your SQL database into RDF live.

    I haven't actually experimented with these approaches before, but a few of them are available. One that a few of the LODD sources have used is D2R Server. There are also the Virtuoso RDF Views, but I am not sure whether that one requires the data to be in the Virtuoso server.

    If you change to D2R then you can do all of the basic SPARQL queries, but aggregation (such as count(?spectrum) etc.) won't work.

    One alternative is to use D2R to regularly dump the information, load the RDF dump into a new Virtuoso instance, then point the visible SPARQL endpoint to that instance and delete the old one. That way you would keep the downtime on the endpoint to a minimum. If you do this then the aggregation will still work.

    As long as you have a public SPARQL endpoint, the Bio2RDF REST URL resolvers can use that endpoint as their source of information for these URIs. If you have dumps available in the future, though, it would reduce the number of queries your server would have to handle for the Bio2RDF URIs.

  4. It would also be useful for the Virtuoso database to enable full-text indexing.
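
The two lookup styles discussed in this thread, a direct literal match (which works for InChI) and the FILTER(REGEX(...)) workaround (which worked for InChIKey), might be sketched as follows. This is only an illustration of the query shapes, not the query actually used: the predicate IRIs under example.org, the variable names and both helper functions are assumptions made up for this sketch, and the real NMRShiftDB RDF vocabulary will differ.

```python
# Hypothetical predicate IRIs: placeholders, NOT the real NMRShiftDB vocabulary.
INCHI = "<http://example.org/chem#inchi>"
INCHIKEY = "<http://example.org/chem#inchikey>"
HAS_SPECTRUM = "<http://example.org/chem#hasSpectrum>"


def escape_literal(value: str) -> str:
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def exact_match_query(predicate: str, value: str) -> str:
    """The 'simple way': match the identifier as a literal in the triple pattern."""
    return (
        "SELECT ?mol ?spectrum WHERE {\n"
        f'  ?mol {predicate} "{escape_literal(value)}" ;\n'
        f"       {HAS_SPECTRUM} ?spectrum .\n"
        "}"
    )


def regex_filter_query(predicate: str, pattern: str) -> str:
    """The workaround: bind the value to a variable and filter it with REGEX.

    The pattern is passed through as a regular expression, so characters
    that are special in regex syntax are not neutralised here.
    """
    return (
        "SELECT ?mol ?spectrum WHERE {\n"
        f"  ?mol {predicate} ?value ;\n"
        f"       {HAS_SPECTRUM} ?spectrum .\n"
        f'  FILTER(REGEX(?value, "{escape_literal(pattern)}"))\n'
        "}"
    )


if __name__ == "__main__":
    # Methane's InChI, used purely as a short example value.
    print(exact_match_query(INCHI, "InChI=1/CH4/h1H4"))
    print()
    # "EXAMPLE-KEY" is a made-up stand-in, not a real InChIKey.
    print(regex_filter_query(INCHIKEY, "EXAMPLE-KEY"))
```

The REGEX form has to scan and test every candidate value, so where both forms work the literal match is usually the cheaper one for the store to evaluate.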