Saturday, April 20, 2013

#ACSNola talk: "An architecture for an Open Science molecular compound database"

About half a year ago I got fed up with the slow progress of Open Data in chemistry. Some initiatives and projects exist, but there is no clear central point of access, which is a particular problem for smaller providers. So I drew up a project plan to change this. There were two other aspects that I wanted to include:
  1. licensing must be explicit, to allow aggregators to know under what conditions they can redistribute that data (or not)
  2. compound databases must start being clear on whether entries are specific compounds and if listed properties are for a specific tautomer (or not)
This is critically important for reasoning over data from multiple data sets, as recently outlined in our Applications of the InChI paper, and for large data integration projects like Open PHACTS.

This presentation covers all the usual suspects, like the Panton Principles, lists some truly Open Data in chemistry (e.g. CrystalEye), and outlines the architecture I am working on. The primary purpose of this project is to provide Linked Open Data for chemistry and to boost this field. Sadly, grant writing interfered with my agenda, and I did not manage to complete the full demo, but the slides contain this real-world screenshot that shows what it looks like (and I expect to put this publicly online in 1-2 months):


By no means is this architecture expected to be as functional as Open PHACTS, or to replace large compound databases like ChemSpider or PubChem. Instead, it is meant as a simple architecture that does two things right and is simple enough to set up that any chemistry lab can do it. The goal is to increase the size of the chemical Linked Open Data network, which is far too small at the moment. I will list LinkedChemistry.info data sets with DataHub.io.

Basically, you set up a SPARQL endpoint with the data you want to share, and the Chemical Compound Box as a PHP front end using ARC2. That's it.
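
To give an idea of what a front end asks of such an endpoint, here is a minimal Python sketch. It is not the actual Chemical Compound Box code (which is PHP with ARC2), just an illustration of the idea. The endpoint URL, the compound IRI, and the choice of rdfs:label and dcterms:license as predicates are all assumptions made for this example. It builds a SPARQL query for one compound, turns it into a request URL following the standard SPARQL protocol, and parses a response in the SPARQL JSON results format. No network call is made; the sample response is hand-written.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and compound IRI, for illustration only.
ENDPOINT = "http://example.org/sparql"
COMPOUND = "http://example.org/compound/1"


def build_query(compound_iri):
    """Return a SPARQL query asking for a compound's label and license."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
        "SELECT ?label ?license WHERE {\n"
        f"  <{compound_iri}> rdfs:label ?label .\n"
        # The license is optional here; an aggregator would want it present.
        f"  OPTIONAL {{ <{compound_iri}> dcterms:license ?license . }}\n"
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query as a GET request, as the SPARQL protocol allows."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(results_json):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    data = json.loads(results_json)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in data["results"]["bindings"]
    ]


# Hand-written sample response (assumed data, not from a real endpoint).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["label", "license"]},
    "results": {"bindings": [{
        "label": {"type": "literal", "value": "methane"},
        "license": {
            "type": "uri",
            "value": "http://creativecommons.org/publicdomain/zero/1.0/",
        },
    }]},
})

if __name__ == "__main__":
    print(build_request_url(ENDPOINT, build_query(COMPOUND)))
    for row in parse_bindings(SAMPLE_RESPONSE):
        print(row["label"], row.get("license", "(no license stated)"))
```

Asking for the license alongside the label reflects the first point above: an aggregator can only decide whether to redistribute an entry if the license is stated explicitly in the data.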

And the slides of the #ACSNola presentation: