Monday, October 10, 2011

Call for Help: categorizing Open Data repositories for chemistry

Where to host chemistry data? This was the question two people asked a few weeks ago:
I had these two blog posts open in my browser since about the time they were blogged, intending to reply. But I could not come up with a good answer, despite I was hoping to do so. For RDF-based data there are a few options now, such as Kasabi and Science 3.0. Also, for crystallography data there is the Crystallography Open Database, and for quantum chemical calculations there is Quixote. And, of course, annotated NMR spectra can go into the NMRShiftDB.

But for chemistry data in general I do not know a solution. What to do with JDX files, with images of chromatograms? BioTorrents perhaps? But that is mostly for large data sets, and does not have a clear indexing approach. ChemSpider, as Jean-Claude has been doing for spectra (see this YouTube video)? ChemSpider does not have a solution to extracting the Open Data.

These are features such a repository must have:
  1. allows you to specify who is the owner, creator, or similar
  2. allows you to license the data, or, to waive your rights, per Panton Principles
  3. allows users to bulk download Open Data
  4. allows users to automate data extraction
  5. data should be indexed, at least by InChI (which just got a 1.04 release)
  6. support any format
Optionally, these extras are welcome:
  1. semantic annotation of repository content
  2. provide CMLRSS feeds of new content
So, hereby this call for help: let's categorize what repositories are around that fulfill the 6 required features (or come very close). We can start of using regular blogging practices, by blogging solutions, ideas, comments, etc in reply, or use the commenting facilities here. Any activity in this area is appreciated and most welcomed by the community.