Monday, October 10, 2011

Call for Help: categorizing Open Data repositories for chemistry

Where to host chemistry data? This was the question two people asked a few weeks ago:
I have had these two blog posts open in my browser since about the time they were posted, intending to reply. But I could not come up with a good answer, although I was hoping to. For RDF-based data there are a few options now, such as Kasabi and Science 3.0. For crystallography data there is the Crystallography Open Database, and for quantum chemical calculations there is Quixote. And, of course, annotated NMR spectra can go into the NMRShiftDB.

But for chemistry data in general I do not know of a solution. What to do with JDX files, or with images of chromatograms? BioTorrents perhaps? But that is mostly for large data sets, and it does not have a clear indexing approach. ChemSpider, as Jean-Claude has been using for spectra (see this YouTube video)? ChemSpider does not have a solution for extracting the Open Data.

These are features such a repository must have:
  1. allows you to specify who is the owner, creator, or similar
  2. allows you to license the data, or to waive your rights, per the Panton Principles
  3. allows users to bulk download Open Data
  4. allows users to automate data extraction
  5. indexes the data, at least by InChI (which just got a 1.04 release)
  6. supports any format
Optionally, these extras are welcome:
  1. semantic annotation of repository content
  2. CMLRSS feeds of new content
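
To make the six required features a bit more concrete, here is a minimal Python sketch of what a deposit record and its index could look like. Everything in it is hypothetical: the `Deposit` and `Repository` names, the in-memory storage, and the set of license labels treated as open are assumptions for illustration, not the design of any existing repository.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Deposit:
    """One deposited file plus the metadata the required features ask for."""
    filename: str   # feature 6: any format (JDX spectrum, chromatogram image, ...)
    payload: bytes  # stored as opaque bytes, so the format does not matter
    creator: str    # feature 1: owner, creator, or similar
    license: str    # feature 2: a license or waiver label, e.g. "CC0"
    inchis: list = field(default_factory=list)  # feature 5: index keys


class Repository:
    """Hypothetical in-memory repository, not an existing service."""

    def __init__(self):
        self._deposits = []
        self._by_inchi = defaultdict(list)

    def add(self, deposit):
        """Store a deposit and index it under each of its InChIs."""
        self._deposits.append(deposit)
        for inchi in deposit.inchis:
            self._by_inchi[inchi].append(deposit)

    def find_by_inchi(self, inchi):
        """Features 4 and 5: scripted lookup by InChI."""
        return list(self._by_inchi.get(inchi, []))

    def bulk_download(self, open_licenses=("CC0", "PDDL")):
        """Feature 3: return every deposit whose license label is open.

        The default label set is an assumption made for this sketch.
        """
        return [d for d in self._deposits if d.license in open_licenses]


repo = Repository()
repo.add(Deposit("methane.jdx", b"##TITLE=methane", "Alice", "CC0",
                 ["InChI=1S/CH4/h1H4"]))
repo.add(Deposit("trace.png", b"\x89PNG", "Bob", "All rights reserved",
                 ["InChI=1S/H2O/h1H2"]))

for deposit in repo.bulk_download():
    print(deposit.filename, deposit.creator, deposit.license)
```

The point of the sketch is that the file itself stays an opaque blob, while ownership, license, and InChI live in separate metadata fields that can be queried and filtered automatically.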
So, hereby this call for help: let's categorize which repositories are around that fulfill the 6 required features (or come very close). We can start off using regular blogging practices, by blogging solutions, ideas, comments, etc. in reply, or use the commenting facilities here. Any activity in this area is appreciated and most welcomed by the community.


  1. Hello Egon,

    For a few months now I have had my own chemo- and bioinformatics company (Mind the Byte), and we are currently working on a project called iMols, which will be online in a few weeks (I hope!).

    iMols is a web platform that integrates different chemo- and bioinformatics databases and tools. At the beginning, though, the service will be more focused on chemoinformatics (because that is the research field where I have more experience).

    As I said, iMols will integrate different databases and will allow users to work with them, doing things such as searching for similar compounds, grouping compounds into sets, downloading, and searching by descriptors or by activity against a given protein.

    All data will be stored using standard keys (InChIKey for chemicals and SwissProt ID for proteins), and the service will include a social network shell allowing users to share information and knowledge.

    Finally, different levels of users (free and not free) will be available, depending on the services you want to use.

  2. Alfons, will it support any arbitrary binary data file?