Thursday, February 05, 2009

Where can I host my experimental data? Open Submission Chemistry Databases #1

Rich just posted an interesting read on Web-Centric Science, after a gauntlet thrown down by The Realm of Organic Synthesis (TROS).

I agree that this is still a problem: where can (organic) chemists host their data? TROS hints at Wikipedia, but an encyclopedia is not always the most suitable place for cutting-edge chemistry (articles can easily be biased, contain (science) political views, etc.). I would suggest a blog would be a good start, and if proper markup is used, services like Chemical blogspace would automatically aggregate it.

However, something less volatile might be interesting. So, what we need is an overview of web databases where experimental chemistry data can be hosted. I'll start one, and annotate resources with license, on, using the tags chemistry +web +database +open +submission, and regularly summarize things here.

In the table below, the last column indicates the most liberal license you can use to host your data:

| database | data type | license |
|---|---|---|
| ChemSpider | Structures, links to papers, spectra | open data |
| SORD | Organic Reactions | ? |

There are some obvious gaps here, if you consider a typical experimental section. What to do with a measured melting point, IR spectra, mass spectral information, and measured elemental composition?


  1. Why not just throw ALL the information you are talking about onto ChemSpider? I've done it here using Aniline as an example:

    Check out the description section... I copied your blogpost there and gave examples.

    NOW, I acknowledge that we actually need the MP in the supplementary info section and have it searchable. But look in the supplementary section and you will see we have lots of data. We have an IR spectrum already on that example. The description section IS for Open Notebooking...we just need people to start using it.

  2. I see no reason why not, other than personal preferences (need for a particular license, journal preferences on deposition locations, scope of the database, etc.). Antony, maybe you can summarize some statistics on the types of experimental data there are in ChemSpider (and the amounts, in terms of number of molecules with...)?

    One thing that is going to be important is how data can be searched. We need to be able to ask for all chemical compounds which are within a certain melting point range. This is why I have been so interested in an RDF frontend to ChemSpider, which would provide a clean API to do such things.

  3. You listed SORD as one potential database and put a question mark against the license. I worked with the group doing that while at ACD/Labs: The online version is delivered through the ACD/Web Librarian:

    Unless something changed, this is a commercial database, but contributing organizations will get free access. It's definitely not Open Data.

  4. You commented "need for a particular license, journal preferences on deposition locations, scope of the database, ".

    1) Licenses - we can provide the ability to keep data private if necessary, to provide it as Open Data if users wish and to stamp with different types of CC licenses if necessary. My judgment... the majority of users don't care.

    2) "journal preferences on deposition locations" I think you mean: will certain journals support deposition in ChemSpider? For example, relative to CAS registry? I haven't seen any discussion where journals would not support deposition in ChemSpider but it would be great to start one up. All publishers who have talked with me directly support what ChemSpider is doing.

    3) Scope of the database - this is easy...we're focused on small molecules, primarily organic in nature. If people want to manage their data associated with such datatypes then ChemSpider is ready to help. Again, people just have to step forward and ask what we can do to help. What J is blogging about is easy for us to support. What we are missing right now is SAMPLE centric workflows rather than structure centric.

    I agree with your thought about making MPs searchable. I commented on that in my comment "I acknowledge that we actually need the MP in the supplementary info section and have it searchable." We've discussed RDF'ing a number of times and don't have it yet as we are too busy with other things. There are lists of functionality that the users are asking for and RDF is still way down the list.

    In terms of stats regarding experimental data the spectra are all here (880 spectra in total). There are probably 10,000 structures with measured experimental data (mp, bp etc). I'll be adding the non-aqueous solubility data from the ONS Solubility project as time allows.

  5. Regarding SORD: you are correct. SORD is planned to be freely accessible for academic groups which provide content. Industry and others pay.

    BTW, the topic of this series is 'open submission', not 'open access'. I added the license column because that is my personal interest, but the question was, where can I host experimental data; SORD is such a database.

    Maybe 'proprietary' as most permissive license would be better, but would need to ask Dick more clearly what they would do if they were offered Open Data... would that never enter the database, or like with ChemSpider.

  6. Relative to SORD...I know one of the people involved with SORD personally and know that there are good stringent criteria in place for assembling and qualifying the data. Very necessary.

    The business model, while not inappropriate, will mean that the majority of academic users will NOT have access to the data unless they contribute. So those academics will have to be one of the groups that "pay". SORD are likely to get criticized for this.

    You commented "the topic of this series is 'open submission', not 'open access'." and I'm trying to clarify what that is. What is the definition of Open Submission? Open Data and Open Access are confusing enough, but I'd like to start using this term for some things I'd like to discuss, so it would be good to agree on a definition. For my needs I'd like it to mean "the ability to submit data to an online database", but this is very loose and doesn't address licenses at all, and the term "Open" is generally about licenses.

    ChemSpider today can host experimental data regarding reactions and allows Free access to anyone to search, and the data are Open if declared as such. If the data are from Open Access articles then they are Open by default.

    One thing we do NOT support yet is reaction searching. We do have the ability in ChemMantis to declare every chemical as a reactant, product, catalyst, solvent etc and this will be useful for searching papers in the future but it is not reaction searching.

    "would need to ask Dick more clearly what they would do if they were offered Open Data... would that never enter the database, or like with ChemSpider." It would be good if Open Data would be freely accessible but that will be a nightmare to manage. It WOULD mean, I believe, that data extracted from Open Access papers should be freely available BUT SORD would be re-purposing and creating a derivative work so would not need to make it freely available. CAS do this already when they index Open Access papers...the structures and abstracts etc are Open Access but closed when indexed. It's a complex environment...

    GREAT discussion Egon... would love to continue it over a coffee sometime... maybe in Salt Lake City?

  7. Yes, I was rather confusing when I said 'Open Submission'... my intention here was not to draw a parallel with ODOSOS... my bad :(

    What I wanted to say with 'open submission' is that the submitter is free to contribute. NMRShiftDB allows contribution after sign-up. The CAS registry database is closed for submission (or?), as it is indexed from literature.

    So, following these thoughts, SORD is 'open', because chemists can opt in to submit their thesis.

    Suggestions on a better term? 'Free submission'? 'Public submission'?

    Yes, would be great to meet up. I don't have time, nor funding, to go to Salt Lake... I have listed the conferences I planned to attend on Dopplr:

  8. I much prefer the term "Public Submission" myself. It doesn't have the confusing nature of "open" that proliferates, but clearly declares that Public Submission processes are available.

  9. Picking up the SORD ....
    We were contacted approx. 2 years ago by SORD concerning a possible cooperation and then never heard again from them. I had trouble connecting to very often... Does anyone know some details of the current status of the SORD database? All info dated 2006. The business model strongly relies on the libraries collecting the data at the universities on location. I am wondering if this is any success ....

  10. Oliver, please contact Dick Wife:

    Getting access basically comes down to approving that SORD may index your thesis. Libraries are, AFAIK, not involved in data extraction:

  11. Oliver... if you have a Chemistry thesis that you would be willing to share with me for the purpose of marking up in ChemMantis I'd love to get a copy. Word format preferred.

    Egon - other Public Submission databases. I assume the Protein Databank, Cambridge Crystallographic database and PubChem would all count? What about CrystalEye - I don't know whether that accepts direct public submissions but I have seen some talk about it on PMR's blog as being enabled to do that. And Chemistry@Freebase as Joerg points out.

  12. Freebase is just one idea for structured data collaboration.

    You know that I adore semantics and collaboration, e.g. the distributed chemical blogspace structure fetching.

    Someone mentioned that we should also think about biological information and I think they have a point: small molecules are important, but mainly in the context of biological processes.

    Just allow people to help you and prepare the technical ground, e.g. by allowing a 'correct this' button, comments, tags, semantics, cross-links, ... it is all about social science.
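
The melting point range query asked for in comment 2 can be illustrated with a minimal Python sketch. This is not ChemSpider's API or data model: the `Compound` record, the `in_melting_range` helper, and the three sample records (approximate literature melting points) are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Hypothetical record: a compound name plus one measured melting point."""
    name: str
    melting_point_c: float  # degrees Celsius


def in_melting_range(compounds, low_c, high_c):
    """Return the compounds whose melting point lies in [low_c, high_c]."""
    return [c for c in compounds if low_c <= c.melting_point_c <= high_c]


# Illustrative records only (approximate literature values, not ChemSpider data).
COMPOUNDS = [
    Compound("aniline", -6.0),
    Compound("naphthalene", 80.0),
    Compound("benzoic acid", 122.0),
]

if __name__ == "__main__":
    hits = in_melting_range(COMPOUNDS, 50.0, 130.0)
    print([c.name for c in hits])
```

The point of the sketch is only that a range search needs the melting point stored as a typed, numeric field rather than as free text in a description section; how a real database would expose such a query (RDF or otherwise) is left open here.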