Tuesday, October 18, 2011

The Blue Obelisk Shoulders for Translational Cheminformatics

I guess readers of my blog have already heard about it via other channels (e.g. via Noel's blog post), but our second Blue Obelisk paper is out. In the five-ish years since Peter instantiated this initiative, it has created a solid set of shoulders on which to develop Open Source-based cheminformatics solutions. I created the following diagram for the paper, showing how various Blue Obelisk projects interoperate (image is CC-BY, from the paper):

It shows a number of Open Standards (diamonds), one Open Data set (rectangles), and Open Source projects (ovals). What the diagram does not show is the huge number of other cheminformatics projects around that use one or more of the components listed here, but which do not link themselves to the Blue Obelisk directly. And there are many indeed, both proprietary and Open.

I am proud of this diagram: it really shows that the interoperability we set out in the first paper worked out very well! This makes the Blue Obelisk an excellent set of shoulders to do translational cheminformatics.

Translational cheminformatics?? Well, I have been looking for a while for a good term for my research regarding all that hacking on the CDK, Bioclipse, etc. Now, that's the translation of my core molecular chemometrics research to other scientific fields, like metabolomics, toxicology, etc.

Guha, R., Howard, M., Hutchison, G., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J., & Willighagen, E. (2006). The Blue Obelisk - Interoperability in Chemical Informatics. Journal of Chemical Information and Modeling, 46 (3), 991-998. DOI: 10.1021/ci050400b

O'Boyle, N. M., Guha, R., Willighagen, E. L., Adams, S. E., Alvarsson, J., Bradley, J. C., Filippov, I. V., Hanson, R. M., Hanwell, M. D., Hutchison, G. R., James, C. A., Jeliazkova, N., Lang, A. S., Langner, K. M., Lonie, D. C., Lowe, D. M., Pansanel, J., Pavlov, D., Spjuth, O., Steinbeck, C., Tenderholt, A. L., Theisen, K. J., & Murray-Rust, P. (2011). Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics, 3 (1), 37. PMID: 21999342 DOI: 10.1186/1758-2946-3-37


  1. That picture of yours is really great. It clearly illustrates the way in which open source software is developed within an ecosystem of other software, and how certain components (e.g. a cheminformatics toolkit) really enable further developments in the field.

    (BTW, don't forget the article number of the recent paper, 37!)

  2. Yeah, lack of metadata in CrossRef and PubChem... I'll add it manually...

  3. Greg Landrum participates in OpenSMILES, so there should also be an RDKit->OpenSMILES link. You're on there too, so why not a CDK->OpenSMILES one?

  4. Andrew,

    1. the links show how the tools interoperate, not the people. So, me having looked at (asked questions / commented on) OpenSMILES is irrelevant; CDK does not support OpenSMILES formally.

    2. I don't know if RDKit supports OpenSMILES. The authors have been asked to inform me about missing links, so I do not know whether RDKit specifically supports OpenSMILES or just SMILES.

    So, that explains why those links are not there.

    It should be said that since OpenSMILES is so close to the original SMILES, you can in fact read OpenSMILES often with the CDK SMILES parser too, and likely with the RDKit one too. I don't know about the details, and how much noise that causes...

    Students are welcome to apply with me for a project to use the OpenSMILES grammar to create a (new) OpenSMILES parser from scratch :)

  5. In that case, Open Babel reads and writes SMILES which aren't allowed by OpenSMILES; a notable example being its radical notation. OpenSMILES is a restrictive subset of SMILES, and I don't think any OB code changed in order to support it.

  6. They report they do:

    So, this would be a bug, then.

    How can people reproduce it?

  7. I forgot about that. So we "implement the OpenSMILES specification, along with an extension for radicals". How does that sound?

  8. Noel, is there also an option to output strict OpenSMILES, with, say, -oosmi ?

  9. There's no such option. But the OpenSMILES spec does not cover radicals.

  10. Open Babel's SMILES reader supports "[C--]", which is valid Daylight SMILES syntax but invalid OpenSMILES syntax.

    Open Babel's SMILES reader supports "[C-H]", which is valid Daylight SMILES but again invalid OpenSMILES.

    OpenSMILES does not support these forms because 1) no one uses them, and 2) they make parsers more complicated.

    Well, almost no one uses it. RDKit had a data set which used "--". Greg has since updated all of the RDKit SMILES data so they are OpenSMILES compatible. (Hence an arrow from RDKit to OpenSMILES is appropriate.)

    In practice, OpenSMILES codifies the behavior that every tool already exhibits, so it's hard to tell from its output whether a program is specifically influenced by OpenSMILES.

  11. I would count as non-implementation of the spec any example where OB does not correctly read OpenSMILES, or where OB writes a SMILES string which is not valid OpenSMILES.

    (I think this discussion should move to either the opensmiles or the openbabel mailing list...)

  12. Hi Noel - I got off topic in showing that Open Babel (correctly, IMO) handles a superset of OpenSMILES. My point is that there should be an arrow from RDKit to OpenSMILES.

    Greg Landrum has been part of OpenSMILES discussion for years, and changed part of RDKit to reflect that consensus. Not that he had to change much - OpenSMILES codified the best practices of what toolkits like Open Babel and RDKit already did.

    Every SMILES parser on the planet should handle OpenSMILES as input and generate OpenSMILES as output.

    Which makes me wonder about Egon's comment "you can in fact read OpenSMILES often with the CDK SMILES parser too" -- what part of OpenSMILES is not supported by CDK's SMILES parser?
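
The two bracket-atom forms raised in comment 10 can be illustrated with a small check. What follows is only a minimal sketch in Python, assuming a much-simplified bracket-atom pattern (optional isotope, symbol, optional hydrogen count, then an optional single-sign charge) that follows the description in the comments rather than the full OpenSMILES grammar. Chirality and atom classes are left out, and the name `is_strict_bracket_atom` is hypothetical, not part of any toolkit.

```python
import re

# Simplified bracket-atom pattern, assumed for illustration only:
# '[' isotope? symbol hcount? charge? ']'
# Chirality and atom class are deliberately omitted.
BRACKET_ATOM = re.compile(
    r"""
    \[
    (?P<isotope>\d+)?                        # optional isotope number
    (?P<symbol>[A-Z][a-z]?|[a-z]{1,2}|\*)    # element, aromatic symbol, or wildcard
    (?P<hcount>H\d?)?                        # hydrogen count must precede the charge
    (?P<charge>[+-]\d{0,2})?                 # one sign, optionally followed by digits
    \]
    """,
    re.VERBOSE,
)


def is_strict_bracket_atom(text: str) -> bool:
    """Return True if `text` matches the simplified strict pattern above."""
    return BRACKET_ATOM.fullmatch(text) is not None


if __name__ == "__main__":
    for atom in ["[C--]", "[C-2]", "[C-H]", "[CH-]"]:
        print(atom, is_strict_bracket_atom(atom))
```

Under these assumptions, "[C--]" and "[C-H]" are rejected, while the equivalent spellings "[C-2]" and "[CH-]" are accepted. A permissive reader such as the one described for Open Babel would accept all four.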