Friday, September 28, 2007

SMILES to become an Open Standard

Craig James wants to make SMILES an open standard, and the proposal has been received with much enthusiasm. SMILES (Simplified Molecular Input Line Entry System) is a de facto standard in chemoinformatics, but its specification is unclear in places, which Craig wants to address. The draft is CC-licensed and will be discussed on the new Blue Obelisk blueobelisk-smiles mailing list.

A good illustration is my own confusion about the lower case element symbols in SMILES. Very often they are seen as indicating aromaticity, but they can also be read as marking sp2 hybridized atoms. I have written up the arguments supporting both views in the CDK wiki. I held the position that lower case elements indicate sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent thread, however, stirred up the debate once more (which led to the aforementioned wiki page).

You can imagine my excitement when I looked up the meaning in the new draft. It states: The formal meaning of a lowercase "aromatic" element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms.


  1. Nice overview, Egon. It'll be interesting to see how this develops.

    Part of the problem will be getting so many different people, each with their own ideas about what SMILES is about and each with their own understanding of poorly-documented SMILES features, to work together.

    InChI was truly an amazing feat when you look at it this way.

    Then again, they started from scratch...

  2. I sat in on a number of the early InChI discussions and asked the question a number of times: why not just make SMILES Open Source? The answer, I believe, is still that Daylight SMILES has capabilities ahead of others. This is only my perception; I don't know the details in reality. Daylight could be community heroes if they rolled out their SMILES code to the world. I am uncertain how much of their IP depends on this format.

    Overall I am very excited to hear that SMILES might be Open Sourced. What will it mean for InChI? It seems there may be a collision...

  3. CSM, this is not so much about open *source* as it is about an open *standard*. The task at hand right now is to make the SMILES standard consistent, so that people can actually implement it from the standard, without having to make guesses from the de facto SMILES Depict standard.

    I am not sure if Daylight is ahead of the (open source) competition; judging that would require access to the source code :) Note that I do not want access to their source code, because that would taint the CDK source code with proprietary ideas.

    That's why the Open Standard is important.

  4. Are you aware of the Daylight licensing of SMILES?

    Please note that there were some issues with MQL, which already has a CDK binding!

    You should solve that question first, and to be honest MQL is more advanced than SMILES, and we have already provided an EBNF notation!

    Are you aware of the discussion between Andrew, Noel and me?

    Do not get me wrong, I like open standards, but why just copy old stuff?

    Cheers, Joerg

  5. Hi Joerg,

    A license does not apply here, unless they have a patent on SMILES; it's a standard, not a piece of software. The possible problem of trademarks on 'SMILES' has come up; see the blueobelisk-smiles mailing list. Regarding the issues with MQL, I think those were of a completely different nature.

    MQL has interesting features, and people are actually suggesting extensions to SMILES, but that is not the stated goal of OpenSMILES version 1. Moreover, MQL is on the level of SMARTS.

    BTW, I do not think OpenSMILES is copying old stuff (MQL sounds more like that); it's just making the standard Open (and resolving a few sources of ambiguity).

    Regarding EBNF: that's a step forward, but it most certainly does not make the difficult parts of SMILES parsing easier: resolving bond orders and detecting aromaticity.

    MQL is promising, but still lacks an open source implementation we can use in the CDK or Bioclipse.