Saturday, May 08, 2010

CIP rules #2: parsing @ and @@ from SMILES

I recently wrote about a project for a (partial) CIP implementation. This implementation is in place, and we are working towards setting up an extensive test suite. The data set we had in mind was available as SMILES and as MDL molfiles. Now, the latter format does not really specify the stereochemistry of the tetrahedral centers itself, and relies on wedge bonds instead. Actually, a few years ago Jonathan Brecher wrote up the IUPAC recommendation for the use of wedge bonds for chirality specification (doi:10.1351/pac200678101897), with 74 pages of rules and examples, like the following (copyright by authors or journal; I'm claiming fair use):

So, using wedges leaves plenty of room for incorrectly specifying the stereochemistry. Therefore, we decided to go for SMILES, even though Noel recently showed that processing stereochemistry in SMILES is not trivial either. The SMILES I am currently using:
  • Br[C@@H](Cl)I
  • Br[C@H](Cl)I
  • Br[C@@]([H])(Cl)I
  • Br[C@]([H])(Cl)I
  • [C@]12(OC1)NCN2
  • C[C@H](O)[C@H](O)C
  • NC([C@H](O)C)Cl
  • I1.Cl2.Br3.[C@]123CCC
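To make clear what "parsing @ and @@" means at the lowest level, here is a minimal sketch in Python that only pulls the tetrahedral tags out of bracket atoms. It is not the CIP implementation discussed in this post; the names `CHIRAL_ATOM` and `chiral_tags` are made up for illustration, and the pattern assumes simple bracket atoms like the ones in the list above.

```python
import re

# Hypothetical helper pattern: optional isotope, element symbol,
# '@' or '@@', optional hydrogen count, then anything up to ']'.
CHIRAL_ATOM = re.compile(r"\[(\d*)([A-Z][a-z]?|[a-z])(@@?)(H\d*)?[^\]]*\]")


def chiral_tags(smiles):
    """Return (symbol, tag, hydrogen_count) for each tagged bracket atom."""
    found = []
    for match in CHIRAL_ATOM.finditer(smiles):
        _, symbol, tag, hydrogens = match.groups()
        if hydrogens is None:
            h_count = 0          # e.g. [C@]
        elif hydrogens == "H":
            h_count = 1          # e.g. [C@H]
        else:
            h_count = int(hydrogens[1:])
        found.append((symbol, tag, h_count))
    return found


if __name__ == "__main__":
    for smi in ["Br[C@@H](Cl)I", "Br[C@]([H])(Cl)I", "C[C@H](O)[C@H](O)C"]:
        print(smi, chiral_tags(smi))
```

The tag alone says nothing yet: it only gets a meaning together with the order in which the neighbours appear in the SMILES string, which is where the corner cases come from.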
I'm looking for more corner cases... please leave them as comments.


  1. How about:
    as a counterpart to your last SMILES

    is quite a nasty example of two tetrahedral stereocentres.

    While these kinds of examples might not be immediately clear by eye, they aren't that bad at all from an algorithmic perspective: in all cases you can find the atomRefs4 by just reading the SMILES in the usual way, resolving each ring opening to an actual atom when you encounter the matching ring closure.

  2. Daniel, particularly the second is interesting... but I'm happy to report that both work out-of-the-box :)
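The ring-closure resolution described in the first comment can be sketched roughly as follows. This is a toy reader written for illustration, not the code used in the implementation: the function name `neighbour_orders` is hypothetical, it only understands bracket atoms, a few organic-subset symbols, branches, single-digit ring closures and dots, and it assumes that an implicit hydrogen on a tagged atom is listed right after the preceding atom.

```python
import re

# Tokens: bracket atom, organic-subset atom, ring digit, branch/dot, bond.
TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|\d|[().]|[-=#:/\\]")
BRACKET = re.compile(r"\[\d*(?:[A-Z][a-z]?|[a-z])(@@?)?(H\d*)?")
BONDS = set("-=#:/\\")


def neighbour_orders(smiles):
    """Return (atoms, centres); centres maps atom index -> (tag, neighbours)."""
    atoms, neighbours, tags = [], [], {}
    open_rings = {}   # ring digit -> (atom index, reserved slot in its list)
    stack = []        # atoms to return to when a branch closes
    prev = None
    for tok in TOKEN.findall(smiles):
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok == ".":
            prev = None
        elif tok.isdigit():
            if tok in open_rings:
                # Ring closure: fill the slot reserved at the ring opening.
                other, slot = open_rings.pop(tok)
                neighbours[other][slot] = prev
                neighbours[prev].append(other)
            else:
                # Ring opening: reserve a slot at this position.
                open_rings[tok] = (prev, len(neighbours[prev]))
                neighbours[prev].append(None)
        elif tok in BONDS:
            continue  # bond orders do not change the neighbour order
        else:
            idx = len(atoms)
            atoms.append(tok)
            neighbours.append([])
            if prev is not None:
                neighbours[prev].append(idx)
                neighbours[idx].append(prev)
            match = BRACKET.match(tok)
            if match and match.group(1):
                tags[idx] = match.group(1)
                if match.group(2):
                    neighbours[idx].append("H")  # implicit hydrogen
            prev = idx
    centres = {i: (tag, neighbours[i]) for i, tag in tags.items()}
    return atoms, centres


if __name__ == "__main__":
    for smi in ["I1.Cl2.Br3.[C@]123CCC", "[C@]12(OC1)NCN2"]:
        atoms, centres = neighbour_orders(smi)
        for idx, (tag, refs) in centres.items():
            labels = [r if r == "H" else atoms[r] for r in refs]
            print(smi, idx, tag, labels)
```

The key design choice is the reserved slot: a ring-opening digit takes its place in the neighbour list immediately, and the actual atom is filled in only once the matching closure shows up, so the order always follows the position of the digit at the stereocentre.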