Thursday, April 22, 2010

CIP rules for stereochemistry

Uniquely identifying enantiomers is an important aspect of exchanging chemical structure data. The simplest, and most neglected, solution is to pass around 3D models, but a lot of people like to stick to things like SMILES or IUPAC names. Given that we want to represent the stereochemistry uniquely, we can use special rules. One such set of rules for enantiomers is the Cahn-Ingold-Prelog (CIP) rules.

The CDK does not yet have an implementation of (even part of) the CIP rules. However, we recently started a collaboration with Dr Lars Carlsson in the Computational Toxicology, Global Safety Assessment group at AstraZeneca R&D Mölndal, headed by Dr Scott Boyer. Within this collaboration I have started a partial implementation of the CIP rules. The full set of rules is quite extensive, and some subrules are outside the scope of the collaboration. For example, we will likely not look at axial or helical stereochemistry within this collaboration. The kind of thing the implementation can do is distinguish between these mirror images (yeah, I should use Jmol, but ChemPedia needs more plugging right now: click the images):

The current patch does not look into the problem of which atoms are chiral; that problem is quite complex in itself, and Tim is writing up a nice set of blog posts about that. Further, the current work focuses only on atoms of ligancy four; that is, carbons.

The CIP rules uniquely define the stereochemistry of such a carbon by uniquely ordering the ligands around the atom. The ligands are ordered using a set of rules, which define priority based on atomic number, mass number, etc. It is the recursion that makes things more interesting, but I will not delve into the details of the algorithm here (see the Wikipedia page on the CIP rules instead, or a cheminformatics book). Here, I want to introduce some of the API of the current patch for the CDK.
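
To make the first of those rules concrete, here is a minimal sketch in plain Java that is independent of the CDK patch. The class names `SketchAtom` and `CipFirstSphereSketch` are made up for illustration, and the mass numbers are simply those of common isotopes. It only compares the atoms directly attached to the stereocenter (atomic number first, then mass number) and deliberately leaves out the recursion into further spheres.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative stand-in for a ligand atom; this is NOT a CDK class.
final class SketchAtom {
    final String symbol;
    final int atomicNumber;
    final int massNumber;

    SketchAtom(String symbol, int atomicNumber, int massNumber) {
        this.symbol = symbol;
        this.atomicNumber = atomicNumber;
        this.massNumber = massNumber;
    }
}

// Hypothetical helper: first-sphere comparison only, no recursion.
final class CipFirstSphereSketch {
    // Lower priority sorts first: atomic number, then mass number.
    static final Comparator<SketchAtom> PRIORITY =
        Comparator.comparingInt((SketchAtom a) -> a.atomicNumber)
                  .thenComparingInt(a -> a.massNumber);

    // Normalise the comparator result to -1, 0 or 1.
    static int compare(SketchAtom a, SketchAtom b) {
        return Integer.signum(PRIORITY.compare(a, b));
    }

    // Returns the element symbols ordered from lowest to highest priority.
    static List<String> rank(List<SketchAtom> ligands) {
        List<SketchAtom> sorted = new ArrayList<>(ligands);
        sorted.sort(PRIORITY);
        List<String> symbols = new ArrayList<>();
        for (SketchAtom atom : sorted) {
            symbols.add(atom.symbol);
        }
        return symbols;
    }

    public static void main(String[] args) {
        SketchAtom h = new SketchAtom("H", 1, 1);
        SketchAtom d = new SketchAtom("D", 1, 2);
        SketchAtom cl = new SketchAtom("Cl", 17, 35);
        SketchAtom br = new SketchAtom("Br", 35, 79);
        SketchAtom iodine = new SketchAtom("I", 53, 127);

        System.out.println(compare(br, iodine));  // -1: bromine ranks below iodine
        System.out.println(compare(h, d));        // -1: same element, lower mass number
        System.out.println(rank(Arrays.asList(iodine, br, cl, h)));  // [H, Cl, Br, I]
    }
}
```

When two ligand atoms are identical in both numbers this comparison returns 0, and that tie is where the recursive exploration of the substituents would have to take over.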

Ligands and their Priorities
Core to the implementation are the CIP priority rules, which allow the ligands to be ordered. So, we define a molecule and two ligands:
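// parser is assumed to be a CDK SmilesParser created earlier
IMolecule molecule = parser.parseSmiles("IC(Br)(Cl)[H]");
// a ligand is defined by the molecule, the central atom, and the ligand atom
ILigand ligand1 = new Ligand(
  molecule, molecule.getAtom(1), molecule.getAtom(2)
);
ILigand ligand2 = new Ligand(
  molecule, molecule.getAtom(1), molecule.getAtom(0)
);
ISequenceSubRule rule = new CIPLigandRule();
Assert.assertEquals(-1, rule.compare(ligand1, ligand2));
Assert.assertEquals(1, rule.compare(ligand2, ligand1));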
This JUnit test looks at the chiral compound given earlier, but without specifying the stereochemistry using the @@/@ SMILES syntax; we will get to that later. Here, the example defines two ligands around atom 1 (which is the carbon; the index starts at 0). The first ligand is the bromine, the second ligand is the iodine. Because the latter takes priority according to the CIP rules, rule.compare(ligand1, ligand2) returns -1.

The CIPTool
This CIPLigandRule is used in the CIPTool to provide more user-oriented methods. The goal, obviously, is this bit of code:
IMolecule molecule = parser.parseSmiles("ClC(Br)(I)[H]");
LigancyFourChirality chirality = CIPTool.defineLigancyFourChirality(
    molecule, 1, 4, 0, 2, 3, STEREO.CLOCK_WISE
);
Because we do not have 3D coordinates in our SMILES, we define the stereochemistry as CLOCK_WISE or ANTI_CLOCK_WISE. The former means that, looking from the first ligand towards the central atom, the remaining ligands 2, 3, and 4 follow each other in a clockwise turn. This uniquely defines the geometrical orientation, but the label flips between CLOCK_WISE and ANTI_CLOCK_WISE every time two ligands are exchanged. Therefore, we uniquely prioritize the ligands, project, and translate the resulting CLOCK_WISE or ANTI_CLOCK_WISE into the appropriate R or S stereochemistry.
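
The flip-on-every-exchange behaviour can be sketched in a few lines of plain Java. This is not the code from the patch: `ChiralitySketch`, its two enums, and the `assign` method are hypothetical names, and the projection step is replaced here by simply counting swaps while sorting. The sketch assumes all four ranks are different, and it assumes the convention that, with the lowest-priority ligand listed first, a clockwise arrangement of the remaining three in increasing priority corresponds to R.

```java
// Hypothetical helper, not part of the CDK: derives R/S from a ligand
// ordering plus a CLOCK_WISE/ANTI_CLOCK_WISE label by counting exchanges.
final class ChiralitySketch {
    enum Stereo { CLOCK_WISE, ANTI_CLOCK_WISE }
    enum Descriptor { R, S }

    static Stereo flip(Stereo stereo) {
        return stereo == Stereo.CLOCK_WISE ? Stereo.ANTI_CLOCK_WISE : Stereo.CLOCK_WISE;
    }

    // ranks[i] is the priority of the i-th listed ligand (higher number =
    // higher priority); all four values are assumed to be distinct.
    static Descriptor assign(int[] ranks, Stereo stereo) {
        int[] work = ranks.clone();
        Stereo current = stereo;
        // Bubble sort into lowest-priority-first order; every exchange of
        // two ligands flips the clockwise/anticlockwise label.
        for (int pass = 0; pass < work.length - 1; pass++) {
            for (int j = 0; j < work.length - 1 - pass; j++) {
                if (work[j] > work[j + 1]) {
                    int tmp = work[j];
                    work[j] = work[j + 1];
                    work[j + 1] = tmp;
                    current = flip(current);
                }
            }
        }
        // Assumed convention: lowest priority first and clockwise means R.
        return current == Stereo.CLOCK_WISE ? Descriptor.R : Descriptor.S;
    }

    public static void main(String[] args) {
        // Ligands listed as H, Cl, Br, I, ranked here by atomic number.
        int[] listed = {1, 17, 35, 53};
        System.out.println(assign(listed, Stereo.CLOCK_WISE));       // R
        System.out.println(assign(listed, Stereo.ANTI_CLOCK_WISE));  // S
        // Exchanging two ligands in the listing flips the outcome.
        System.out.println(assign(new int[] {17, 1, 35, 53}, Stereo.CLOCK_WISE));  // S
    }
}
```

Only the parity of the number of exchanges matters, so any sorting method that counts its swaps would give the same answer.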

That's all for now. Questions, ideas, and other feedback are most welcome in the comments!


  1. I saw your comment on Twitter about these structures not being on ChemSpider. Search based on the first part of the InChI and you will see three structures. Search on WUHPSARYLVYQOT

  2. Hej ChemSpiderman,

    yes, I see them now. Why do they not show up when you click on 'Isomers', which links to this search page:

    Is that a known bug?

  3. For searching for compounds with the same skeleton...hover over the term "Similar" and you will see a callout box explaining that clicking on Similar does this. However, it does appear that the search on "Isomers" is not working right now. Thanks for highlighting it.

  4. Great stuff ! Hopefully, it will be tested very well. A bad example for missing tests can be found here

    On the other hand, testing CIP rules is much easier ! When you always claim it's "R" - you are right 50% of the time anyway - statistically speaking ! ;-)

  5. Dear Wolfgang,

    the CDK actually has one of the most extensive test suites around, with more than 15k unit tests. We have not reached 100% coverage, but we are working our way there. All new functionality is peer-reviewed and must include unit tests. This works very well.

    Oh, and I do not believe settling for anything but 100% is the goal; I surely hope C-SEARCH is not doing that with its test suite.

    Importantly, just like the CDK itself, the unit tests are freely available, and the unit tests are peer-reviewed too. This way, anyone is able to know what they are using, removing the problem of 'trust'.

    Regarding the NMRShiftDB/ChemSpider issue, you might find the comment I left today in the ChemSpider blog interesting. I actually think the problem is caused by neither a CDK, nor a ChemSpider or NMRShiftDB bug. Instead, from the information I have right now, I believe the problem is caused by 'features' in the SMILES format, and how these are used in practice. Losing information in data exchange is a very common but unnecessary event. As you might know, I have been promoting the use of semantic data exchange languages. The lack of explicit bond orders, and assuming the other party will be able to guess what you meant, seems to be the underlying issue.

  6. wolfgang Robien, 12:43 PM, May 03, 2010

    This comment has been removed by a blog administrator.

  7. wolfgang Robien, 1:30 PM, May 03, 2010

    Dear Egon;

    as a user of a system I am definitely NOT interested WHERE the problem is. I simply see, that there is a problem - benzene at 27.1ppm in CNMR is wrong - hopefully we can agree on that (assuming TMS as standard compound) !
    Is it really an obscene desire to expect a piece of software to be tested with the most common and simple aromatic compound to figure out that error BEFORE the community is bothered ? Software free of errors is a dream, but the most trivial errors should be eliminated BEFORE you put something into the public domain ! As a user I would expect that such an interface is tested AT LEAST with, let's say, 10 different compounds - linear, branched, cyclic, non-aromatic, aromatic, heteroaromatic and some combinations thereof. I understand that some 'esoteric' compounds might cause problems, but benzene and indole are quite common. If your finding is that the problem is within SMILES, then use an alternative ! For me as a user it's completely irrelevant what you use internally for data transfer between ChemSpider and NMRShiftDB - I simply want to see a correct prediction !
    A CNMR signal for benzene at 27.1ppm is not correct - that's exactly what I got today and what everybody can reproduce.
    I hope you got the point !

    Kind regards, Wolfgang

  8. Egon....on NMRSHiftDB it might be easier/better to pass molfiles or InChIStrings instead of SMILES. At least the InChI is a standard...and while not perfect at least it uses standard libraries and would be easy to implement. SMILES is a challenge because there are so many parsers...

  9. Antony, MDL molfiles should work, though I am not sure there is a webservice in place for prediction using such.

    Using InChI has its own problems. It should work fine for spectrum lookup, but... it normalizes for tautomerism, and has the mobile hydrogen concept. How would you expect prediction to work in those situations?