Friday, July 16, 2010

A new CDK default fingerprinter?

The current default fingerprinter in the CDK depends on aromaticity, but that concept is algorithmically difficult to define, and even experimentally there are multiple dimensions to it. Moreover, calculating aromaticity is not cheap, as it requires the detection of ring systems. The reason aromaticity is included at all is this: people expect an ethenol moiety to match phenol.

Now, an alternative is to not use aromaticity, but hybridization information instead: an aromatic bond is basically just a bond between two sp2-hybridized atoms. This removes some algorithmic complexity and speeds up the calculation.
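To make the idea concrete, here is a minimal sketch in plain Python. It is not the CDK code: the molecule representation, the function names (bond_symbol, paths, fingerprint) and the CRC32 hashing are all assumptions made for illustration, and the hybridization labels are written by hand rather than perceived. It only shows how a path-based fingerprint can label bonds from hybridization without any ring detection.

```python
import zlib

# Toy molecules (hypothetical representation, not the CDK data model):
# atoms are (element, hybridization), bonds are (i, j, order).
phenol_atoms = [("C", "sp2")] * 6 + [("O", "sp3")]
phenol_bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1),
                (4, 5, 2), (5, 0, 1), (0, 6, 1)]
ethenol_atoms = [("C", "sp2"), ("C", "sp2"), ("O", "sp3")]
ethenol_bonds = [(0, 1, 2), (1, 2, 1)]

def bond_symbol(atoms, i, j, order):
    """Label a bond from hybridization only; no ring perception is needed."""
    if atoms[i][1] == "sp2" and atoms[j][1] == "sp2":
        return ":"  # treated the way an aromatic bond would be
    return {1: "-", 2: "=", 3: "#"}[order]

def paths(atoms, bonds, max_bonds=3):
    """Enumerate simple paths of up to max_bonds bonds as strings."""
    adj = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        sym = bond_symbol(atoms, i, j, order)
        adj[i].append((j, sym))
        adj[j].append((i, sym))
    found = set()

    def walk(node, visited, tokens):
        # Store one canonical direction so A-B and B-A give the same string.
        found.add("".join(min(tokens, tokens[::-1])))
        if len(visited) > max_bonds:
            return
        for nxt, sym in adj[node]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, tokens + [sym, atoms[nxt][0]])

    for start in adj:
        walk(start, {start}, [atoms[start][0]])
    return found

def fingerprint(atoms, bonds, size=1024):
    """Hash every path to a bit position (CRC32 is an arbitrary choice)."""
    return {zlib.crc32(p.encode()) % size for p in paths(atoms, bonds)}

# Every bit set for ethenol is also set for phenol, so the screen passes.
print(fingerprint(ethenol_atoms, ethenol_bonds)
      <= fingerprint(phenol_atoms, phenol_bonds))  # True
```

In this sketch the C=C bond of ethenol and the ring bonds of phenol get the same ":" label, so ethenol screens as a possible substructure of phenol with no aromaticity model involved.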

The definition of the fingerprint has changed, though: a bond between two sp2-hybridized atoms is not necessarily aromatic. We can therefore expect the new fingerprint to give more false positives when it is used as a screen for substructure search. I'm hoping that Rajarshi can find some time to run this new fingerprint through the excellent analysis he did some time ago.
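A small toy example of such a false positive, again in plain Python and not taken from the CDK: the names path_set and may_contain are hypothetical, the aromaticity flags are set by hand, and path sets are compared directly instead of hashed bits. The query is benzene, the target is the open-chain 1,3,5-hexatriene, which has six sp2 carbons but no ring.

```python
def path_set(atoms, bonds, use_hybridization, max_bonds=5):
    """atoms: (element, hybridization, is_aromatic); bonds: (i, j, order)."""
    adj = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        a, b = atoms[i], atoms[j]
        if use_hybridization:
            special = a[1] == "sp2" and b[1] == "sp2"
        else:
            # Toy shortcut: aromaticity flags are assumed to be precomputed.
            special = a[2] and b[2]
        sym = ":" if special else {1: "-", 2: "=", 3: "#"}[order]
        adj[i].append((j, sym))
        adj[j].append((i, sym))
    found = set()

    def walk(node, visited, tokens):
        # Store one canonical direction per path.
        found.add("".join(min(tokens, tokens[::-1])))
        if len(visited) > max_bonds:
            return
        for nxt, sym in adj[node]:
            if nxt not in visited:
                walk(nxt, visited | {nxt}, tokens + [sym, atoms[nxt][0]])

    for start in adj:
        walk(start, {start}, [atoms[start][0]])
    return found

def may_contain(query_paths, target_paths):
    """Fingerprint-style screen: every query path must occur in the target."""
    return query_paths <= target_paths

benzene_atoms = [("C", "sp2", True)] * 6
benzene_bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2),
                 (3, 4, 1), (4, 5, 2), (5, 0, 1)]
hexatriene_atoms = [("C", "sp2", False)] * 6
hexatriene_bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2)]

for use_hyb in (False, True):
    query = path_set(benzene_atoms, benzene_bonds, use_hyb)
    target = path_set(hexatriene_atoms, hexatriene_bonds, use_hyb)
    print("hybridization" if use_hyb else "aromaticity",
          may_contain(query, target))
```

With aromaticity labels the screen rejects hexatriene as a candidate for containing benzene. With hybridization labels every benzene path also occurs in the chain, so the screen lets it through and the (more expensive) substructure match has to reject it afterwards.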

The source code can be found in my GitHub repository, with the new class HybridOnlyFingerprinter.