Most of you have already developed a love/hate relationship with the CDKAtomTypeMatcher. This is the class that complains about atom types not being recognized. And let me stress this one more time: when you get such a message, it means that the CDK cannot add hydrogens, cannot calculate QSAR descriptors, etc. I have had the comment that CDK 1.0 never complained about that. True. It just ignored the fact that it did not know what to do with that atom, and calculated (wrong) properties anyway.
The explanation for such a missing atom type warning can be two-fold. First, the input data is wrong, for example because formal charges were lost at some point. A typical example is a neutral four-coordinated nitrogen. Second, the input data is correct and the CDK is wrong. This doesn't happen too often, but it does.
This second case is the topic of this post. The CDK can be 'wrong' in two ways. Either it doesn't know the atom type (though CDK 1.2 and 1.4 have more atom types than CDK 1.0 ever had), or the perception algorithm makes a mistake. The latter is not uncommon, particularly for some elements, like nitrogen. The cause here is the input data, or better, the lack of input data: missing bond orders, missing explicit hydrogens. Now, cheminformatics can make educated guesses, and so the perception algorithm does. But it depends on its education.
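To make the "complain instead of guess" behaviour concrete, here is a minimal sketch in plain Java. It is a toy model, not the CDKAtomTypeMatcher code: the class name, the method signature, and the two rules are assumptions made up for illustration, and the type names only imitate the CDK naming style. It considers single-bonded nitrogens only, and returns nothing for the neutral four-coordinated nitrogen mentioned earlier.

```java
import java.util.Optional;

// Toy model only: this is NOT how the CDKAtomTypeMatcher is implemented.
final class ToyNitrogenPerception {

    /**
     * Returns an illustrative atom type name for a single-bonded nitrogen,
     * or an empty Optional when no rule matches (the "atom type not
     * recognized" situation).
     */
    static Optional<String> perceive(int neighborCount, int formalCharge) {
        if (formalCharge == 0 && neighborCount <= 3) {
            return Optional.of("N.sp3");   // e.g. an amine nitrogen
        }
        if (formalCharge == 1 && neighborCount == 4) {
            return Optional.of("N.plus");  // e.g. an ammonium nitrogen
        }
        return Optional.empty();           // nothing fits: refuse to guess
    }

    public static void main(String[] args) {
        System.out.println(perceive(3, 0)); // amine
        System.out.println(perceive(4, 1)); // ammonium, charge present
        System.out.println(perceive(4, 0)); // ammonium, charge lost in the input
    }
}
```

The last call is the interesting one: the structure is fine chemically, but because the formal charge went missing, no rule applies and the caller gets an empty result rather than a wrong type.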
Educating the CDK basically looks like this commit. It:
- adds a new atom type to the ontology,
- adds code to the CDKAtomTypeMatcher for the perception, and
- adds unit tests to make sure it does the right thing.
An atom type is characterized by:
- the element (duh),
- the formal charge,
- the number of bound neighbors,
- the number of double bond equivalents (piBondCount),
- the number of lone pairs, and
- the hybridization state (sp3, sp1, etc.).
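The six properties listed above can be pictured as one small value object. The sketch below is plain Java and does not use the CDK: the class `AtomTypeSketch`, its field names, and the `matches` method are hypothetical, chosen only to mirror the list. The two example definitions (an amine-like and an ammonium-like nitrogen) are illustrative and are not copied from the CDK ontology.

```java
// Hypothetical value class; names are illustrative, not the CDK API.
final class AtomTypeSketch {

    enum Hybridization { SP1, SP2, SP3 }

    final String name;
    final String element;
    final int formalCharge;
    final int neighborCount;
    final int piBondCount;
    final int lonePairCount;
    final Hybridization hybridization;

    AtomTypeSketch(String name, String element, int formalCharge,
                   int neighborCount, int piBondCount, int lonePairCount,
                   Hybridization hybridization) {
        this.name = name;
        this.element = element;
        this.formalCharge = formalCharge;
        this.neighborCount = neighborCount;
        this.piBondCount = piBondCount;
        this.lonePairCount = lonePairCount;
        this.hybridization = hybridization;
    }

    /** True only when every observed property equals this definition. */
    boolean matches(String element, int formalCharge, int neighborCount,
                    int piBondCount, int lonePairCount,
                    Hybridization hybridization) {
        return this.element.equals(element)
            && this.formalCharge == formalCharge
            && this.neighborCount == neighborCount
            && this.piBondCount == piBondCount
            && this.lonePairCount == lonePairCount
            && this.hybridization == hybridization;
    }

    public static void main(String[] args) {
        // Illustrative definitions: neutral amine-like and charged ammonium-like N.
        AtomTypeSketch amine = new AtomTypeSketch(
            "N.sp3", "N", 0, 3, 0, 1, Hybridization.SP3);
        AtomTypeSketch ammonium = new AtomTypeSketch(
            "N.plus", "N", 1, 4, 0, 0, Hybridization.SP3);

        // An observed nitrogen with four neighbors and a +1 charge.
        System.out.println(amine.matches("N", 1, 4, 0, 0, Hybridization.SP3));    // false
        System.out.println(ammonium.matches("N", 1, 4, 0, 0, Hybridization.SP3)); // true
    }
}
```

A unit test for a new atom type would, in the same spirit, build a small structure and assert that exactly the expected definition matches it.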