Wednesday, August 04, 2010

Is there an Open Specification for structure normalization?

Over at the Blue Obelisk eXchange, I just posted this question:
    Normalization is an important step in many cheminformatics workflows. Picking the right representation for a nitro-group, for example.
    Are there best practices here? Should we initiate an Open Specification for normalization steps that should be performed? This would greatly increase the reproducibility in cheminformatics…
Please post your ideas, comments, etc.


  1. I'm not sure that would help, since the kind of normalization you want depends on the user and on the software being used. If your software can't handle salt data in the molfile (like the layout or the qsar.descriptor modules of the CDK), you want the salt data in a separate field. If your software can handle it, you probably want it in the CT (connection table) for duplicate checking. The same goes for tautomers, stereochemistry and so on.

  2. I know Paul Dobson showed his Pipeline Pilot standardizing workflow.
    The workflow was used in his Metabolite-likeness publication.

  3. I do not know exactly what you mean by structure normalization, but there exist guidelines from IUPAC on how to draw organic compounds.

  4. Anonymous, I am thinking more of things like how to represent a nitro-group, whether acid groups should be charged or neutral, whether bond orders should be localized or delocalized, etc. This differs from one application to another, but it could be useful if a standard were set for each application domain, which would make computational results easier to compare.

  5. You mean normalization like "The IUPAC Chemical Identifier – Technical Manual", Chapter IVb?

    What people often do not understand is that a connection table that is useful for cheminformatics can be rendered as various depictions. The InChI approach therefore seems to me a good basis (although I know it is a "dictionary" approach, not a fundamental algorithmic solution), and it does not contradict the depiction of doi:10.1351/pac200880020277 GR8.1.
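
To make the nitro-group example from the question and from comment 4 concrete, here is a minimal sketch of what a per-application normalization profile might look like. Everything in it is an assumption made for illustration: the profile names, the `PROFILES` table and the `normalize` function are hypothetical, and the rules are plain text rewrites on SMILES strings rather than operations on a connection table. A real toolkit would match substructures on the molecular graph, so this only shows the idea of a named, ordered rule set.

```python
# Toy normalization profiles (hypothetical names, not from any toolkit).
# Each profile is an ordered list of (description, find, replace) rules
# that are applied as plain text rewrites on a SMILES string.
PROFILES = {
    "charge_separated_nitro": [
        ("nitro as [N+](=O)[O-]", "N(=O)=O", "[N+](=O)[O-]"),
    ],
    "pentavalent_nitro": [
        ("nitro as N(=O)=O", "[N+](=O)[O-]", "N(=O)=O"),
    ],
}


def normalize(smiles, profile):
    """Apply the rules of one profile in order.

    Returns the rewritten SMILES and the descriptions of the rules
    that actually changed something, so a run can be documented.
    """
    applied = []
    for description, find, replace in PROFILES[profile]:
        if find in smiles:
            smiles = smiles.replace(find, replace)
            applied.append(description)
    return smiles, applied


# Two spellings of nitromethane end up as the same string under one profile.
a, rules_a = normalize("CN(=O)=O", "charge_separated_nitro")
b, rules_b = normalize("C[N+](=O)[O-]", "charge_separated_nitro")
print(a, rules_a)  # C[N+](=O)[O-] ['nitro as [N+](=O)[O-]']
print(b, rules_b)  # C[N+](=O)[O-] []
print(a == b)      # True
```

The point of the sketch is that the rule list, not the code, is what an Open Specification would have to pin down: which rules apply, in which order, for which application domain. Publishing the profile name together with the results would let someone else reproduce the same input structures.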