Saturday, October 08, 2011

An ontology for QSAR and cheminformatics

QSAR and QSPR are the fields that statistically correlate features of chemical substances with (biological) activities (QSAR) or properties (QSPR). The chemical substances can be molecular structures, drugs (which are not uncommonly mixtures), or true mixtures like nanomaterials (NanoQSAR). Readers of this blog know I have been working for many years now towards making these kinds of studies more reproducible.

Parts of this story include the Blue Obelisk Data Repository (BODR), QSAR-ML, the CDK for descriptor calculation, the Blue Obelisk Descriptor Ontology (BODO, doi:), which is still used by the CDK and was in the past used by JOELib too, and much, much more. Really, I still feel that the statistics are by far the easiest bit in QSAR modeling.

New in this list of tools to make QSAR more reproducible is the CHEMINF ontology, which further formalizes cheminformatics computation. In a collaboration with Janna and Christoph (EBI), Michel and Leonid (Carleton University), and Nico (formerly at Cambridge, now at CSIRO), we have cooked up an ontology, and the computational bits of it are captured by the figure below, taken from the paper that just appeared in PLoS ONE.

Both the paper and the ontology have a Creative Commons license. Leonid has already used the ontology in other papers, and I have been using it in the RDF-ed version of ChEMBL.

The next step for me regarding this ontology is to convert BODO to be based on CHEMINF, but highly interesting too is a reformulation of QSAR-ML to be based on CHEMINF. The QSAR markup language was started long before RDF came into the picture, so please forgive us for not using RDF from the start there.

One particularly interesting aspect this ontology captures is the difference between molecular entities and mixtures. Not uncommonly, QSAR studies correlate drugs to their binding affinities, even if those drugs are in fact mixtures of stereoisomers. While 0D, 1D, and 2D descriptors are not affected, geometrical descriptors most certainly are. Moreover, the modeled endpoint is very possibly the property of only one of the stereoisomers, most certainly for binding affinities. Yet, many QSAR study reports in the literature do not record such details. The CHEMINF ontology defines the terms you need to publish such details.
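To make the entity-versus-mixture point concrete, here is a minimal Python sketch. It does not use CHEMINF terms or any real library: the class names `MolecularEntity` and `Mixture` and the function `descriptor_is_well_defined` are hypothetical, made up for illustration only. It simply encodes the statement above that, for a mixture of stereoisomers, 0D to 2D descriptors have a single value while geometrical (3D) descriptors do not.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class MolecularEntity:
    """A single, fully specified structure (hypothetical class, not a CHEMINF term)."""
    name: str


@dataclass(frozen=True)
class Mixture:
    """A substance made up of several molecular entities (hypothetical class)."""
    name: str
    components: Tuple[MolecularEntity, ...]


Substance = Union[MolecularEntity, Mixture]


def descriptor_is_well_defined(substance: Substance, dimensionality: int) -> bool:
    """Return True if a descriptor of this dimensionality (0-3) has one value.

    Assumption: the components of a mixture are stereoisomers of each other,
    so they share composition and connectivity (0D-2D) but not geometry (3D).
    """
    if not 0 <= dimensionality <= 3:
        raise ValueError("dimensionality must be 0, 1, 2, or 3")
    if isinstance(substance, MolecularEntity):
        return True
    # A "mixture" with only one distinct component behaves like a single entity.
    if len(set(substance.components)) <= 1:
        return True
    # Several stereoisomers: only descriptors below 3D are unambiguous.
    return dimensionality < 3


# Illustrative, made-up substances.
r_form = MolecularEntity("(R)-isomer")
s_form = MolecularEntity("(S)-isomer")
racemate = Mixture("racemic drug", (r_form, s_form))

for dim in range(4):
    print(dim, descriptor_is_well_defined(racemate, dim))
```

A QSAR report that records which of these two kinds of substance was modeled lets a reader tell whether a geometrical descriptor, or the measured endpoint, really belongs to one stereoisomer or to the mixture as a whole.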

Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., & Dumontier, M. (2011). The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web. PLoS ONE, 6 (10). DOI: 10.1371/journal.pone.0025513


  1. I was trying to find an ontology for statistical operations, and honestly failed to find a robust one. So, we really need a good and robust statistical learning ontology.