Saturday, June 16, 2018

Represenation of chemistry and machine learning: what do X1, X2, and X3 mean?

Modelling doesn't always go well and the model is lousy at
predicting the experimental value (yellow).
Machine learning in chemistry, or multivariate statistics, or chemometrics, is a field that uses computational and mathematical methods to find patterns in data. And if you use them right, you can make it correlate those features to a dependent variable, allowing you to predict them from those features. Example: if you know a molecule has a carboxylic acid, then it is more acidic.

The patterns (features) and correlation needs to be established. An overfitted model will say: if it is this molecule than the pKa is that, but if it is that molecule then the pKa is such. An underfitted model will say: if there is an oxygen, than the compound is more acidic. The field of chemometrics and cheminformatics have a few decades of experience in hitting the right level of fitness. But that's a lot of literature. It took me literally a four year PhD project to get some grips on it (want a print copy for your library?).

But basically all methods work like this: if X is present, then... Whether X is numeric or categorical, X is used to make decisions. And, second, X rarely is the chemical itself, which is a cloud of nuclei and electrons. Instead, it's that representation. And that's where one of the difficulties comes in:
  1. one single but real molecular aspect can be represented by X1 and X2
  2. two real molecular aspects can be represented by X3
Ideally, every unique aspect has a unique X to represent it, but this is sometimes hard with our cheminformatics toolboxes. As studied in my thesis, this can be overcome by the statistical modelling, but there is some interplay between the representation and modelling.

So, how common is difficulty #1 and #2. Well, I was discussing #1 with a former collaborator at AstraZeneca in Sweden last Monday: we were building QSAR models including features that capture chirality (I think it was a cool project) and we wanted to use R, S chirality annotation for atoms. However, it turned out this CIP model suffers from difficulty #1: even if the 3D distribution of atoms around a chiral atom (yes, I saw the discussion about using such words on Twitter, but you know what I mean) does not change, in the CIP model, a remote change in the structure can flip the R to an S label. So, we have the exact same single 3D fragment, but an X1 and X2.

Source: Wikipedia, public domain.
Noel seems to have found in canonical SMILES another example of this. I had some trouble understanding the exact combination of representation and deep neural networks (DNN), but the above is likely to apply. A neural network has a certain number if input neurons (green, in image) and each neuron studies one X. So, think of them as neuron X1, X2, X3, etc. Each neuron has weighted links (black arrows) to intermediate neurons (blueish) that propagate knowledge about the modeled system, and those are linked to the output layer (purple), which, for example, reflects the predicted pKa. By tuning the weights the neural network learns what features are important for what output value: if X1 is unimportant, it will propagate less information (low weight).

So, it immediately visualizes what happens if we have difficulty #1: the DNN needs to learn more weights without more complex data (with a higher chance of overfitting). Similarly, if we have difficulty #2, we still have only one set of paths from a single green input neuron to that single output neuron; one path to determine the outcome of the purple neuron. If trained properly, it will reduce the weights for such nodes, and focus on other input nodes. But the problem is clear too, the original two real molecular aspects cannot be seriously taken into account.

What does that mean about Noel's canonical SMILES questions. I am not entirely sure, as I would need to know more info on how the SMILES is translated (feeded into) the green input layer. But I'm reasonable sure that it involves the two aforementioned difficulties; sure enough to write up this reply... Back to you, Noel!