Sunday, August 03, 2008

"The End of Theory: The Data Deluge Makes the Scientific Method Obsolete"

The thought-provoking editorial "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" by Chris Anderson can't have escaped your attention. I was shocked when I read the title and the comments made in the blogosphere and on FriendFeed.

How can he say that?! There is no analysis of data anymore?!? Don't we need to understand why X correlates with Y?!? Etc etc.

So, when I read yet another comment, by Joerg, an open source chemoinformatician I respect, I just had to read the piece myself. Joerg disagrees with the statement from Chris' editorial that
    [c]orrelation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
At first, I would agree with Joerg. It's nonsense; any QSAR modeler can explain in detail the dangers of overfitting, extrapolation, etc, etc. Not to mention that basically no mathematical modeling method can create a statistically significant regression model with fewer than 50-100 chemical structures (depending on chemical diversity, etc).

Ok, back to the editorial. There are some arguments about Google and tons of data. The number of incoming links as a measure of page importance (a brilliant choice, but actually a model, IMHO, which Chris seems to skip over). Tons of data. Oh, mentioned that already.

Mmmmm... but wait. Tons of data? The editorial actually refers to petabytes: Petabytes are stored in the cloud. (Whatever the cloud is... just another buzzword, trademarked too, it seems).

Eureka! Chris is right, Joerg is wrong!

Yes! Then it hit me: Chris is actually correct in his statement, and I was wrong (and Joerg too). If we move away from 50-100 molecules in our QSAR training, and instead use 10k chemically similar molecules, then our modeling approaches (if capable of handling the matrices) would run a much, much smaller risk of overfitting, extrapolation (there is much, much more interpolation now), etc. The chances of getting a random correlation become insignificant! Actually, Chris is making the argument QSAR modelers have been making for decades: we do not know the mode of action in detail, but we can make, given enough training data, a reasonable regression model to predict the action! Joerg and I have been making the same argument as Chris in our PhD theses! We do not need theory; our QSAR regressions make theory obsolete! (Well, surely, we'd still prefer the theory behind the action, but we lack the measuring techniques to see what actually is happening. Joerg, still agreeing with you, so to say ;)
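To make the chance-correlation argument a bit more tangible, here is a minimal sketch, not anything from the editorial or from our theses. It assumes purely random Gaussian "descriptors" and "activities" (so any correlation is chance by construction), and the function names and the choice of 100 descriptors are just my illustration. It reports the best absolute correlation found by scanning the descriptors, once for 50 molecules and once for 10k.

```python
import random


def pearson_r(x, y):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_chance_correlation(n_molecules, n_descriptors, rng):
    """Largest |r| between a random activity and many random descriptors.

    Everything here is noise (an assumption of this toy setup), so whatever
    correlation shows up is a chance correlation.
    """
    activity = [rng.gauss(0, 1) for _ in range(n_molecules)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0, 1) for _ in range(n_molecules)]
        best = max(best, abs(pearson_r(descriptor, activity)))
    return best


rng = random.Random(42)  # fixed seed, so the run is repeatable
small = best_chance_correlation(50, 100, rng)
large = best_chance_correlation(10_000, 100, rng)
print(f"best chance |r| with 50 molecules:  {small:.2f}")
print(f"best chance |r| with 10k molecules: {large:.2f}")
```

With few molecules, scanning enough descriptors tends to turn up a respectable-looking correlation out of pure noise; with 10k molecules the best chance correlation shrinks towards zero. Real descriptors are correlated with each other and real data is not Gaussian noise, so take this as an illustration of the trend only.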

Except for one thing. Joerg and I suggested 'enough' molecules are required for statistically sound regression. Chris, on the other hand, even makes the point that regression is no longer needed at all at the petabyte scale: we just look up what is happening. Does this hold for chemistry? For QSAR? A petabyte of data, at say 10 kB per structure (maybe less if we use InChI and neglect conformer info), amounts to roughly 10^11 structures. About 5000 times ChemSpider, if not miscounting the zeros (we don't care about a ten-fold at this scale anymore). Maybe, maybe not. Maybe chemical space is too diverse for that, considering a petabyte of chemical structures is enormously insignificant compared to the full druggable space (that was about 10^60, wasn't it?)
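For those who want to recount the zeros with me, a back-of-the-envelope sketch. The inputs are the guesses from this post (a decimal petabyte, 10 kB per structure, 10^60 for druggable space), not measured values, and the ChemSpider number printed is only what the '5000 times' figure implies.

```python
# Back-of-the-envelope numbers; all inputs are rough guesses from the post.
PETABYTE_BYTES = 10**15             # decimal petabyte (assumption)
BYTES_PER_STRUCTURE = 10 * 10**3    # "say 10kB" per structure
DRUGGABLE_SPACE = 10**60            # order-of-magnitude estimate from the post

# How many structures fit in one petabyte at that size per structure
structures = PETABYTE_BYTES // BYTES_PER_STRUCTURE

# Database size implied by "about 5000 times ChemSpider"
implied_chemspider = structures // 5000

# Coverage of druggable space: one structure in 10**coverage_exponent
coverage_exponent = len(str(DRUGGABLE_SPACE // structures)) - 1

print(f"structures per petabyte: {structures:.0e}")
print(f"implied ChemSpider size: {implied_chemspider:.0e}")
print(f"coverage of druggable space: 1 in 10^{coverage_exponent}")
```

So even a full petabyte of structures covers about one part in 10^49 of that estimated space, which is the reason for the 'maybe not'.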

But does it not hold at all? This lookup approach is actually commonly used in chemoinformatics! Even at a way-below-petabyte scale: HOSE-code-based NMR prediction is a nice example of this! We do not theorize on the carbon NMR chemical shift, we just look it up!
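A toy sketch of what such a lookup amounts to. This is not how any particular NMR prediction tool implements it: the keys are placeholder tuples standing in for real HOSE codes (one entry per sphere around the atom), the shifts are made-up numbers and not measured data, and `predict_shift` is a hypothetical name. The idea shown is only this: look for the most detailed environment in the table, and fall back to fewer spheres when there is no match.

```python
from statistics import mean

# Hypothetical toy table. Keys are tuples of sphere descriptors (placeholders
# for real HOSE codes); values are lists of MADE-UP shifts in ppm.
SHIFT_TABLE = {
    ("C", "sphere1-a"): [20.0, 22.0],
    ("C", "sphere1-a", "sphere2-a"): [23.0],
    ("C", "sphere1-b"): [128.0, 130.0],
}


def predict_shift(environment, table=SHIFT_TABLE):
    """Look up a shift for an atom environment, most specific match first.

    Returns (mean shift, number of key elements matched), or None when not
    even the shortest prefix of the environment is in the table.
    """
    for depth in range(len(environment), 0, -1):
        key = tuple(environment[:depth])
        if key in table:
            return mean(table[key]), depth
    return None


# Exact match on the full environment
print(predict_shift(("C", "sphere1-a", "sphere2-a")))
# Unknown second sphere: falls back to the first-sphere entry
print(predict_shift(("C", "sphere1-a", "sphere2-x")))
# Nothing known about this atom at all
print(predict_shift(("N", "sphere1-a")))
```

No theory of shielding anywhere in there, just a table and an average. How well it works depends entirely on how much of chemical space the table covers, which brings us right back to the data question above.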

Certainly worth reading, this Wired editorial!

PS. One last remark on the title... I'd say the scientific method is more than just making theories... I feel a bit left out as a data analyst... :( I guess the title should have said 'one of the Scientific Methods'...


  1. Egon

    Very valid points, and a lot of molecular diagnostics falls into the same vein, but underlying all that is some scientific model.

    I love data analysis, but with the idea of trying to figure out why the data analysis is telling us what it does. Without that, data analysis is just a tool.

    I think Chris makes some valid points, and you bring most of them forward, but that was what we did in the past in the absence of data, develop drugs without really knowing why they worked. Now we can at least try.

    You don't need to build a hypothesis at every point, but at some point there has to be some fundamental assumption. That could be something as seemingly trivial as the behaviour of hydrophobic groups.

    Even PageRank is a model. It gives weight to the quality of a link, so I am still not sure where Chris gets that bit from.

  2. Yes, I noted too that Chris very easily skips over the fact that #link-in ~ relevance is a model too.

    His point is that with this much data, correlation is the only model you need.

    I realized when I went to bed last night why I do not like some chemometrics studies... it is those where there is no link to the chemical theory. Data analysis becomes so much more interesting when there is a why behind it. To me, science is a good deal of story telling, falsifiable, but still story telling. And without a good story, conferences would get rather boring, and that would take the fun out of science completely...

  3. Important things first, I am a cheminformatician, I still do not like the 'o' ;-)

    @Deepak: I would say that we know a lot about the drugs on the market; many people spend many years characterizing them. And why 'now'? Does this mean the time since the Chris Anderson article? I think people have analyzed data from the very early days, probably even before he was born; just calling it petabyte data now does not really change the picture. Compared to infinity, a petabyte is still a very small amount.

    @Egon: Models are based on data, though some might be based on theoretic or simplistic principles.
    I think any model must be based on data, otherwise I might have a hard time convincing people. I strongly believe that models cannot be better than the quality or the amount of the data. If there is enough data, then there is no need at all to spend any energy on statistical or molecular modeling. But is this the case?

    Let's face it: how many targets come with X-ray data, how many ligands come with DMPK data, how often do we need to repeat experiments, are all HTS experiments confirmed, do all chemical compounds come with a high purity, how many assays do we need to test, is every compound tested on all possible assays?

    As said before, we need a combination of both (data and models) for innovative scientific thinking and novel developments.

    Models are not only based on data, but can also help with outlier detection and experimental design.

    Even the Google assumption of link ranking is a hypothesis, as is their MapReduce mining. And I cannot believe that their stemming algorithms (another hypothesis) are ready for chemistry.

    Besides, is Google really hosting the web as raw data? I doubt it, and since compression algorithms follow some hypothesis, e.g. information theory, I keep repeating myself: we need a combination of both (data and models) for innovative scientific thinking and novel developments.

    Finally, data is limited, and is Chris Anderson paying for running all the assay development, chemical synthesis and screening campaigns? I think the EBI initiative releasing the Galapagos data is outstanding. But I am not sure if anyone can afford the money needed to buy all the data hosted within all the biotech and pharma companies worldwide.

    And still, would this be sufficient for making new drugs ... and helping patients... we will see.

    I still believe in data, social networks, modeling, transparency, and crowd sourcing!

    Maybe Chris is right and I am wrong. If he is right, he has the same chance of developing a new drug as I have. Chris, go ahead, fair is fair, and best wishes from my side... maybe I should start writing for Wired magazine ...?

  4. I don't agree with Anderson, for these reasons.

    1) How many is enough?
    Can we have such an amount of data that correlation supersedes causation in the field of drug discovery? Can we imagine the number of possible organic compounds and the size of related data?

    2) Curse of dimensionality
    Another big problem is the dimensionality of these data. Without reduction of it using appropriate models and scientific theories, can we correlate them well?

    PS) This comment is a summary of my blog post. You may not be able to read it as it is written in Korean.

  5. I might be missing something but I'm not quite sure where these correlations are coming from (or going to). In drug discovery, we measure stuff (like logP) in the hope that it will be predictive of stuff that's difficult to measure (like fatal toxicity in a phase 2 clinical trial). The link between these surrogates and what we would really like to know is pretty weak and I don't see that changing soon. (Take a look at some of the posts in The Crapshoot labelled 'stamp collecting' to see some of the tricks that people use to convince you that correlations are stronger than they really are.)

    The other point that I'd make in a chem(o)informatic forum is that once you talk about chemically similar compounds, you have introduced a model. You could argue that you introduce a model the minute you key data to chemical structure.