Saturday, November 20, 2010

Why you have not heard much from me about chemometrics recently...

A casual reader may not know the background of the title of my blog. A bit over five years ago, when I started this blog, I defined chemblaics:
    Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The general denominator seems to be molecules, but I might be wrong there. The big difference between chemblaics and areas such as cheminformatics, chemoinformatics, chemometrics, proteochemometrics, etc., is that chemblaics only uses open source software, making experimental results reproducible and validatable. And this is a big difference with how research in these areas is now often done.

Later, I also identified molecular chemometrics (doi:10.1080/10408340600969601) when I reviewed important innovations in the field, which has, IMHO, a strong overlap with chemblaics. Any reader of my blog will understand that I see semantic technologies playing a very important role here: Open Standards for communicating Open Data between the bench chemist and the data analyst, using Open Source software, allowing others to reproduce, validate, and extend that work. Some identified the possibilities the internet brings over 10 years ago, while the use of semantic computing goes back even further.

And what have the big publishers done? Nothing much yet. Not the old, not the new. There are projects ongoing, and there is a tendency (BioMed Central is starting to deliver Open Data, the Beilstein Institute has been publishing RDF for a few years now, the Royal Society of Chemistry has Project Prospect), but most publishers are too late, and not investing enough with respect to their yearly turnover. This is particularly clear if you realize that citizen and hobby scientists can innovate publishing more effectively than those projects (really!). Anyway, I do not want to talk about publishing now, but it is just so relevant: I am not a publisher, but publications are the primary source of knowledge (not implying at all that I think that is the best way; it is not).

Instead, I am a data analyst, a chemometrician, a statistician, a cheminformatics dude, or a (pharmaceutical) bioinformatician, depending on your field of expertise. Really, I am a chemblaics guy: I apply and develop informatics and statistics methods to understand chemistry (and biology) better.

During my PhD it became painfully clear that current science is horribly failing, in many ways:
  • Firstly, we are hiring the wrong people, because we care more about a co-authored pipetting paper in Nature than about ground-breaking work in J. Chem. Inf. Model. (what journal?! Exactly my point!).
  • Secondly, we have our brightest scientists (the full-time professors, assuming some have been hired for the right reasons) spend most of their time on administrative work (like writing proposals and administrating big national/EU projects).
  • Thirdly, we spend millions (in whatever currency) on large projects which end in useless political discussions instead of getting science done.
  • Finally, all knowledge from laborious, hard work is placed in PDF hamburgers and lost to society (unless you spend multibucks to extract it again).

There are likely several more, but these four are the most important to me right now.

So, in the years since finishing my PhD research, which was on data mining and modeling molecular data, I have spent much of my time on improving methods in chem- and bioinformatics to handle data. Hardly anyone else was doing it (several Blue Obelisk community un-members being prominent exceptions), but someone has to.

Why have I been doing this? Well, without good, curated data sources it is impossible to decide why my (or others') predictive models are not working (as well as we want them to). Is this relevant? I would say so, yes! The field is riddled with irreproducible studies of which one has no clue how useful they are. Trust the authors who wrote the paper? No, thank you; I'd rather verify: I am a scientist and not a cleric. Weirdly, one would have expected this to be the default in cheminformatics, where most stuff is electronic and reproducing results should be cheap. Well, another fail for science, I guess.

So, that explains why I have not done so much in chemometrics recently. Will I return? Surely! Right about now. There is already a paper in press where we link the semantic web to (proteo)chemometrics, and more is to follow soon.

One example, interestingly, is pKa prediction, which has seen quite a few publications recently, yet experimental pKa data is not available as Open Data. Why?? Let me know if you have any clue. Yet pKa prediction seems to be important to drug discovery, as it gets an awful lot of attention (400+ papers in the past 10 years, of which 50+ in 2010!). But this is about to change. Samuel and I are finishing a project that greatly simplifies knowledge aggregation and curation, as input to statistical modeling. We now have the tools ready to do this fast and efficiently. Right now, I am entering curated data at a speed of about 3 chemical structures a minute. That means, given that I need a break now and then, that I can create a data set of reasonable size in a few days. Crowd-sourcing this, a small community can liberate data from literature in a few days.
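A back-of-the-envelope sketch of that curation arithmetic, in Python. The 3-structures-a-minute rate is from the text; the effective hours per day and the data set size are my own assumptions:

```python
# Curation throughput, using the rate mentioned in the post.
STRUCTURES_PER_MINUTE = 3      # observed manual curation speed (from the text)
EFFECTIVE_HOURS_PER_DAY = 5    # assumption: allows for breaks now and then

def structures_per_day(rate_per_min=STRUCTURES_PER_MINUTE,
                       hours=EFFECTIVE_HOURS_PER_DAY):
    """Structures one curator can enter in a working day."""
    return rate_per_min * 60 * hours

def days_for_dataset(size, curators=1):
    """Days needed to curate `size` structures, shared among `curators`."""
    per_day = structures_per_day() * curators
    return -(-size // per_day)  # ceiling division

print(structures_per_day())            # a single curator's daily output
print(days_for_dataset(2500))          # solo: a few days
print(days_for_dataset(2500, curators=10))  # crowd-sourced: much faster
```

With these assumptions one curator liberates on the order of 900 structures a day, which is why a reasonably sized data set fits in a few days, and a small crowd finishes in one.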

This will have a huge impact on the cheminformatics and QSAR communities. They will no longer have any excuse for not making their data available. There is no argument anymore that curation is expensive. This will also have a huge impact on cheminformatics and chemical data vendors. Where Open Source has had only moderate impact so far (several software vendors have already joined the Open Source cheminformatics community), this will force them to rethink their business model. Where they could hide behind curation costs when it came to text mining initiatives (like Oscar, on which I am currently working), with cheap, expert knowledge building at hand they will be forced to rethink their added value.

The impact on the CDK should be clear too. We will no longer depend on published models for ALogP, XLogP, pKa, etc. predictions. Within a year, you can expect the CDK project to release the tools to train your own models, and make choices suitable for your user base. For example, you can make more precise models around the structures your lab works on, or more generic models for large screening projects. Importantly, the community will provide an Open Data knowledge base to start from. Using our Open Standards, you can plug in your own confidential data and make mixed, targeted models.

Is this possible with the cheminformatics of the past 30 years? No, and that's the reason why I have been away from chemometrics for a while.


  1. One point was really strange for me: why is data closed in PDF? Why don't the journals have rules requiring results to be annotated in a structured form?

    I just have 2 answers:
    1. They don't want to waste precious time.
    2. The scientific community doesn't want the results to be easily interpretable (and so neither do the journals). Sounds crazy, yeah.

    For me it was simply clear that every publication should include the results, and the process of getting those results, in annotated form, even if there are no rules for them.

  2. Answer 1 is funny. Do you know how many hours a scientist spends on getting the layout of a submission correct?

    Examples of data 'lost' in PDFs include images of 2D diagrams instead of connection tables.

    Also, I think publishers should have started validating chemistry on submission, instead of making sure italics are used properly in the bibliography. BTW, going into that area, why don't they just ask for a list of DOIs to fix 80% of all cited literature in chemistry papers?

    And why don't they have a simple web form to create (parts of) the experimental section of organic chemistry journals? Like proper formatting of NMR spectra? Why did they not start requiring chemical data to be deposited publicly? Why did they not start a deposition system themselves? For 10 years now I have been wondering why they have not monetized experimental data, while they do not allow someone else to extract the data... (is the CAS paying them *that* much?)

    There are many things they could have done, which they have not. Some publishers have. CIF files *are* validated, in many ways.

    Saving precious time? No, I don't buy that argument... if they had done their work right, they would have both annotated the results *and* saved the author time.

  3. "Why data is closed in PDF?"

    - Because publishing has been traditionally targeted at human readers;

    - Because there is no unambiguous way to annotate data. Even if pKa data is extracted automatically, how does one set the metadata right, at this moment, not in the bright new future of the semantic web (e.g. what experimental method was used to measure it)?

    > Even if pKa data is extracted automatically, how does one set the metadata right, at this moment, not in the bright new future of the semantic web (e.g. what experimental method was used to measure it)?

    Indeed! That's why I started *manually* extracting this data from the literature (and making it available with semantic web technologies), which allows us to annotate pKa values with experimental error and the methods/protocols used to measure them.

    The tools we have make it possible to do this at a reasonable speed, making the process suitable for manually curating large volumes of data.
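A minimal sketch of the kind of annotation described above, modeled as plain RDF-style subject/predicate/object triples in Python. The URIs, property names, and the error value are made up for illustration; the real work uses proper semantic web vocabularies (the pKa of acetic acid itself is the well-known ~4.76):

```python
# Annotating a single pKa measurement with value, error, and method,
# as RDF-style triples. All example.org URIs are illustrative only.
EX = "http://example.org/"

measurement = f"{EX}measurement/1"
triples = [
    (measurement, f"{EX}compound", f"{EX}compound/acetic-acid"),
    (measurement, f"{EX}pKaValue", 4.76),
    (measurement, f"{EX}experimentalError", 0.02),   # made-up error bar
    (measurement, f"{EX}method", f"{EX}method/potentiometric-titration"),
]

def describe(subject, store):
    """Collect all property/value pairs recorded for one subject."""
    return {p.rsplit("/", 1)[-1]: o for s, p, o in store if s == subject}

print(describe(measurement, triples))
```

Because the method and error travel with the value, a modeler downstream can filter, weight, or reject measurements instead of trusting a bare number scraped from a PDF.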