Sunday, September 04, 2011

"Cheminformatics? Solved."

I rarely consciously notice the Google Ads in my Gmail inbox, but I guess I do, because my eye fell on this advertisement today:

It is from one of the cheminformatics companies around, and if you wonder why I blog about it, you have a fair point. My point is that I heard a very, very knowledgeable person in the field (she might remember me commenting on it) say this about a year ago in Boston. As you probably assume, I disagreed. But I may be wrong, of course. So, I'll bring it up as a discussion point in my blog.

Seriously, if cheminformatics were solved, why are we still predicting chemical properties so badly? Why isn't molecular docking a push-button technology? Why do virtual screening methods fail so often at an (industrial) level (there is enough literature about this; just google a bit)? And why do so many databases get their chemical structures and mixtures wrong (or even fail to distinguish clearly between them)?

Well, clearly, they have been using software from the wrong company. In fairness, there is something to be said for that. Some software does not take specifications seriously (count the number of tools that implement the MDL molfile to the letter, disregarding the unclear or inconsistent parts). And then there are bugs (let those who claim their software is 100% bug-free cast the first stone). And user requirements (which not uncommonly lead to deliberately breaking established standards).

Sadly, cheminformatics is a field with few gold standards. Try looking around for freely (as in speech) available data sets you can use to test your implementation against. You will not find much (and please email me anything you find; or leave it as a comment in this blog). How many of the software tools you see around report the prediction error in the user documentation, with full detail on validation? How many users actually ask for full disclosure when they negotiate a license with a commercial cheminformatics vendor (really interesting question! I would love to read some numbers based on a survey on that!)? No one really knows or wants to know how well available tools 'solve' cheminformatics (except those who actually wrote the code).
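To make "report the prediction error" a bit more concrete, here is a minimal Python sketch of the kind of numbers I would like to see in user documentation: a root-mean-square error and a mean absolute error against a reference set. The function names are my own, and the values are made-up placeholders, not measurements from any real tool or data set.

```python
import math


def rmse(predicted, reference):
    """Root-mean-square error between predictions and reference values."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted, reference):
    """Mean absolute error between predictions and reference values."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Placeholder numbers only; a real report would use an open reference data set.
reference_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.5, 1.5, 3.5, 3.5]

print("RMSE:", rmse(predicted_values, reference_values))
print("MAE: ", mae(predicted_values, reference_values))
```

The arithmetic is the easy part, of course. The hard part is the freely available reference set to run it against, and a description of how the validation was done.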

Let's put all this aside; there is another aspect. Last year Rich Apodaca organized an important session at the ACS meeting in Boston on new representations in cheminformatics. Why in the world would he be doing that if cheminformatics were solved? Cheminformatics is a field of corner cases. Hydrogens always have a single neighbor, except when they don't. Carbons only have four neighbors, unless there is a double bond when they have three, or a triple bond when they have two neighbors. Except when they have five neighbors. Or have a charge. Or a single electron.
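These corner cases are easy to show with a few lines of Python. The sketch below is deliberately naive and is not how the CDK or any other toolkit does it: it assumes a hypothetical lookup table of textbook valences and accepts an atom only if its bond orders add up to that number. Drawing the bridging hydrogen and the five-neighbor carbon with plain single bonds is a simplification too.

```python
# Textbook total bond order for neutral atoms (illustrative, deliberately incomplete).
NAIVE_VALENCE = {"H": 1, "C": 4}


def naive_valence_ok(element, bond_orders):
    """Return True if the bond orders sum to the textbook valence."""
    return sum(bond_orders) == NAIVE_VALENCE[element]


# Each case: element symbol and the bond order to each neighbor.
cases = {
    "ordinary hydrogen": ("H", [1]),
    "methane carbon": ("C", [1, 1, 1, 1]),
    "ethene carbon": ("C", [2, 1, 1]),          # three neighbors
    "ethyne carbon": ("C", [3, 1]),             # two neighbors
    "bridging hydrogen": ("H", [1, 1]),         # e.g. as drawn in diborane
    "five-neighbor carbon": ("C", [1, 1, 1, 1, 1]),
    "methyl cation carbon": ("C", [1, 1, 1]),   # has a charge
    "methyl radical carbon": ("C", [1, 1, 1]),  # has a single electron
}

results = {
    name: naive_valence_ok(element, bond_orders)
    for name, (element, bond_orders) in cases.items()
}

for name, ok in results.items():
    neighbors = len(cases[name][1])
    verdict = "accepted" if ok else "rejected"
    print(f"{name} ({neighbors} neighbors): {verdict}")
```

The first four cases pass, and the last four are rejected even though chemists draw such species all the time. Every rejected case needs another exception in the rule, which is rather the point.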

Now, you might wonder if the CDK is the solution here. Obviously, if I am aware cheminformatics is not solved, the CDK must at least be doing the best job it can, right? We would love to. With "some" more funding we would have a go at it. But the CDK is behind proprietary products built by cheminformaticians who started long before most CDK developers did. I would not dare judge the accuracy of the "Cheminformatics? Solved." claim based on the CDK project. All we try to do is be transparent in how we try to solve things.

So, that brings me back to the advertisement. Should I really believe that this company solved cheminformatics? Should I trust them to take the field seriously if they claim they did it? These are rhetorical questions, and there is no right answer. I just think the ad was badly captioned.