Saturday, July 26, 2008

CDK Literature #5

Time flies. Another CDK Literature (see also #1, #2, #3, #4). Quite a few papers have been published again, and I'll briefly discuss a few of them.

Detection of IUPAC names
Klinger et al. have written a paper on detection of IUPAC names. As long as semantic markup languages are not the default, this remains important. Remaining problems include correctly finding the boundaries of names in lists of chemicals. The CDK has been used to create SMILES.
Roman Klinger, Corinna Kolárik, Juliane Fluck, Martin Hofmann-Apitius, Christoph M. Friedrich, Detection of IUPAC and IUPAC-like chemical names, Bioinformatics 2008 24(13):i268-i276; doi:10.1093/bioinformatics/btn181

Structure elucidation
Elyashberg, Williams and Martin wrote a review on structure elucidation and discuss Steinbeck's Seneca software, which uses components of the CDK, though the CDK is not directly mentioned.
M.E. Elyashberg, A.J. Williams, G.E. Martin, Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation, Progress in Nuclear Magnetic Resonance Spectroscopy, 2008, 53(1-2):1-104, doi:10.1016/j.pnmrs.2007.04.003

Opensource Distributed Chemical Computing
Karthikeyan et al. have published ChemStar, an opensource distributed chemical computing system, built on top of the Java Remote Method Invocation architecture, which was used by the original Seneca too. The CDK paper and a Fechner/Guha CDK News paper are cited in relation to a ChemStar application of benchmarking QSAR descriptors. The article does not seem to mention the opensource license, nor have I yet found a source package download.
M. Karthikeyan, S. Krishnan, A.K. Pandey, A. Bender, A. Tropsha, Distributed Chemical Computing Using ChemStar: An Open Source Java Remote Method Invocation Architecture Applied to Large Scale Molecular Data from PubChem, J. Chem. Inf. Model., 48 (4), 691–703, 2008, doi:10.1021/ci700334f

Taverna's APIConsumer
Taverna has several means of making functionality available to the workflow engine. SOAP and BioMoby are two prominent ones. The APIConsumer is another one, and is described in this paper. The CDK-Taverna project, led by Thomas Kuhn, is mentioned as another project that uses this approach.
Peter Li, Tom Oinn, Stian Soiland, Douglas B. Kell, Automated manipulation of systems biology models using libSBML within Taverna workflows, Bioinformatics 2008 24(2):287-289, doi:10.1093/bioinformatics/btm578

Docking for Substrate Identification
Favia et al. use docking to recognize interesting substrates for short-chain dehydrogenases/reductases. The CDK's fingerprinter is used to describe intermolecular similarity by calculating the Tanimoto distances between the bit strings.
Angelo D. Favia, Irene Nobeli, Fabian Glaser, Janet M. Thornton, Molecular Docking for Substrate Identification: The Short-Chain Dehydrogenases/Reductases, Journal of Molecular Biology, 2008, 375(3):855-874, doi:10.1016/j.jmb.2007.10.065
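
For readers unfamiliar with the Tanimoto comparison of fingerprint bit strings mentioned above, here is a minimal sketch in plain Java. It does not use the CDK and is not the code from the paper; the class and method names (TanimotoSketch, coefficient, distance) are made up for illustration, and returning NaN for two empty fingerprints is just a convention chosen here. The coefficient is the number of bits set in both fingerprints divided by the number of bits set in either; the distance is taken as one minus that.

```java
import java.util.BitSet;

// Hypothetical helper, not part of the CDK: Tanimoto on java.util.BitSet fingerprints.
final class TanimotoSketch {

    /** Tanimoto coefficient: |A AND B| / |A OR B|. */
    static double coefficient(BitSet a, BitSet b) {
        BitSet common = (BitSet) a.clone();
        common.and(b);                      // bits set in both fingerprints
        BitSet either = (BitSet) a.clone();
        either.or(b);                       // bits set in at least one fingerprint
        int union = either.cardinality();
        if (union == 0) {
            return Double.NaN;              // assumption: undefined when both are empty
        }
        return (double) common.cardinality() / union;
    }

    /** Distance derived from the coefficient: 1 - similarity. */
    static double distance(BitSet a, BitSet b) {
        return 1.0 - coefficient(a, b);
    }

    public static void main(String[] args) {
        BitSet a = new BitSet();
        a.set(0); a.set(1); a.set(2);
        BitSet b = new BitSet();
        b.set(1); b.set(2); b.set(3);
        // two shared bits out of four distinct bits
        System.out.println("coefficient = " + coefficient(a, b));
        System.out.println("distance    = " + distance(a, b));
    }
}
```

In practice the bit sets would come from a fingerprinter rather than being set by hand.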

Wednesday, July 23, 2008

Molecular QSAR descriptors in the CDK

Rajarshi patched trunk last night with his work to address a few practical issues in the molecular descriptor module of the CDK (and I peer reviewed this work yesterday). One major change is that the IMolecularDescriptor calculate() method no longer throws an Exception, but returns Double.NaN instead. The Exception is stored in the DescriptorValue for convenience. This simplifies the QSAR descriptor calculation considerably, and, importantly, makes it more robust to the input, though only by propagating errors into the descriptor matrix. Just make sure your molecular structures have explicit hydrogens and 3D coordinates, and you're fine.
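
To illustrate the pattern (return NaN and keep the exception, instead of throwing), here is a small self-contained sketch. It is not the CDK code: the Result class and the toy "descriptor" below are hypothetical stand-ins for DescriptorValue and a real IMolecularDescriptor, and the input is reduced to a bare array of 3D coordinates.

```java
// Hypothetical sketch of the "NaN instead of Exception" pattern; not the CDK API.
final class DescriptorSketch {

    /** Stand-in for a descriptor result: the value plus the stored exception, if any. */
    static final class Result {
        final double value;
        final Exception exception;

        Result(double value, Exception exception) {
            this.value = value;
            this.exception = exception;
        }

        boolean failed() {
            return exception != null;
        }
    }

    /** Toy descriptor: distance between the first two atoms; needs 3D coordinates. */
    static Result calculate(double[][] coords) {
        try {
            if (coords == null || coords.length < 2) {
                throw new IllegalArgumentException("3D coordinates for at least two atoms are required");
            }
            double dx = coords[0][0] - coords[1][0];
            double dy = coords[0][1] - coords[1][1];
            double dz = coords[0][2] - coords[1][2];
            return new Result(Math.sqrt(dx * dx + dy * dy + dz * dz), null);
        } catch (Exception e) {
            // no exception escapes; the caller gets NaN and can inspect the cause later
            return new Result(Double.NaN, e);
        }
    }

    public static void main(String[] args) {
        double[][][] molecules = {
            { {0, 0, 0}, {3, 4, 0} },   // has coordinates
            null                        // missing coordinates
        };
        // the descriptor column is filled for every molecule; failures show up as NaN
        for (double[][] coords : molecules) {
            Result r = calculate(coords);
            System.out.println(r.value + (r.failed() ? "  (" + r.exception.getMessage() + ")" : ""));
        }
    }
}
```

The loop never needs a try/catch, which is the simplification; the price is that NaN values end up in the matrix and have to be dealt with before modeling.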

Anyway, Rajarshi also added a new page to CDK Nightly to list the available descriptors:

Commercial QSAR modeling? Sorry, already patented...

QSAR was patented in 2001 (US patent 20010049585).

Claim 1:
    A method for predicting a set of chemical, physical or biological features related to chemical substances or related to interactions of chemical substances using a system comprising a plurality of prediction means, the method comprising using at least 16 different individual prediction means, thereby providing an individual prediction of the set of features for each of the individual prediction means and predicting the set of features on the basis of combining the individual predictions, the combining being performed in such a manner that the combined prediction is more accurate on a test set than substantially any of the predictions of the individual prediction means.
They use averaging or weighted averaging of the individual predictions (claim 2). Oh, and just in case you think you are clever and you use 17, 32, etc. individual predictions: sorry, no luck either; you have to use way beyond 1M individual predictions according to the following claim ;)

Claim 2:
    A method according to claim 1, wherein the number of different predictions means is at least 20, such as at least 30, such as at least 40, 50, 75, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10,000, 15,000, 20,000, 30,000, 40,000, 50,000, 100,000, 200,000, 500,000, 1,000,000.
What can I say about this? Please leave your opinion in the comments...

Tuesday, July 22, 2008

Peer reviewed Chemoinformatics: Why OpenSource Chemoinformatics should be the default

The battle for scientific publishing is continuing: openaccess, peer reviewing, how much does it cost, who should pay it, is the data in papers copyrighted, etc, etc.

The battle for chemoinformatics, however, has not even started yet. The Blue Obelisk paper (doi:10.1021/ci050400b) has gotten a lot of attention, and citations. But closed source chemoinformatics is doing fine, and has not really openly taken a standpoint against open source chemoinformatics. Actually, CambridgeSoft just received a good investment. I wonder how this investment will be used, and where the ROI will come from. More closed data and closed algorithms? Focus on services? Early access privileges? At least they had something convincing.

There are many degrees of openness, and many business models. I value open source chemoinformatics, or chemblaics, as I call it. There is a striking similarity between publishing and chemoinformatics. Both play an important role in the progress of science. A big difference is that (independent) peer review of published results is done in scientific publishing, but not generally in chemoinformatics. Surely, algorithms are published... Ah, no; they are not. They are described. Ask any chemoinformatician why this subtle difference is causing headaches...

Let me just briefly stress the difference between core chemoinformatics, and GUI applications. The first *must* be opensource, to allow independent Peer Review; the latter is just nice to have as opensource. Bioclipse is the GUI (doi:10.1186/1471-2105-8-59), while the CDK is our peer-reviewed chemoinformatics library (pmid:16796559). I would also like to stress that the CDK is LGPL, allowing the opensource chemoinformatics library to be used in proprietary GUI software. We deliberately chose this license, to allow embedding in proprietary code. The Java Molecular Descriptor Library of iCODONS is an example of this (that is, AFAIK it's not opensource).

So, getting back to that CambridgeSoft investment. I really hope they seek the ROI in the added value of the user friendly GUI, and not in the chemoinformatics algorithm implementations, which, IMHO, should be peer-reviewed, thus open source. Meanwhile, I will continue working on the CDK project to provide open source chemoinformatics algorithm implementations, for use in opensource *and* proprietary chemoinformatics GUIs.

Thursday, July 10, 2008

Going to Science Blogging 2008: London

On Saturday 30th of August I'll be in London attending the Science Blogging 2008 event. The Monday following that, I'll meet friends at the EBI, but Sunday is empty so far. I'd love to meet up that Sunday, so just ping me if interested.

Oh, and this blog is using RDFa to mark up the event, as discussed here.

Wednesday, July 09, 2008

Chemoinformatics p0wned by cheminformatics...

Noel had 40 people vote on chemoinformatics versus cheminformatics. What do you think?

I have thrown in two extra options: chemblaics (from my blog: chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The big difference between chemblaics and areas such as chem(o)?informatics, chemometrics, computational chemistry, etc, is that chemblaics only uses open source software, making experimental results reproducible and validatable.) and bioinformatics (in case you believe all is life sciences now).

Saturday, July 05, 2008

SVN commit hooks down for CDK and Bioclipse

SourceForge has been playing with system upgrades again, and in an attempt to debug the failing CIA commits on IRC, I reinstalled the hooks for CDK and Bioclipse, so that now all hooks seem to fail, including the email hook... Apparently, it is a known bug, e.g. see this bug report. I assume SF will fix this soon.

On the bright side, I also noted an updated webpage for the SF uptime/problem tracker, where it is also reported that stats are currently down for an upgrade. It also has an RSS feed, which I recommend as a good monitoring tool for SF site problems.

Friday, July 04, 2008

Moving to Sweden: Improving CDK support in Bioclipse

This autumn I will end my current post-doc position at Plant Research International in the Applied Bioinformatics group and at Biometris (both part of Wageningen University), funded by the Netherlands Metabolomics Center (lots of vacancies), where I had a good time, and collaborated in several projects within the NMC with much pleasure.

However, personal circumstances strengthened an older wish of me and my family to seek the adventure of living abroad, and a vacancy was available in the group of Prof. Wikberg. So, we are moving to Sweden. There, I will extend my research on effectively combining chemoinformatics (sometimes misspelled as cheminformatics ;) and chemometrics, as I did in my PhD, which fits well with the development of proteochemometrics methodology and Bioclipse as a platform to transform scientific hypotheses into data queries.