Wednesday, January 30, 2008

Why chemistry-rich RSS feeds matter...

Peter wrote up an item on Nick's CrystalEye's RSS feed, and I have been enthusiastic about chemistry-enriched RSS feeds for some time. CMLRSS has the chemical data inline in the RSS; see DOI:10.1021/ci034244p, the use of CMLRSS in Chemical blogspace described here and here, and the CMLRSS support in Bioclipse.

Nick's RSS feed does not put the chemistry inline, but does link to the raw CML file:
<title>No title supplied</title>
<link rel="enclosure" href="" hreflang="en" />
<!-- much more, that I skipped for brevity -->
The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline, that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services.
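Such an agent does not need much. As a minimal sketch, assuming the feed is well-formed XML and that the raw CML is always advertised with a `<link rel="enclosure">` element as in the fragment above, the following plain Java picks those links out of a feed. The class name `EnclosureFinder` and the example.org URLs are made up for illustration; this is not code from CrystalEye.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EnclosureFinder {

    /** Returns the href of every link element with rel="enclosure" in the feed XML. */
    public static List<String> enclosureLinks(String feedXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(feedXml)));
        NodeList links = doc.getElementsByTagName("link");
        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            // Only enclosure links point at the raw CML file.
            if ("enclosure".equals(link.getAttribute("rel"))) {
                hrefs.add(link.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical feed entry; the URLs are placeholders.
        String feed = "<feed><entry><title>No title supplied</title>"
            + "<link rel=\"enclosure\" href=\"http://example.org/entry.cml\" hreflang=\"en\"/>"
            + "<link rel=\"alternate\" href=\"http://example.org/entry.html\"/>"
            + "</entry></feed>";
        System.out.println(enclosureLinks(feed));
    }
}
```

Each returned link can then be downloaded and handed to whatever check the agent is supposed to run.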

Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)).

Done? Checked it? You saw the problem, right? Good.

I have scanned the CIF source, but that does not seem to contain the problem. It nicely shows a general limitation of commonly used chemoinformatics tools: the lack of proper atom typing (a problem I have been looking into for the Chemistry Development Kit; see Atom Typing in the CDK and Evidence of Aromaticity).

You will have noted that the 2D diagram in Peter's blog is charged. I checked the complete CML source code for the CrystalEye entry, and that contains the charges on the two oxygens bound to the copper too. However, the copper is not charged. That leads to a rather unlikely situation: such a crystal structure would attract about the whole laboratory to itself in the blink of an eye, because there is nothing to balance the double-negative charge! It is conveniently summarized in this bit of the CML:
<formula formalCharge="-2" concise="C 10 H 6 Cu 1 N 4 O 4 -2">
<atomArray elementType="C H Cu N O" count="10.0 6.0 1.0 4.0 4.0"/>
Now, I also checked the raw CML; that seems to be unaffected too. So, the bug must be somewhere in the software that converts the raw CML into complete CML. And it must happen before the InChI calculation, because that one is wrong too. An agent scanning the RSS feed would have detected this. Someone interested in writing up a grant proposal on this?
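The check itself could be as simple as the sketch below. It assumes that the overall charge is always given as the `formalCharge` attribute on the first `<formula>` element, as in the fragment above, and that a complete crystal structure should be neutral; anything else is flagged for a human to look at. The class and method names are my own, and the parser ignores XML namespaces for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ChargeCheck {

    /** Returns the formalCharge of the first formula element, or 0 if none is given. */
    public static int formalCharge(String cml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(cml)));
        NodeList formulas = doc.getElementsByTagName("formula");
        if (formulas.getLength() == 0) {
            return 0;
        }
        // getAttribute returns an empty string when the attribute is absent.
        String charge = ((Element) formulas.item(0)).getAttribute("formalCharge");
        return charge.isEmpty() ? 0 : Integer.parseInt(charge.trim());
    }

    /** A structure with a non-zero net charge deserves a second look. */
    public static boolean isSuspicious(String cml) throws Exception {
        return formalCharge(cml) != 0;
    }

    public static void main(String[] args) throws Exception {
        // The formula fragment from the post, closed so that it is well-formed.
        String cml = "<formula formalCharge=\"-2\" concise=\"C 10 H 6 Cu 1 N 4 O 4 -2\">"
            + "<atomArray elementType=\"C H Cu N O\" count=\"10.0 6.0 1.0 4.0 4.0\"/>"
            + "</formula>";
        System.out.println("net charge: " + formalCharge(cml)
            + ", suspicious: " + isSuspicious(cml));
    }
}
```

A real agent would have to be smarter than this, for example about structures that really do consist of separately listed ions, but even this naive test would have caught the entry discussed here.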

BTW, the system is not awfully wrong: the negative charge on the acidic carboxyl groups is to be expected. If the bond between the oxygen and the copper had been coordinating, not covalent, and the copper had been +2, then it would have been fine. Because many chemoinformatics tools do not really have support for dative bonds, a covalent bond could be drawn, but then the oxygens should be uncharged... right? :)

Oh, and surely, one can do much, much more with those feeds. I blogged about that earlier in Automatic Classification of thousands of Crystal Structures.

Wednesday, January 23, 2008

My PhD Thesis: in color and grayscale

Wednesday is my regular day off from my metabolomics work, and today I am finalizing the layout of my thesis, which I'll defend on April 2. The print version will feature grayscale images with some of them in color too. However, the PDF version that will end up in our university repository should have color prints. So, while halfway through creating suitable grayscale versions of the images, I realized I was not doing it properly. I was replacing the images; so, I lost the color versions. Not good.

But wait, LaTeX can do more; why not have a color and a grayscale option? Here comes optional.sty. By adding \usepackage{optional} I can add to the source (from book.tex):
\caption{a) 2D diagrams of the two possible resonance structures of a compound
with a phenyl ring. Both diagrams refer to the same compounds, but the depicted
graph representations are not identical. b) 2D diagram of ferrocene, which,
like all organometallic compounds,
is difficult to represent with classical chemoinformatics approaches.}
Ferrocene was already black-and-white, so no worry about that. And, it is just the red colored hydroxyl group. But it serves the point :)

Which then allows me to run pdflatex to create a color version and a grayscale version:
pdflatex "\def\UseOption{color}\input{book}"
pdflatex "\def\UseOption{grayscale}\input{book}"

/me is happy

Sunday, January 20, 2008

Java Server Pages with CDK functionality

Setting up interactive web pages can be done in many ways. Java Server Pages are just one of them. They are quite similar to PHP pages or Ruby, and combine plain HTML (and likely any other output) code with fragments of code; Java source code in this case.

Ubuntu's tomcat5.5 package installs quite easily, and sets up a server at port 8180. I still have to figure out how to nicely integrate it with the Apache server on port 80, though. Suggestions much appreciated.

From then on, one can add new JSP pages by creating a 'webapp' in /usr/share/tomcat5.5-webapps. The basic structure looks like:
Just copying the large CDK jar (the one with all the third party libraries) into WEB-INF/lib/ did not work for me, but unjarring it into WEB-INF/classes/ seems to work fine.

Then, you can just add Java code using the CDK library for whatever you like. The following (simple) example JSP page takes one parameter, a molecular formula. This could be the input given in a FORM, but the page below does not deal with that situation yet:
<%@ page import="java.util.*,org.openscience.cdk.*,org.openscience.cdk.tools.*" %>
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<%
  // the molecular formula is passed as the request parameter 'mf'
  String mf = request.getParameter( "mf" );
%>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="E.L. Willighagen">
<title>Metabolomics Examples</title>
</head>

<body bgcolor="#FFFFFF">
<table>
<tr>
<td>Molecular Formula:</td>
<td><%= mf %></td>
</tr>
<%
  // let the CDK calculate the mass, and round it to four decimals
  MFAnalyser analyser = new MFAnalyser(mf, new Molecule());
  double accurateMass = Math.round(analyser.getMass()*10000.0)/10000.0;
%>
<tr>
<td>Mono-isotopic Accurate Mass:</td>
<td><%= accurateMass %></td>
</tr>
</table>
</body>
</html>

Now, a lot of improvement can be achieved. For example, the <head> stuff can be split out in a header.include. And, after proper integration with the Apache server, rewrites could be used to create a REST service. But, the above is just to give you an idea.
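To make clear what the page calculates, here is a sketch of the same arithmetic in plain Java, without the CDK. It is a stand-in for illustration only, not how MFAnalyser is implemented: the class name, the regular expression, and the tiny table with the masses of the most abundant isotopes of H, C, N and O are my own assumptions, and only simple formulas like C6H12O6 are understood. The rounding to four decimals is the same as in the JSP page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MonoisotopicMass {

    // Masses of the most abundant isotope per element (illustrative subset).
    private static final Map<String, Double> MASS = new HashMap<>();
    static {
        MASS.put("H", 1.00782503);
        MASS.put("C", 12.0);
        MASS.put("N", 14.00307401);
        MASS.put("O", 15.99491462);
    }

    // An element symbol followed by an optional count, e.g. "C6" or "O".
    private static final Pattern TOKEN = Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** Sums the isotope masses for a formula and rounds to four decimals. */
    public static double mass(String formula) {
        Matcher m = TOKEN.matcher(formula);
        double total = 0.0;
        int pos = 0;
        while (pos < formula.length()) {
            // Every token must start exactly where the previous one ended.
            if (!m.find(pos) || m.start() != pos) {
                throw new IllegalArgumentException("Cannot parse formula at position " + pos);
            }
            Double atomMass = MASS.get(m.group(1));
            if (atomMass == null) {
                throw new IllegalArgumentException("No mass known for " + m.group(1));
            }
            int count = m.group(2).isEmpty() ? 1 : Integer.parseInt(m.group(2));
            total += atomMass * count;
            pos = m.end();
        }
        return Math.round(total * 10000.0) / 10000.0;
    }

    public static void main(String[] args) {
        System.out.println(mass("C6H12O6")); // prints 180.0634
    }
}
```

The CDK does the real work in the JSP page, of course, with a full isotope table behind it.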

In case you wonder, this work is related to the opensource MetWare database software development our group is involved in.

Tuesday, January 15, 2008

Be in my Advisory Board #2: JChemPaint development

No idea who the 22 persons are who were willing to join my advisory board, but they advised me to finish the JChemPaint work Niels worked on this summer:

Like my current main hobby project (atom typing in the CDK), the JChemPaint project will be performed mostly in my non-working hours. A reasonable ETA is, therefore, the end of this summer. The main discussion will take place on the cdk-jchempaint mailing list. Also note that the JChemPaint project at SourceForge is deprecated, because the code is now included in the CDK project.

Over the next weeks, I will post more questions in the poll of this blog regarding the JChemPaint development. So, watch that space.

Sunday, January 06, 2008

CDK Literature #4

Fourth in the CDK Literature series. Really, a follow-up on #3 which I wanted to get out, even though not really finished yet. But, after 3 comes 4, not 3b. Maybe 3.1, but that suggests at least 3.2-3.9 too, let alone full R (that was supposed to be the space of all reals...). I'll stick to positive non-zero integers. #1 and #2 are still available too.

Another thing I should remark is that this series does not provide full reviews of the cited papers. Instead, it provides a list of papers that cite one of the two CDK papers (doi:10.1021/ci025584y and doi:10.2174/138161206777585274). It is worth repeating that these two articles are not the only articles that describe CDK source code, and maybe I should start listing papers that cite other articles that discuss CDK source code. Anyway, the other papers discussing CDK source code are listed in an overview I maintain in the CDK wiki (now on SourceForge), but I have not updated it for #4 and #3 yet.

Organic Reaction Ontology
Punnaivanam Sankar and Gnanasekaran Aghila published a paper where they propose a knowledge framework for mechanisms of organic reactions, and used an XML framework combined with an ontology for the semantics. JChemPaint and the CDK are cited as opensource tools that support reactions.
Punnaivanam Sankar, Gnanasekaran Aghila, Ontology Aided Modeling of Organic Reaction Mechanisms with Flexible and Fragment Based XML Markup Procedures, J. Chem. Inf. Model., 2007, 47(5):1747 -1762, doi:10.1021/ci700043u

Ma et al. have published a (Chinese) QSAR paper where CDK descriptors have been used as molecular representation of a data set with 212 ligands for the P-glycoprotein. Models have been built with Random Forests, with classification success rates for the test set of around 85%.
Guang-Li Ma, Xiao-Ping Zhao, Yi-Yu Cheng, Identification of P-gp substrates using a random forest method based on chemistry development kit descriptors, Chemical J. of Chinese Universities-Chinese, 2007, 28(10):1885-1888

Chemical Databases
sMOL is GPL-licensed software for setting up a small molecule database. The software uses JChemPaint, OpenBabel, JOELib and the CDK for chemoinformatics functionality, and R and Weka for statistical analyses. I have not locally installed it yet, but the User Guide shows really nice screenshots. The Installer Guide shows a quite polished product too. Not sure how open the project is to contributions from others (patches, translations, etc.), but I will ask.
Supawadee Ingsriswang, Eakasit Pacharawongsakda, sMOL Explorer: an open source, web-enabled database and exploration tool for Small MOLecules datasets, Bioinformatics, 2007, 23(18):2498-2500, doi:10.1093/bioinformatics/btm363

Free Tools
Bruno Villoutreix wrote an overview of free (as in free beer) services to aid virtual screening. It cites the CDK, Jmol and OpenBabel as tools, along with a long list of free but proprietary tools. It does explicitly plead for opensource docking and scoring tools and, as such, is potentially useful in grant proposals.
Bruno Villoutreix, Nicolas Renault, David Lagorce, Olivier Sperandio, Matthieu Montes, Maria Miteva, Free Resources to Assist Structure-Based Virtual Ligand Screening Experiments, Current Protein and Peptide Science, 2007, 8(4):381-411, doi:10.2174/138920307781369391

Fangping Mu et al. have set up a KEGG-derived database with annotated reactions where atoms between reactants and products are mapped, to help data analysis of isotopomeromics data. The CDK rendering features are used for visualization purposes. The software also builds on BioMeta, work by Martin Ott, presented at last year's CDK Workshop.
Fangping Mu, Robert Williams, Clifford Unkefer, Pat Unkefer, James Faeder, William Hlavacek, Carbon-fate maps for metabolic reactions, Bioinformatics, 2007, 23(23):3193-3199, doi:10.1093/bioinformatics/btm498

I have two more papers lined up, but do not have access to Current Pharmaceutical Design.

Thursday, January 03, 2008

CDK Literature #3

Third in a series summarizing literature citing one of the two CDK articles. See also #1 and #2.

Two reviews have recently appeared which cite the CDK. Ricardo Stefani has written a review, in Portuguese, of the many NMR-based tools for computer-aided structure elucidation. The CDK is cited as a general chemoinformatics tool. It also cites SENECA, which uses the CDK's structure generators.
Ricardo Stefani, Paulo Nascimento, Fernando Da Costa, Computer-aided structure elucidation of organic compounds: Recent advances, Quimica Nova, 2007, 30(5):1347-1356, doi:10.1590/S0100-40422007000500048

Dimitris Agrafiotis has written an overview of the current state of chemoinformatics, and the CDK is cited as a tool to calculate molecular descriptors. (Jörg is co-author, and he blogged about this article too.)
Dimitris Agrafiotis, Deepak Bandyopadhyay, Jörg Wegner, Herman van Vlijmen, Recent advances in chemoinformatics, J. Chem. Inf. Model., 2007, 47(4):1279-1293, doi:10.1021/ci700059g

1H proton coupling prediction
I wrote up a separate blog item on this article, Janocchio: Jmol and CDK based 1H coupling constant prediction, written by David Evans at Eli Lilly.
David Evans, Michael Bodkin, Richard Baker, Gary Sharman, Janocchio - a Java applet for viewing 3D structures and calculating NMR couplings and NOEs, Magnetic Resonance in Chemistry, 2007, 45(7):595-600, doi:10.1002/mrc.2016

Quantitative-structure-activity-relationship (QSAR) modeling projects are finding their way to the CDK too. Dmitry Konovalov cites the CDK as a free source (as in gratis) for descriptor calculation and touches on the problem of reproducibility of descriptor calculations. Unfortunately, the paper does not discuss initiatives like the descriptor ontology discussed in the second CDK article, or the efforts discussed in the Blue Obelisk paper (doi:10.1021/ci050400b), such as the Blue Obelisk Data Repository, which aim to improve this reproducibility.
Dmitry Konovalov, Danny Coomans, Eric Deconinck, Yvan Vander Heyden, Benchmarking of QSAR models for blood-brain barrier permeation, J. Chem. Inf. Model., 2007, 47(4):1648-1656, doi:10.1021/ci700100f

SOAP webservices
Xiao Dong and the rest of the Indiana team have set up SOAP webservices, many of them wrapping CDK functionality, such as descriptor calculation, 2D similarity and fingerprint calculations, and 2D structure depiction. They also set up a service for toxTree, which itself uses the CDK too.
Xiao Dong, Kevin Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon Pierce, Geoffrey Fox, David Wild, Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007, 47(4):1303-1307, doi:10.1021/ci6004349

Wednesday, January 02, 2008

Open Lab 2007 results

The results for the Open Lab 2007 are out. I participated in this endeavor as a judge, and read 75 of the 486 blog items, focusing on the sections chemistry, blogging, publishing, politics of science, and a number of blog items with few reviews when I passed them.

I am happy to see that one of the chemistry submissions I made myself made it into the anthology: the Depth-First item on SMILES and Aromaticity: Broken?. Congratulations, Rich!

Collaborative work with Bioclipse

Ola blogged about something he is working on for Bioclipse2. The next major series of Bioclipse releases will use the RCP-based resource architecture, which allows better integration with other RCP plugins, such as the Subclipse plugin which allows one to browse Subversion repositories directly in Bioclipse. That is cool! Check out the screenshot he posted in his blog.

Now, this kind of integration is important. Subversion is a tool to collaboratively work on data, which can be open source (e.g. the Bioclipse source code), open data (e.g. the Blue Obelisk Data Repository), or any other kind. However, unlike tools like Google Docs, Bioclipse with Subversion support provides you with a rich client to process your data. No longer any need for putting SMILES into a spreadsheet, just put the full 3D structure or NMR spectrum in your joint resource set. This is much more suited for Open Notebook Science, right Jean-Claude? Just put in the raw data as it came out of the spectrometer, and let Bioclipse deal with data extraction. Oh, did you know that Bioclipse has Oscar3 integrated (which has not been updated to the latest release, though)?

Why bother with Wikis and Google Docs if you have Bioclipse? Why, even, bother with ICE?