Friday, October 31, 2008

Next generation asynchronous webservices

Johannes joined a Bioclipse Workshop a long time ago, and introduced the participants to the idea of using XMPP (aka Jabber) for asynchronous web services. SOAP is commonly used to run webservices over HTTP, but running it over email (SMTP) or XMPP is possible too (see SOAP over XMPP). Using HTTP as the transport layer has problems. The biggest is possibly that HTTP connections time out, e.g. at an intermediate router. This makes HTTP rather unsuited for long-running jobs. Workarounds are easy to come up with, and polling is a common solution.

Johannes' ideas solve this limitation by using the general XMPP protocol for chatting:
client
hey, can you do something for me?
service
sure, I can do generate3Dcoordinates and generateSMILES.
client
ah, nice! what input does generateSMILES take? and the output?
service
input: CML, output a simple string.
client
ok, here's the CML
service
I'm done now. sorry that it took 10 minutes, but I'm running Vista...
client
excellent, please send me the results
service
ok, here is the SMILES for lacosamide: CC(=O)N[C@H](COC)C(=O)NCC1=CC=CC=C1

Well, the important bit is in the last line. A job may take long, even on clusters. The client might have to reboot meanwhile (possibly because of critical security updates)... the service will just continue, and send you a message when done. If you just happen to be offline, it will send a message when you are back online.
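The "send me a message when done" pattern can be sketched in plain Java. This is only an analogy under stated assumptions: it uses CompletableFuture inside one JVM rather than XMPP, and the class name, the generateSmiles stand-in, and its fixed return value are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public class AsyncJobSketch {

    // Hypothetical stand-in for a long-running service call; the fixed
    // result only illustrates the pattern, nothing is computed here.
    static String generateSmiles(String cml) {
        try {
            TimeUnit.MILLISECONDS.sleep(200); // pretend this takes minutes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "CC(=O)N[C@H](COC)C(=O)NCC1=CC=CC=C1";
    }

    // Submit the job and return immediately; the future completes later.
    static CompletableFuture<String> submit(String cml) {
        return CompletableFuture.supplyAsync(() -> generateSmiles(cml));
    }

    public static void main(String[] args) {
        CompletableFuture<String> job = submit("<cml/>");
        // The client is free to do other work here instead of polling.
        CompletableFuture<Void> notified =
            job.thenAccept(smiles -> System.out.println("service: done, " + smiles));
        notified.join(); // only so that this demo does not exit early
    }
}
```

Unlike this in-process sketch, the XMPP service also survives a client reboot, because the server can deliver the message once the client is back online.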

Johannes' ideas led to the IO-DATA proposal (XEP-0244), which is currently marked experimental and being discussed on the ws-xmpp mailing list. He gathered a few people around him to get it going, resulting in working stuff! Yeah!

Chemistry Development Kit XWS
Besides contributing to the proposal, I am also involved in this project by writing XMPP-webservices, for the CDK. This brings me to cdk-xws, which is the project to bring CDK functionality online as webservices using IO-DATA.

This shows three nodes, the first being the CDK service, with two functions, of which I have implemented only one so far.

For the curious, this is what the XMPP messages look like:
<iq from="egonw@ws1.bmc.uu.se/home"
    id="JSO-0.12.5-6"
    to="cdk.ws1.bmc.uu.se"
    type="set">
  <command xmlns="http://jabber.org/protocol/commands"
           action="execute"
           node="calculateMass">
    <iodata xmlns="urn:xmpp:tmp:io-data"
            type="input">
      <in>
        <smiles xmlns="urn:xws:cdk:input">CCC</smiles>
      </in>
    </iodata>
  </command>
</iq>
<iq from="cdk.ws1.bmc.uu.se"
    id="JSO-0.12.5-6"
    to="egonw@ws1.bmc.uu.se/home"
    type="result"
    xml:lang="en">
  <command xmlns="http://jabber.org/protocol/commands"
           node="calculateMass"
           sessionid="XWS-1"
           status="completed">
    <iodata xmlns="urn:xmpp:tmp:io-data"
            type="output">
      <out>
        <mass>36.032207690364004</mass>
      </out>
    </iodata>
    <note type="info">Done</note>
  </command>
</iq>
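
For readers who want to pick such a reply apart on the client side, here is a minimal sketch using only the XML parser that ships with Java. It assumes the result stanza has already been received as a string; the class and method names are made up for this illustration and are not part of cdk-xws or any XMPP library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IoDataResultSketch {

    static final String COMMANDS_NS = "http://jabber.org/protocol/commands";
    static final String IODATA_NS = "urn:xmpp:tmp:io-data";

    // The result stanza from the post, flattened into one string.
    static final String SAMPLE =
          "<iq from=\"cdk.ws1.bmc.uu.se\" id=\"JSO-0.12.5-6\""
        + " to=\"egonw@ws1.bmc.uu.se/home\" type=\"result\" xml:lang=\"en\">"
        + "<command xmlns=\"http://jabber.org/protocol/commands\""
        + " node=\"calculateMass\" sessionid=\"XWS-1\" status=\"completed\">"
        + "<iodata xmlns=\"urn:xmpp:tmp:io-data\" type=\"output\">"
        + "<out><mass>36.032207690364004</mass></out>"
        + "</iodata><note type=\"info\">Done</note></command></iq>";

    // Returns the <mass> value of a completed command result.
    static double extractMass(String stanza) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the NS-based lookups
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(stanza)));

            Element command =
                (Element) doc.getElementsByTagNameNS(COMMANDS_NS, "command").item(0);
            if (command == null || !"completed".equals(command.getAttribute("status"))) {
                throw new IllegalStateException("command not completed");
            }
            // <mass> carries no xmlns of its own, so it inherits the io-data namespace.
            NodeList masses = doc.getElementsByTagNameNS(IODATA_NS, "mass");
            if (masses.getLength() == 0) {
                throw new IllegalStateException("no <mass> element in the output");
            }
            return Double.parseDouble(masses.item(0).getTextContent().trim());
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse stanza", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractMass(SAMPLE));
    }
}
```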

Embedding Gists in blogs

Mark pointed me to the embed functionality of Gist, a product of GitHub, where I host some todo software and a git mirror of CDK 1.2.x.

So, the other day, when I blogged about Bioclipse2 scripts, I should have embedded the script like this:

Saturday, October 25, 2008

Bioclipse2 Scripting #1: from SMILES to a UFF optimized structure in Jmol

After some difficulties this week with exporting the CDK plugins in the Bioclipse2 Cheminformatics feature with the cdk-eclipse software, I got the following cute Bioclipse2 script up and running:
dimethylether = cdk.fromSMILES( "COC" );
cdk.addExplicitHydrogens( dimethylether );
cdk.generate3dCoordinates( dimethylether );

// save as CML
cdk.saveCML( dimethylether, "/Virtual/dimethylether.cml" );
ui.open( "/Virtual/dimethylether.cml" ); // this should open a JmolEditor

jmol.minimize();
You can see four of my favorite cheminformatics tools integrated: CDK is used to convert a SMILES into a connection table, to add explicit hydrogens, and to create initial 3D coordinates (with the code from Christian Hoppe, and thanx to Stefan for fixing that code in the CDK 1.1.x branch!). Then, CMLDOM is used to create and save a CML document, which is then opened in a Jmol editor in Bioclipse.

A variation of this script is visible in the following screenshot:

This and other Bioclipse2 scripts I will post in Gist, a sort of pastebin supporting version history, and I'll tag them with bioclipse gist on delicious, so that you can always browse them, comment on them, or add your own gists at http://delicious.com/tag/bioclipse+gist.

Friday, October 24, 2008

Git-Eclipse integration

Recently, I have been blogging about Git: One concern expressed by people was the lack of integration with IDEs. Now, an Eclipse plugin seems well on its way:

With an experimental update site (http://www.jgit.org/update-site), the plugin is just an Eclipse reboot away.

Now, the plugin is still in its early stages and has many open feature requests, but fortunately it is still actively developed, and its bug tracker can easily be integrated with Mylyn.

Cheers to Shawn and Robin for their work!

Monday, October 20, 2008

Bugzilla Eclipse IDE integration: Mylyn

A new environment means new tools. Bioclipse is Eclipse RCP-based, so colleagues work with Eclipse and are much more into Eclipse too. For example, into Mylyn. Mylyn is a tool to track tasks and assign context to them. The tasks I am interested in (for this blog item) are bug fixes. Mylyn is rather well suited for this, as it allows linking Java source files to bug reports. With a growing list of projects in my navigator, browsing them becomes difficult because the list is way too long. Mylyn allows me to show only those source files which are actually related to the bug I am fixing. Cool!

However, SourceForge, our bug tracker, integrates with Mylyn, but with too limited functionality. Bugzilla, though, has excellent integration. Curious about what that would look like, I installed Bugzilla on an Ubuntu system. Which failed. Due to a bug known for two years already! Anyway, two tweaks to the system got it working:
  1. Work around the password in the postinstall script (see here)
  2. Set up a /bugs/ link (see here)
This is Bugzilla as viewed in Mylyn:

(The bug content is derived from Ubuntu bug #1.)

GitToDo support for Freemind: graphical mapping of important things on my schedule

About a week ago, I hooked up my GitToDo software with Freemind. This allows me to organize the projects I am working on, without having to code this in GitToDo directly. I also immediately take advantage of visualization, for example, adding an icon for projects with one or more TODO items marked TODAY or URGENT:

Keeping my GitToDo repository synchronized is as easy as typing:
gtd-freemind-update
gtd-freemind-show

Chemical Editing...

As you might have seen, we, Uppsala and the EBI, are working on the next generation JChemPaint. JChemPaint is an editor, and therefore consists of a model (IChemModel), a view (IRenderer) and a controller (IController). See the many posts in Gilleain's blog.

For the renderer I have set up a wiki page, which I'll be hacking on in the next days, and which shows how IChemObject content should be rendered in JChemPaint. It looks like:



The IController is a rather important part too, and like the IRenderer bit of JChemPaint, needs a major overhaul. The new design, discussed by Gilleain here and here, should, IMHO, look like:

In this diagram, the gestures can come from any input device (mouse, trackball, Wiimote), and will result in events in some widget library (SWT and AWT shown). The old JChemPaint converted the Swing MouseEvents directly into IChemObject modifications, making the code incompatible with SWT. This is why the Chemical Editing Events layer must be added.

Events in this layer look like addAtom(attachementAtom, coordinates) and setFormalCharge(atom, newCharge). The link to scripting should be clear now, and will help us write unit tests for this layer.
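
To make the layering a bit more concrete, here is a rough sketch of what such a toolkit-neutral layer could look like. It is not the actual JChemPaint design: atoms are plain integer indices instead of IAtom objects, and every class and interface name below is invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class EditingEventsSketch {

    // Toolkit-neutral editing operations, named after the events in the text.
    // Atoms are plain indices here; -1 means "no attachment atom".
    interface ChemicalEditingEvents {
        int addAtom(int attachmentAtom, double x, double y);
        void setFormalCharge(int atom, int newCharge);
    }

    // A tiny in-memory model that records what the events did.
    static class RecordingModel implements ChemicalEditingEvents {
        final List<double[]> coordinates = new ArrayList<>();
        final List<Integer> charges = new ArrayList<>();
        final List<int[]> bonds = new ArrayList<>();

        public int addAtom(int attachmentAtom, double x, double y) {
            coordinates.add(new double[] {x, y});
            charges.add(0);
            int index = coordinates.size() - 1;
            if (attachmentAtom >= 0) {
                bonds.add(new int[] {attachmentAtom, index});
            }
            return index;
        }

        public void setFormalCharge(int atom, int newCharge) {
            charges.set(atom, newCharge);
        }
    }

    // An SWT or AWT listener would only extract the coordinates from its own
    // event type and call click(); no widget classes leak into the model.
    static class ClickAdapter {
        private final ChemicalEditingEvents events;
        private int lastAtom = -1;

        ClickAdapter(ChemicalEditingEvents events) {
            this.events = events;
        }

        void click(double x, double y) {
            lastAtom = events.addAtom(lastAtom, x, y);
        }
    }

    public static void main(String[] args) {
        RecordingModel model = new RecordingModel();
        ClickAdapter adapter = new ClickAdapter(model);
        adapter.click(0.0, 0.0);
        adapter.click(1.5, 0.0);
        model.setFormalCharge(1, -1);
        System.out.println(model.coordinates.size() + " atoms, "
            + model.bonds.size() + " bond(s)");
    }
}
```

Because the model only sees addAtom and setFormalCharge calls, a script or a unit test can drive it exactly like a mouse would, without any widget toolkit on the classpath.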

Chem-bla-ics turns 3!

Five days ago, my chem-bla-ics blog turned 3. Here's the first post. It defined:
    chemblaics is the application of open source software in cheminformatics, chemometrics, proteochemometrics, etc, making experimental results reproducable and validatable.
Much has changed in the field since that post, for the better of the chemical sciences.

Saturday, October 18, 2008

Chemoinformatics p0wned by cheminformatics... #2

Some time ago Noel ran a poll on chemoinformatics and cheminformatics, so I set up a poll too in part #1 of this series. The outcome is clear:



The Obernai meeting strongly suggested chemoinformatics [1], but the start of the open access Journal of Cheminformatics is the killer. I can no longer resist: I'll follow the wish from my advisory board, and the general trend around the world (except India).

The journal's editor-in-chief is David Wild, while Christoph Steinbeck seems to be going to lead the European branch. People seem to like the idea. The journal will clearly be in direct competition for market share with the JCIM, QSAR & Combinatorial Science, and even the open access Chemistry Central Journal. Interesting to see where this is going...

Tuesday, October 07, 2008

Jmol 11.6 RC 18 in Bioclipse

Just updated Bioclipse2 with Jmol 11.6 RC 18:

Now working in Uppsala makes Bioclipse my default life sciences platform, and I'll be porting older Bioclipse1 plugins to Bioclipse2, which has a much better architecture.

Bioclipse2 does not have a native Jmol Console, but script commands can easily be run with jmol.run() (written by Jonathan). I wonder if it would be hard to have a JmolScript view like this JavaScript Console... The outline on the right (written by Ola) allows me to navigate the Jmol data model.

Monday, October 06, 2008

pKa prediction, or, how to convert a JCIM paper into Java

Lee et al. published a paper on pKa prediction last week (doi:10.1021/ci8001815). As the paper says, the pKa, and in particular the ionic state of a molecule at physiological pH, affects pharmacokinetics and pharmacodynamics. The paper describes a (binary) decision tree using presence or absence of SMARTS substructures to traverse the tree, allowing pKa prediction for monoprotic molecules.

Now, the paper's Supplementary Information contains the full model. I'd rather rebuild the model, but the full training set does not seem to be available. Still, the paper's model shows predictive power comparable to that of commercial models, so I'd say it would be a welcome addition to the CDK.

And as the CDK already has a SMARTS parser, adding this model should be easy enough. So, here goes :) First, let us outline the API:
/* $Revision$ $Author$ $Date$
*
* Copyright (C) 2008 Egon Willighagen <egonw@users.sf.net>
*
* Contact: cdk-devel@list.sourceforge.net
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
* as published by the Free Software Foundation; either version 2.1
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*/
package org.openscience.cdk.charges.pka;

/**
* Tool to predict a molecule's pKa. The class implements
* the algorithm published by Lee et al. {@cdk.cite Lee2008}
* which is based on a SMARTS-based decision tree, trained
* with 1693 monoprotic compounds.
*
* @cdk.module extra
*/
public class PkaPredictor {

    /**
     * Predicts the pKa value of a molecule.
     *
     * @param container IMolecule to predict the pKa for
     * @return The predicted pKa
     *
     * @throws CDKException upon failure of the prediction algorithm
     */
    public static float predict(IMolecule container) throws CDKException {}

}
The first line is picked up by SVN, which will fill in the revision number, the last committer, and the date of the last commit. The third line is important: it indicates who holds the copyright, and thus who needs to be asked for permission if the license ever needs to be changed. If people provide patches to the code, they are added to this list. The rest of the source file header includes a general contact email address, and the LGPL v2.1 license the CDK uses. The package declaration puts it in the cdk.charges.pka package, which seemed appropriate.

The class JavaDoc contains two CDK-specific tags. The tag {@cdk.cite Lee2008} is used to point to the literature reference database in doc/refs/cheminf.bibx. When the HTML JavaDoc is compiled, the full reference gets included in the HTML. The other tag, @cdk.module, is used by the CDK build system to determine in which CDK module the class should end up; extra in this case. The method's JavaDoc is pretty standard.

Next, we need some logic to traverse the decision tree and look up the predicted pKa, and I implemented this as:
public static float predict(IMolecule container) throws CDKException {
    if (node1 == null) initalize();

    DecisionTreeNode node = node1;
    // traverse down the tree until we end up in a leaf
    while (!node.isTerminal()) {
        node = node.decide(container);
    }
    return node.getValue();
}
The root node of the tree is called node1, and I explain its initialization later. Then, the code traverses the tree by asking each node to decide whether the SMARTS substructure is present or not. Each decision returns a new DecisionTreeNode matching the presence or absence. At some point, a terminal node is reached, and we can ask this node for the associated prediction value.

The Java version of the Decision Tree

The paper's supplementary information contains the tree encoded like this:
1,0,,,5.9131093
2,1,[#G6H]C(=O),1,3.6849957
3,1,[#G6H]C(=O),0,7.206913
That is, each line lists the node identifier, the parent identifier, the SMARTS query, presence (1) or absence (0), and the node value. Actually, a bit more, but these are the important bits for now. The first line is the root node node1, and the second and third line the two children of the root node. If the [#G6H]C(=O) substructure is present, then node2 applies, and the predicted value would be 3.6849957; if the substructure is absent, then node3 applies, and pKa 7.206913.

Now, these nodes and their interdependencies are encoded in the initialization method as:
private static void initalize() throws CDKException {
    node1 = new DecisionTreeNode(5.9131093f, 17.32f);
    DecisionTreeNode node2 = new DecisionTreeNode(3.6849957f, 5.9569998f);
    DecisionTreeNode node3 = new DecisionTreeNode(7.206913f, 17.32f);
    node1.setChildNodes("[#G6H]C(=O)", node2, node3);
}
The second argument in the DecisionTreeNode constructor is the value range for the node, and is an indication of the variance of the prediction value.
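
The DecisionTreeNode class itself is not listed in this post, so here is a minimal, self-contained sketch of how such a node could work. It is an assumption, not the CDK code: only the method names used above are kept, and instead of running a SMARTS match against an IMolecule, a molecule is represented by the set of query strings already known to match it.

```java
import java.util.Collections;
import java.util.Set;

public class DecisionTreeSketch {

    // Hypothetical node class; only the method names come from the post.
    static class DecisionTreeNode {
        private final float value;  // predicted pKa at this node
        private final float range;  // value range, an indication of the variance
        private String smarts;      // query tested at this node, null for a leaf
        private DecisionTreeNode ifPresent;
        private DecisionTreeNode ifAbsent;

        DecisionTreeNode(float value, float range) {
            this.value = value;
            this.range = range;
        }

        void setChildNodes(String smarts, DecisionTreeNode ifPresent,
                           DecisionTreeNode ifAbsent) {
            this.smarts = smarts;
            this.ifPresent = ifPresent;
            this.ifAbsent = ifAbsent;
        }

        boolean isTerminal() {
            return smarts == null;
        }

        // Stand-in for SMARTS matching: the "molecule" is the set of
        // queries that match it.
        DecisionTreeNode decide(Set<String> matchingQueries) {
            return matchingQueries.contains(smarts) ? ifPresent : ifAbsent;
        }

        float getValue() { return value; }
        float getRange() { return range; }
    }

    // The three nodes shown in the post.
    static DecisionTreeNode exampleTree() {
        DecisionTreeNode node1 = new DecisionTreeNode(5.9131093f, 17.32f);
        DecisionTreeNode node2 = new DecisionTreeNode(3.6849957f, 5.9569998f);
        DecisionTreeNode node3 = new DecisionTreeNode(7.206913f, 17.32f);
        node1.setChildNodes("[#G6H]C(=O)", node2, node3);
        return node1;
    }

    static float predict(DecisionTreeNode root, Set<String> matchingQueries) {
        DecisionTreeNode node = root;
        while (!node.isTerminal()) {
            node = node.decide(matchingQueries);
        }
        return node.getValue();
    }

    public static void main(String[] args) {
        DecisionTreeNode root = exampleTree();
        System.out.println(predict(root, Collections.singleton("[#G6H]C(=O)")));
        System.out.println(predict(root, Collections.<String>emptySet()));
    }
}
```

With only these three nodes, a molecule matching the query ends up in node2 and any other molecule in node3, which is exactly the behavior described for the supplementary information lines above.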

A simple Perl script can convert the file from the supplementary information into Java source code. With more than 1500 nodes in the tree, this beats manual hacking up of the tree.
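
The Perl script is not shown here, but the conversion is simple enough to sketch in Java too. The sketch below assumes the five comma-separated columns described above and a SMARTS query without commas; it emits a hypothetical one-argument constructor, because the column holding the value range is not part of the excerpt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TreeToJavaSketch {

    // Turns "id,parent,smarts,presence,value" lines into Java statements.
    static List<String> generate(List<String> lines) {
        List<String> out = new ArrayList<>();
        // parent id -> {smarts, child if present, child if absent}
        Map<String, String[]> splits = new LinkedHashMap<>();

        for (String line : lines) {
            String[] f = line.split(",", -1); // -1 keeps the empty root columns
            String id = f[0], parent = f[1], smarts = f[2], presence = f[3], value = f[4];
            out.add("DecisionTreeNode node" + id
                + " = new DecisionTreeNode(" + value + "f);");
            if (smarts.isEmpty()) {
                continue; // the root node has no parent split
            }
            String[] split = splits.computeIfAbsent(parent, k -> new String[3]);
            split[0] = smarts;
            split["1".equals(presence) ? 1 : 2] = "node" + id;
        }

        for (Map.Entry<String, String[]> e : splits.entrySet()) {
            String[] s = e.getValue();
            // escape characters that are special inside a Java string literal
            String literal = s[0].replace("\\", "\\\\").replace("\"", "\\\"");
            out.add("node" + e.getKey() + ".setChildNodes(\"" + literal + "\", "
                + s[1] + ", " + s[2] + ");");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1,0,,,5.9131093",
            "2,1,[#G6H]C(=O),1,3.6849957",
            "3,1,[#G6H]C(=O),0,7.206913");
        for (String statement : generate(lines)) {
            System.out.println(statement);
        }
    }
}
```

A real converter would need a proper CSV reader as soon as a query contains a comma, as in an atom list like [N,O].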

The JUnit4 test

The unit test now looks like:
package org.openscience.cdk.charges.pka;

/**
* Unit test to test the functionality of the {@link PkaPredictor}.
*
* @author egonw
* @cdk.module test-extra
*/
public class PkaPredictorTest extends NewCDKTestCase {

    @Test public void testThrowsNoException() throws Exception {
        IMolecule methane = NoNotificationChemObjectBuilder.getInstance().newMolecule();
        IAtom carbon = methane.getBuilder().newAtom(Elements.CARBON);
        methane.addAtom(carbon);

        float result = PkaPredictor.predict(methane);
        // the actual value depends on the number of nodes I actually added,
        // but I *do* know the min and max without having to have all nodes
        // implemented
        Assert.assertTrue(result < 15.526);
        Assert.assertTrue(result > -0.6659999);
    }
}
Note that I cannot assert the real prediction value until the full decision tree has been implemented in the class, but I do know the full range and thus test for that. You may have noted that several methods throw CDKExceptions, which would be caused by SMARTS expressions the CDK cannot handle...

SMARTS problems...

Now, the SMARTS used in the supplementary information indeed do not work with the CDK SMARTS engine; the paper indicates that they used MOE which extends the original Daylight SMARTS. So, if you ever wondered about the forking risk of Open Standards...

So far, I have identified these three patterns used in the paper's model, but not parsable by the CDK engine:
  1. [i] an sp2 hybridized carbon (aromatic or delocalized)
  2. [#G6] matches carbon and sulfur, so seems to indicate a group in the periodic table
  3. [#X] no idea... (no internet at home yet, so cannot Google either)
The #G syntax can be rewritten in an OR form, and possibly the others too. However, I'd rather see the CDK SMARTS engine support these industry-adopted extensions.

Conclusion

The CDK shows its power as development kit, and allowed me to hack up the code of the paper on a casual Saturday evening (sitting on the couch next to a fire in our kacheloven with a glass of beer). Writing up this blog was done the next day.

Once the missing SMARTS patterns have been added to the CDK (or proper replacements have been defined), I'll compare the test set results of the paper with the CDK implementation. I will probably also convert the test set results from the supplementary information into unit tests (the SI contains SMILES, experimental and predicted values).

Thursday, October 02, 2008

JChemPaint history: CML patches in 1999

There was some talk about the history of chemoinformatics toolkits by Noel and Andrew, which made me wonder about the exact history of Jmol and JChemPaint. Below is the email Christoph dug up from his archives:
X-Mozilla-Status: 1011                                       
X-Mozilla-Status2: 00000000
Message-ID: <372ECD5E.53A49584@ice.mpg.de>
Date: Tue, 04 May 1999 12:35:10 +0200
From: Christoph Steinbeck
Reply-To: steinbeck@ice.mpg.de
Organization: Max-Planck-Institute of Chemical Ecology
X-Mailer: Mozilla 4.51 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Egon Willighagen
Subject: Re: Participating in JChemPaint
References: <000701be9613$34cf52e0$8e74ae83@catv6142.extern.kun.nl>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit



> Egon Willighagen wrote:
>
> Dear Christoph Steinbeck,
>
> Yesterday I visited your site on JChemPaint. I like to contribute some
> of my expertise on
> Java and CML (1).
>
> CML is a markup language that is able to contain chemical information.
> It can contain for example physical properties, for which I use CML in
> my Dictionary on Organic Chemistry (2).
> But is also might contain spectra, bibliographic references etc. And
> of course 2D and 3D
> structural information.
>
> Therefore I propose to write both CML-input and -output procedures for
> the JChemPaint project.
>
> I hope to hear from you soon.
>
> Yours sincerely,
>
> Egon Willighagen
>
> 1. http://www.xml-cml.org/
> 2. http://www.sci.kun.nl/sigma/Chemisch/Woordenboek/

Dear Egon,

thanks very much for your mail and your offer to write CML-input and
output routines for JChemPaint.
That really sounds great to me and I will give you access to our CVS
tree as soon as we have discussed the details.

Cheers,

Chris

--C. S.
Dr. Christoph Steinbeck (http://www.ice.mpg.de/~stein)
MPI of Chemical Ecology, Tatzendpromenade 1a, 07745 Jena, Germany
Tel: +49(0)3641 643644 - MoPho: +49(0)177 8236510 - Fax: +49(0)3641
643665

What is man but that lofty spirit - that sense of enterprise.
.. Kirk, "I, Mudd," stardate 4513.3..

Now, my email must have been triggered by the announcement of JChemPaint on FreshMeat.net, which is the oldest public record of JChemPaint I have found so far:

Who likes my FriendFeed posts most...

Felix has a small tool on his website to show me (or anyone else) who likes what I post on my FriendFeed account:

Which actually is Deepak...

Wednesday, October 01, 2008

Cherry-picking commits from CDK trunk: how to make a reasonable commit message

Some of you heard me complain about commit messages resulting from git cherry-pick, which allows me to apply patches from CDK trunk to a branch without needing to do a full merge of what happens in trunk. The commit messages would be identical to the originals, which made it seem that the original commits were mine.

However, this is how I can modify those messages:
    $ git commit --amend
This allows me to convert a mere "refactored a method" into "Applied patch from trunk (rev 12479): [shk3] refactored a method".