Tuesday, November 28, 2006

Code coverage: making sure your code is tested

Recently I discussed JUnit testing from within Eclipse, and I have blogged about it on several other occasions. I cannot stress enough how useful unit testing is: it adds that extra set of eyeballs that makes bugs shallow. And it does that, indeed.

Ensuring that you actually test all the code you write, however, is not easy. A couple of years back I read an article about Hansel, which does code coverage checking, but I never got it working nicely for the CDK project. I have not looked at it lately, so I have no idea how the current release would work out. Hansel is an extension of JUnit and requires hard-coding class names, which conflicts with CDK's module setup.

Last week Thomas Kuhn pointed me to Emma, which seems a nice tool. It does not require hacking our source, and generates cool HTML:

And even highlights the source code:

BTW, I seem to be in good company: Classpath is using it too.

Below is the command I issued to generate the HTML output. Rajarshi, maybe this can be integrated into Nightly? Note that it only runs the tests for the data module:

ant dist-large dist-test-large
java -cp ~/tmp/emma-2.0.5312/lib/emma.jar emmarun -cp develjar/junit.jar:dist/jar/cdk-svn-20061128.jar:dist/jar/cdk-test-svn-20061128.jar -r html -sp src junit.textui.TestRunner org.openscience.cdk.test.MdataTest
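To make concrete what a line-coverage report tells you, here is a minimal sketch. It is written in Python rather than Java and it is not how Emma works (Emma instruments Java bytecode); it only illustrates the idea of recording which lines a test actually executes. The names sign_label and lines_executed are hypothetical, made up for this illustration.

```python
import sys

def sign_label(x):
    if x < 0:
        return "negative"      # offset 2 from the def line
    return "non-negative"      # offset 3 from the def line

def lines_executed(func, *args):
    """Run func(*args) and return the executed line offsets, relative to its def line."""
    first = func.__code__.co_firstlineno
    hit = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is func.__code__:
            if event == "line":
                hit.add(frame.f_lineno - first)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit

# A "test suite" that only ever passes a positive number never reaches offset 2.
print(sorted(lines_executed(sign_label, 5)))
# Adding a negative input covers the remaining branch.
print(sorted(lines_executed(sign_label, 5) | lines_executed(sign_label, -1)))
```

A coverage report is essentially this set of executed lines, collected over the whole test suite and rendered next to the source, so that the untested branch stands out.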

Tuesday, November 14, 2006

German Conference on Chemoinformatics 2006: Day 3

Just some quick notes about the third day (see day 1 and 2). Today's program of the German Conference on Chemoinformatics started with a presentation by Rzepa about his work on a semantic wiki (DOI:10.1021/ci060139e), which might be online here. (He recorded a podcast, but I have not seen it online yet.) I wish I could see the sources of those wiki pages, to see how that system integrates RDF, but at least Jmol is running fine. The presentation by Couch showed the status of the Materials Grid project, and how a guy called AgentX does all the hard work. Ihlenfeldt updated us on the status of PubChem, mostly on what they had to do to keep the system from dying from its own success, for example by using something called minimol. Googling does not seem to help, as that points to a number of things, but not to any PubChem webpage. I am still waiting for a European organization to set up a mirror.

After the coffee break, Kuhn showed a coarse-grained force field, approximating molecules by chopping them up into fragments of 3-10 heavy atoms. I guess, a bit like some small-molecule force fields do for methyls. Fragments within a molecule are tied together by springs, and intra- and intermolecular force field parameters are obtained by running MD runs on fragment pairs. Varnek argued that QSPR for melting point prediction has reached a fundamental limit, with an RMSE of around 30 to 40 degrees Celsius, which makes it quite unreasonable to decide whether a compound with a predicted melting point of 40 degrees is solid or liquid at room temperature.
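Varnek's point can be put in numbers with a back-of-the-envelope calculation. The sketch below rests on assumptions made here for illustration, not on anything from the talk: prediction errors are unbiased and normally distributed with a standard deviation equal to the RMSE, and room temperature is taken as 25 degrees Celsius. The function name prob_below is hypothetical.

```python
import math

def prob_below(threshold_c, predicted_c, rmse_c):
    """P(true value < threshold), assuming errors ~ Normal(0, rmse_c)."""
    z = (threshold_c - predicted_c) / rmse_c
    # Standard normal cumulative distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

ROOM_TEMPERATURE_C = 25.0   # assumption
PREDICTED_MP_C = 40.0       # the example from the text

for rmse in (30.0, 40.0):
    p = prob_below(ROOM_TEMPERATURE_C, PREDICTED_MP_C, rmse)
    print(f"RMSE {rmse:.0f} C: P(melting point < 25 C) = {p:.2f}")
```

Under these assumptions the probability that the compound actually melts below room temperature comes out at roughly 0.31 to 0.35, so the "solid" call would be wrong about one time in three.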

You have to forgive me for not reporting on the afternoon session; I was tied up at our booth, talking with people about the CDK, Taverna, Bioclipse, Jmol, other opensource chemoinformatics tools, and chemoinformatics in general. Very nice, but exhausting. I might advise the organization to set up a blog aggregator next year, though I am not sure whether there are others blogging about this conference.

Monday, November 13, 2006

German Conference on Chemoinformatics 2006: Day 1 and 2

The 2nd German Conference on Chemoinformatics started yesterday, with two chemoinformatics tutorials: one on industrial chemoinformatics (I saw this presentation before... not sure when), with a good overview of integrating different information sources; the second one was about opensource chemoinformatics by Christoph Steinbeck (who has been involved in opensource chemoinformatics for almost 10 years now!), which included a Bioclipse demo (by me) and a demo by Thomas Kuhn of the CDK-based chemoinformatics plugin for Taverna. Other opensource projects of the Blue Obelisk movement were mentioned, and a few outside it too.

The conference is in honor of the life's work of Prof. Gasteiger, who gave an overview of chemoinformatics in his group, Germany and Europe. He stressed the need for education in chemoinformatics, like in Obernai. He also highlighted that we, today, are still solving the same problem as 30 years ago. Which is true, which is why this channel is called Chem-bla-ics, trying to solve that problem. When asked if opensource chemoinformatics from the start would have addressed this, he replied that he requires people to cooperatively do research with his group; opensource clearly cannot enforce that.

Day 2

Today's program had a number of interesting presentations. (I, unfortunately, missed the first presentation, so I have to visit that group soon to make up for that.) Prof. Aires-de-Sousa showed his work on MOLMAP for mapping metabolic networks (KEGG really, see my earlier blog), and showed, just as proof of principle, classification of organisms based on this.

J. Weisser talked about docking, still an obligatory topic. This work really showed two new approaches: the first was the use of QM partial charges (the example showed an improvement in RMSD by a factor of 10, not very statistical, but promising indeed); the second was the fact that water does not like to be in tight spots, because of reduced possibilities for hydrogen bonding. A concept common in understanding supramolecular phenomena, but I have not seen this applied to docking before. But I am no expert in that field. M. Wagner showed work on using KEGG data to estimate likely metabolites, and its use in reducing the effects of metabolic degradation. T. Schroeter introduced me to Gaussian processes, a new data modeling method. Quite embarrassing to need such an introduction, being specialized in modeling methods for chemical problems.

The poster session was, as usual, really exhausting, talking to a lot of people. Having a booth at the exhibition on opensource chemoinformatics added a nice twist to this. I therefore skipped the FIZ-award winner lectures, so I hope someone else will blog about those.

One last note: Sun started releasing their Java platform under the GPL license. Jim, it seems that they proved me wrong. The class library is still not GPL, but is expected to be licensed as such sometime in the first half of next year.

Sunday, November 12, 2006

Organic chemists can now tune properties without changing the molecular structure??

Paul Bracher and Joshua Finkelstein drew my attention to a nice discussion in Nature on the future of chemistry, in What Chemists Want to Know, by Philip Ball. Paul and Joshua already reviewed it thoroughly, but I could not resist commenting on it too. Having chosen chemistry as specialization when I went to university, and with a minor in supramolecular chemistry, this is something I do relate to.

A main theme is whether chemistry is unexplored enough to justify further academic research and education. Ball's answer is yes, and he came up with six questions, of which I found this one most intriguing: what is the chemical basis of thought and memory? But the article interestingly also discusses whether chemistry has not become a tool for more interesting fields of research. The Nobel prize winners Ball interviewed do not think so.

One quote took me by surprise: "Where is synthetic astronomy - changing the gravitational constant to see what effect that has on the properties of the Universe, and thus perhaps improving it?" Well, I might have been out of synthetic organic chemistry for too long now, but this is not a quote I would like to be in Nature with. Is synthetic chemistry now able, then, to modify the nature and strengths of bonds?? Can they actually change molecular properties without changing the connectivity?? Moreover, astronomers have changed the properties of objects in our universe: for years they have been reducing the mass of the Earth by sending off probes to other objects (satellites etc.). Likewise, chemistry is not changing nature; it is just exploring all the compounds we never had purified in our glassware yet. Synthesis is nothing like changing nature.

There is one other comment I would like to post here. I strongly agree that chemistry in itself is important to have as a separate educational and research topic at universities. Simply because too many databases are, from a chemical point of view, messed up. For example, KEGG and the PDB are known to have many chemical errors, though these databases are rather important indeed. We need people around to educate people and point out those errors, if the life sciences themselves are to have a future.

Tuesday, November 07, 2006

When is open source chemoinformatics successful?

Open source chemoinformatics has become a common phenomenon, though many projects are small in nature: source code is developed by only a few developers, or even in a closed manner and released when considered done. Within open source software there is room for distinguishing a subset of open development chemoinformatics, that is, Bazaar-like, instead of Cathedral-like (see ESR's famous essay).

The importance of an open source project can be gauged by many measures, such as the number of people on the user and developer mailing lists, number of downloads, number of source lines of code [wp:SLOC], number of independent development locations, and rankings on, for example, SourceForge or Google. Just to name a few.

The scientific importance of an open source project can sometimes be measured by a citation index; that is, only when there is a landmark article for the project. Rasmol is such a project: a first article was published in 1995 (DOI:10.1016/S0968-0004(00)89080-5), and a follow-up in 2000 (DOI:10.1016/S0968-0004(00)01606-6). The first was cited 1190 times, and the second 65 times (as stated on Web-of-Science). Quite successful indeed.

OK, it is not even 100+, but I am quite happy with the scientific impact of the CDK so far: the 2003 CDK article (DOI:10.1021/ci025584y) has been cited 24 times now, and the just published 2006 article (DOI:10.2174/138161206777585274) once.

Friday, November 03, 2006

Chemical Blogspace updates

Chemical Blogspace has been up and running fine for some time now. Since the start, the number of aggregated blogs has increased from 19 to 64, a number of which are hosted at ChemBlogs, a site where you can run a blog. Meanwhile, the number of cited papers went up to 186! The JACS is most popular so far, followed by the Angewandte Chemie Int. Ed.

As mentioned before, the software was taken from another project, which has been upgraded considerably and has seen new releases since the author moved to Nature, but I have not found time to follow that upgrade yet :( The promised InChI support is still pending too.

Bioclipse Workshop: short but productive

The Bioclipse Workshop has ended and, for just three days, turned out quite productive. We have first bits of scripting support for JavaScript using Rhino. At this moment the scripting plugin needs to explicitly depend on plugins to be able to access their classpath, but we plan to solve that. An example script:
// to have short identifiers
Array =;
String =;
msgBox =;
DbfetchServiceServiceLocator =;

// get data
service = new DbfetchServiceServiceLocator();
strarray = service.getUrnDbfetch().fetchData("refseq:NM_210721", "refseq", "raw");

// make readable
str = new String();
for (i = 0; i < Array.getLength(strarray); i++) {
  if (i != 0)
    str = str + ("\n");
  str = str + strarray[i];
}

// show

It's just a short example that uses webservice technology in Bioclipse to fetch a sequence.

QSAR support

QSAR support is coming along too, with a new DescriptorProvider extension point in trunk/, and work is progressing on a wizard that allows selecting descriptors and a CDK backend. The output of the wizard is a matrix resource, for which we already have a rich editor. A JOELib plugin has been suggested, as it has a good deal of QSAR descriptors too; Jörg, interested in doing a tiny bit of Bioclipse hacking?

The full proceedings are available online.

Wednesday, November 01, 2006

The Bioclipse Workshop is in progress

The Bioclipse Workshop is in progress, and Ola is now leading a discussion about future releases and functionality. Proceedings are updated live, and presentation slides will be available shortly.