Wednesday, February 27, 2008

Where's the maize genome torrent?!?

/. just posted a story about the just-published maize genome, for which the sequences can be downloaded from this FTP site. The files are not that large at all. But it makes me wonder... where are the .torrent files for the sequenced genomes? Here's David's catch on the story.

Update: OpenHelix discusses the matching genome browser, and indicates that hundreds of genomes are actively being studied. The group where I work recently bought a 454 (reg needed, it seems) and participates in the race. Nibbles links the cork event to square peas... but my biological background prevents me from seeing that link...

CDK is now available from your nearest Debian mirror

Some days have passed, and the Debian mirrors have now picked up the CDK package (unstable only so far), allowing you to sudo aptitude install libcdk-java from your favorite local mirror. The details are available from this page. The fact that it is listed as contrib is a small mistake; the package is really main material.

Now, also make sure to install BeanShell (sudo aptitude install bsh), which allows you to start scripting the CDK. For example, consider this simple script:
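// Import the CDK Atom class (the CDK jars must be on the CLASSPATH)
import org.openscience.cdk.Atom;
// Create a single carbon atom
Atom atom = new Atom("C");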
Save this as a file named simpleExample.bsh, and use the bsh program to run the script. You will have to set the CLASSPATH, so the full command looks like this on my Linux desktop:
CLASSPATH=/usr/share/java/cdk-interfaces.jar:/usr/share/java/cdk-core.jar:/usr/share/java/cdk-data.jar:/usr/share/java/vecmath1.2-1.14.jar bsh simpleExample.bsh
A wrapper script cdkbsh that adds the CLASSPATH seems desirable here :) But you get the point.
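Such a wrapper could look roughly like the sketch below. It is only an illustration, not something shipped in the Debian package: the function name cdk_classpath and the JAVA_LIB_DIR variable are made up here, and the jar names are simply the ones from the command above, so they may differ on your system. Called without arguments it prints the classpath it would use; otherwise it hands all arguments to bsh.

```shell
#!/bin/sh
# cdkbsh: hypothetical wrapper that runs bsh with the CDK jars on the CLASSPATH.
# The jar names and directory are copied from the command shown above (assumption:
# they match your installation); JAVA_LIB_DIR is an illustrative override.
JAVA_LIB_DIR="${JAVA_LIB_DIR:-/usr/share/java}"

# Build a colon-separated classpath from the CDK jars, keeping any
# CLASSPATH the caller already set at the end.
cdk_classpath() {
    classpath=""
    for jar in cdk-interfaces.jar cdk-core.jar cdk-data.jar vecmath1.2-1.14.jar; do
        classpath="${classpath:+$classpath:}$JAVA_LIB_DIR/$jar"
    done
    printf '%s\n' "${classpath}${CLASSPATH:+:$CLASSPATH}"
}

if [ "$#" -eq 0 ]; then
    # No script given: just show the classpath that would be used.
    cdk_classpath
else
    # Run bsh with the CDK classpath, e.g.: cdkbsh simpleExample.bsh
    CLASSPATH="$(cdk_classpath)"
    export CLASSPATH
    exec bsh "$@"
fi
```

With this saved as cdkbsh somewhere on the PATH and made executable, the long command above shrinks to cdkbsh simpleExample.bsh.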

Interestingly, BeanShell also comes with a graphical user interface, as well as a command line based scripting environment. Both make perfect setups for quickly testing some code. The GUI version xbsh looks like this (don't forget to set the CLASSPATH):

Wednesday, February 20, 2008

CDK close to entering Debian

Michael Koch (aka man-di) and Daniel Leidert (as part of the pkg-java team) have worked on packaging the CDK. They ran into some issues, such as the CDK build system not being perfectly compatible with the Debian Java libraries in /usr/share/java. Both detecting the available libraries and putting them in the classpath caused trouble with the CDBS-based build system wrapping around the Ant build.xml (note the many commits this weekend ;).

The result is noteworthy: CDK has entered the Debian NEW queue. This means that the Debian experts will check that CDK is really ready to enter Debian. Licenses will be checked, for example. This has been one of my long standing wishes, and I am happy that Michael got around to getting things done. Cheers!

Wednesday, February 06, 2008

Simple, Open Bug Track System: social bookmarking

Jim replied to the request by Anthony in my blog for a bug track system for CrystalEye (in beta), after a discussion on the CIF processing pipeline (see here, here, here and here).

Instead of setting up a BTS at SourceForge, locally with Bugzilla, or at LaunchPad, he suggested using Connotea:
    To report a problem in CrystalEye, simply bookmark an example of the problem with the tag “crystaleyeproblem”, using the Description field to describe the problem. All the problems will appear on the tag feed.

    When we fix the problem we’ll add the tag “crystaleyefixed” to the same bookmark. If you subscribe to this feed, you’ll know to remove the crystaleyeproblem tag.

    In the fullness of time, we’re planning to use connotea tags to annotate structures where full processing hasn’t been possible (uncalculatable bond orders, charges etc).
Now, Connotea is advertised as a [f]ree online reference management for all researchers, clinicians and scientists, and I have never really been happy with just any HTML page ending up in the system. I would counter the suggestion by using social bookmarking websites meant for any HTML page (not just publications), such as (see their list of CrystalEye bookmarks).

Anyway, it does not really matter, and Connotea has an open API to query the database. This will allow Jim to write a simple userscript to enhance each CrystalEye page with a list of bug reports. That will allow every CrystalEye visitor to see what others are commenting on it. In that respect, many other things can be envisioned... Getting comments on the paper behind the crystal structure from Chemical blogspace and Postgenomic, ...

Tuesday, February 05, 2008

Performance: C, C++, C#, Java, Perl and Python

Mathieu Fourment (et al.) just published a paper on performance testing of 6 programming languages in BMC Bioinformatics: A comparison of common programming languages used in bioinformatics (doi:10.1186/1471-2105-9-82). The figure below is from the paper, for a sequence alignment exercise (copyright with paper authors, OpenAccess license of journal):

Nothing shocking, I'd say; Java is similar in performance to C++.

What I'd love to have seen was the performance of compiled Java too, using the Java compiler (gcj) which comes with GCC 4.1.1. No idea why that was left out. One could also question why they did not use the 1.6 JVM of Sun, which is faster (see these results on running the CDK unit tests). And a major omission is Fortran.

Anyway, the authors provide the source code, so we can easily test the effects of that ourselves.

BTW, first post? :) update: At least I beat Carlos.

Saturday, February 02, 2008

Defining Development Goals: LaunchPad complements SourceForge

Today, Miguel (who made the 10000th CDK commit) and I gave LaunchPad a go, because it offers a nice GUI for planning and monitoring source code development. We have set up a CDK team and a CDK project. LaunchPad has overlap with SourceForge functionality, but the idea is not to duplicate functionality. Moreover, we do not translate the CDK, so that LaunchPad functionality is not useful either. Not for the CDK at least; maybe for Jmol and Bioclipse?

However, we are interested in the task management system of LaunchPad. While the CDK project is currently maintaining a Project Maintenance Tasks tracker, it does not have the feature richness of the LaunchPad equivalent. The latter allows us to link tasks with series goals. We currently basically have two series: the cdk1.0.x/ branch, and trunk. Miguel and I have been working on getting the ionization potential prediction in trunk working, which involves about all the code Miguel wrote during his PhD thesis with Christoph. And this is one of the goals of the next stable CDK series (replacing the 1.0.x series). This is something we can easily define in LaunchPad:

Getting the IP-prediction code updated for the new CDK atom types and other changes, and making it CDK stable, involves quite a long list of tasks with dependencies between them. For example, I can't continue cleaning up the partial charge prediction code before the resonance structure generator in the reaction module is working properly again. This in turn depends on me adding missing radical and charge atom types, which in turn depends on expected atom types, which Miguel had to implement. And this last is actually what he was committing around the 10000th commit.
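To make that chain concrete, here is a small sketch that feeds the same dependencies to the standard tsort utility, which prints tasks in an order that respects them. This has nothing to do with how LaunchPad works internally; the function name cdk_task_order and the task labels are shorthand invented for this illustration.

```shell
#!/bin/sh
# Print the tasks from the paragraph above in an order that respects
# their dependencies. Each input line reads "prerequisite task": the
# left item has to be finished before the right one can continue.
# The labels are invented shorthand, not LaunchPad task names.
cdk_task_order() {
    tsort <<'EOF'
expected-atom-types radical-charge-atom-types
radical-charge-atom-types resonance-structure-generator
resonance-structure-generator partial-charge-cleanup
EOF
}

cdk_task_order
```

Because the chain is strictly linear, there is only one valid order: the expected atom types come first and the partial charge cleanup comes last.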

Now, Miguel and I will try to manage this development in trunk using LaunchPad. It allows us to define all these smaller tasks, but, more importantly, the dependencies between them:

As such, LaunchPad gives us the means to manage this complex development. It shows what we're facing, how far we have progressed, and much, much more:

This goes well beyond what SourceForge has to offer; this will be an interesting experiment. I do not anticipate dropping SourceForge at all (just in case you were wondering...); they have served us generally very, very well; and completely free too! (LaunchPad is free too.) As far as I can see, they form a perfect complement. Like a ligand and an enzyme, like opensource and open notebook science, or like a Mammoth and an ice field.

Speaking about ONS... Jean-Claude, not sure if LaunchPad would be open to projects without source code too...

10000 CDK commits!

It has happened. Just a few minutes ago. The 10000th commit to the CDK source code repository. Miguel was the lucky(?) one. From our IRC channel #cdk on the network:
[19:55]  cdk: miguelrojasch * r10000 /branches/miguelrojasch/reaction/src/org/openscience/cdk/ (2 files in 2 dirs): Removed Flags. They were not used anymore.
And a screenshot:

The first source code was actually only added with the 5th commit, made by Christoph, days after our meeting with Dan (of Jmol fame) at Notre Dame in September 2000.

The full list of people who contributed to this enormous success is provided by OHLOH.