Sunday, June 29, 2008

The CDK Community: Developers, Members, and Users

An open source project is as good as its community. Jmol has a brilliant community, but CDK is not doing bad either, in general at least; some CDK projects could use some more user feedback, such as CDK-Taverna (site down at the time of writing, but see the blog).

There are actually quite a few things going on within the CDK Community. There are a few active or less active projects: (Even withing the CDK Library, several threads are ongoing, but I will report on that at some later stage).

A full list of projects is found in SVN. A recent new project is CDK Policy, which will attempt to formalize code development such that the library becomes better maintainable. One of the first things the draft does, is formalize roles withing the community.

CDK Members and CDK Developers
Basically, anyone with write access to CDK's SVN may call himself CDK Developer (57 in total). So, what if you contributed patches? Then you may call yourself CDK Member (well, as suggested in the draft policy), like anyone else who is subscribed to the cdk-devel mailing list.

This is an important fact. Anyone can subscribe to the list, and directly becomes active CDK Member; He/she gets a voice. The policy proposes that difference between member and developer to be in the fact that the policy has been accepted by the person. Therefore, according to the draft policy, anyone who accepts the policy gets SVN write access, making the CDK not just Open Source, but an Open Community too. The policy then organizes the maintainability of the software development.

CDK Users
A CDK User is basically anyone who uses any of the CDK products, but in particular those subscribed to the cdk-user mailing list.

A limited overview of developers, members and users can be found on this Google map:

Just email me (or cdk-user) your latitude/longtitude to have yourself or your research group (URL) linked on this map as user or developer.

Wednesday, June 18, 2008

Tuesday, June 17, 2008

Graphical overview of my bookmarks

Deepak informed me about Wordle via a FriendFeed notice, which can make nice visualisations of tag clouds. Here's the one for my account:

You can clearly see I have quite some reading up to do :)

Thursday, June 05, 2008

Recovering full mass spectra from GC-MS data #2

Steffen reminded me over email that the particular machine only has a 1 dalton accuracy, and that the 150ppm parameter setting is somewhat inappropriate. As seen yesterday, it works fine for larger peaks, but fails for low intensity peaks. So, I reran the centWave peak detection with 750, 1000 and 1250 ppm, and that indeed make XCMS recover many more metabolites, and, also important, with more extracted ion chromatograms per metabolite, yielding a more accurate mass spectrum. At the same time, I notice that profiles are not as clean as before, but that's where the peak fitting with (Modified) Gaussians come into play.

The original 150ppm results:

The 750ppm results:

And for 1000ppm (1250ppm did not further improve):

Wednesday, June 04, 2008

Recovering full mass spectra from GC-MS data

One aspect not covered in detail by the ongoing discussion on unit testing quality control for scientific software, is detecting regressions. This the most important reason why unit testing is superior to random testing. Putting someone behind a keyboard to tests things is nice, but this process has to be repeated, as the testing has to be repeated over and over again. Just to make sure it works for whatever new input, for whatever refactoring, for whatever new cool feature.

Another advantage of unit testing over random testing, is in the fact that it provides you with statistics (lies, damn lies, and statistics). These statistics do give some insight where to start looking, though if really written properly, each unit test has the ability to test a single line of functional code. That's where code coverage testing is useful, and should be part of the process too. I have no idea what commercial chemoinformatics software vendors do regarding quality control, but I assume they make heavily use of code coverage too.

SimBioSys Blog mentioned the use of annual software competitions. That's important indeed, and provides nice means to compare options, but it is not unit testing; it's a macroscopic test of functionality, and has little means to identifying underlying fails. Is a bad CASP score caused by wrong isotope masses used in the force field, or by the approach? I'm sure no one can tell.

Anyway, refactoring is a principle activity of software engineers, and unit testing manages that process in some way. Taking unit testing to the extreme, any new coding starts with writing the unit tests for the new API. Only when those are finished, the functionality is actually implemented.

Ok, now something less boring. GC-MS-based Metabolomics with metabolite identity in particular. Though I believe there are other uses too, for example, in between sample alignment, the recovery of a full mass spectrum is particularly important for metabolite identification of new, yet unknown compounds (yes, even dereplication is already non-trivial, because of the lack of free (open data preferably), machine accessible (open standards!) database of mass spectra (using different ionization methods). Look up by monoisotopic mass is possible in, for example, ChemSpider (see this blog), but look up via full spectrum is less common. The number of databases are growing, and likewise the openness and accessibility. Who know what the Netherlands Metabolomics Center's support platform will be able to offer in a year or two.

Now, unit tests could, for example, tests that some algorithm can deconvolute the following GC-MS data:

The red line is the TIC for the chromatogram, while the black lines are the extracted ion chromatograms of individual m/z (ion) peaks in the mass spectral dimension, and, only of peaks detected using XCMS using the new centWave method by Ralf. Now, the results are not perfect in this diagram, but it does seem to recognize all five eluting metabolites (that's the amount I would guess are eluting). However, I am more interested in the methods ability to recover all m/z peaks for each metabolite, to allow me to identify the structure, or at least make a best possible educated guess, or, with a bit of luck the compound is already known, I can dereplicate it against some database (Guesses differ, but, particularly in plant metabolomics, more than half of the metabolites we can now detect have a yet undetermined structure).

Now, I'm sure any method will be able to deconvolute these compounds. They are well separated, show a nice gaussian shape, and deconvolution based on just the chromatographic domain will likely already work. It starts to become more difficult for the low intensity peaks, those with low signal-to-noise ratios, or those with a different elution profile (e.g. peak tailing). Deconvolution typically requires some peak shape (commonly Gaussian or Exponential-Modified Gaussian), while experimental data typically does not have that. Jim Downing recently introduced me to the term 'Long Tail Science' (via this blog from Peter):
    Jim Downing cam up with the idea of “Long Tail Science”. The Long Tail is the observation that in the modern web the tail of the distribution is often more important than the few large players. Large numbers of small units is an important concept. And it’s complimentary and complementary.
This long tail of ion chromatograms is what I am interested. I do not care about the usual suspects, I want to learn about the 80% unknown metabolites that are found in samples. What can we learn about those? What is in the long tail of detected metabolites?

Now, I know I am not an expert in tuning centWave parameters, so I might as well be passing garbage in, but the method is not robust against me:
xr <- read(file="someClosedData.cdf")
p <- findPeaks(xr, method="centWave", ppm=150, peakwidth=c(5,25))
But it fails to detect some of the metabolites in the long tail:

As you can observe from the low S/N ratio on the red TIC line, you can notice that we are at low intensity metabolites. Assuming some peak shape is at such noise levels much more difficult than with better S/N ratios. The centWave uses a really nice non-parametric approach here... well, not entirely, otherwise it would not have failed over my parameter settings :) Steffen/Ralf, what was the DOI again?

Now, I found these missed metabolites by manual browsing the data, data exploration. SBS wrote [t]here are four distinct type of tests: Func, Speed, Error and Robust. I believe the above situation is really a fifth class: it's neither a true functional test, but it is not an true robustness test either. The input is valid, but of such that it moves towards invalid input. Scientific data is a continuous spectrum of input (remember the oil in, oil out). Software must be tested against such borderline data; repeatedly, over and over again, for any version, for any code change, for any platform.

Oh, and code quality has nothing to do with trust. Give me statistics (I can interpret the scope of them) over any trust assurance.

Tuesday, June 03, 2008

Good Scientists Pimp there Research (was: Damn, I'm boring...)

Define good. Let me say that up front. Good scientists, that is, if you say successful researchers are good scientists, secure good funding. Getting good funding requires doing the most relevant research (define relevant). Or, to put it bluntly, being a successful researcher requires to pimp your research. Doing boring research is nice for you, good for a Nobel prize if it turns out to have a cool spin off, but doesn't buy you research success.

And, boy, am I boring or what? Really, Peter is right... I am boring. He is also right that doing quality control doesn't result in research output. He writer:
    But it’s essential for modern knowledge-driven science. The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics. I have never seen a commercial company publish results of roundtripping. That’s another really boring and apparently pointless operation where you take a file A, convert it to B and then convert it back to A’. What’s the point? Well A and A’ should be the same. With most commercial software you get loss. If you are lucky it’s only whitespace. But it’s more likely to be hydrogens or charges or whatever.
There is a huge gap between should and is. This is partly caused that it is much easier to throw garbage at some piece of code, than to start an Ugi reaction on something which is naturally broken. Take a moment to think about that. Garbage in, garbage out is still common in information technology, but I have never heard of the saying oil in, oil out.

Scientists do not have the ability to recognize garbage (when it comes to data)! It's statistics, that other boring stuff I do. Even worse, it's the programmers fault if someone throws garbage at my software and it doesn't return 42.

Unit testing is a poor mans approach to handle garbage. As are diagnostics tools, like finding the difference between IChemObjects (and #2). Unit tests define common garbage in (of course, also valid input in, but that does not make it less boring) and tests that the method does not freak out; that the tested method gives the user some friendly message like "Hi there! I'm your friendly computer program. I am sorry to inform you that it's garbage you just passed along. Please clean it up first.".

Anyway, back to my boring research...

Finding differences between IChemObjects #2

CDK QSAR descriptors are not allowed to change the input [molecule|atom|bond], and I recently added a unit tests (rev 11138) for that to the abstract class AtomicDescriptorTest.

After some code clean up of the diff module code earlier this morning (in anticipation of the rain stopping), I applied this patch (rev 11269) that noModification unit test:
 public void testCalculate_NoModifications() throws Exception {
IAtomContainer mol = someoneBringMeSomeWater();
IAtom atom = mol.getAtom(1);
- String priorString = atom.toString();
+ IAtom clone = (IAtom)mol.getAtom(1).clone();
descriptor.calculate(atom, mol);
- String afterString = atom.toString();
+ String diff = AtomDiff.diff(clone, atom);
- "The descriptor must not change the passed bond in any respect.",
- priorString, afterString
+ "The descriptor must not change the passed bond in any respect, but found this diff: " + diff,
+ 0, diff.length()
This is a nice example of where the new diff module is useful. Instead of dumping to long IAtom.toString()s, the output now gives output like:
AtomDiff(AtomTypeDiff(, NULL/H, NC:0/1, V:0/1))
This indicates (yes, a bit cryptic) that the formal neighbor count (NC) and the valence (V) fields have been modified, in addition to that first field, which I don't know what it refers too. Indeed, the output still needs a bit more tuning :)

Sunday, June 01, 2008

Finding differences between IChemObjects

CDK trunk is getting into shape, thanx to the many people who contribute to this, and special thanx to Miguel for cleaning up his code related to charge, resonance, and ionization potential calculations!

At the moment, I am focusing at two issues:
  1. QSAR descriptors that change the input (causing other descriptors to randomly fail)
  2. Cloning of IChemObject (for which last week a rather serious bug was found)
Until some days ago, the CDK had one main method to introspect IChemObjects: their toString() results. However, finding the difference between IChemObjects using this approach is not trivial, particularly if there are several differences.

So, I started a new module called diff. If two objects are identical, it returns a zero-length String. If not, it lists the changes between the two classes, in a way much like that of the IChemObjects toString() methods.

For example, consider this bit of code:
IChemObject atom1 = new ChemObject();
IChemObject atom2 = new ChemObject();
atom2.setFlag(CDKConstants.ISAROMATIC, true);
String result = ChemObjectDiff.diff( atom1, atom2 );
The result value then looks like:
ChemObjectDiff(, flag5:F/T)
Now, output will likely change a bit over time. But at least, I now have a easier to use approach for debugging and writing unit tests. Don't be suprised to see test-* modules start depending on the new diff module.