Maintaining patches is fixing patches

Today I received a comment that having to fix patches against upstream changes, because those patches were not yet included upstream, is not very productive.

However, it is a prominent part of maintaining a code base. In the past 9 years, I and many others have been reworking a lot of CDK code because of API changes and bug fixes in deeper parts of the CDK library. At least half of the work I have done for the CDK is this kind of fixing of downstream code. This is never trivial, and it is never productive. Well, that depends somewhat on your definition of productivity.

Whether productive or not, it is just something that needs to happen. Additionally, it is not something you can prevent. I guess one can call this a fact of life. Doesn't make it nice work. Not at all. And most of my frustration with the CDK library stems from the lack of documentation and unit testing, which makes such fixing of downstream code hard. This means that the person best suited to do this job is the one who wrote the patch in the first place. The person who made the comment I mentioned earlier is seeing this from very up close now.

Code Quality
I very much understand his feeling of being unproductive when updating patches; been there, done that. He (that I can disclose) is absolutely right. With all the quality assurance functionality I have set up in the past for the CDK, nicely integrated in Rajarshi's Nightly script, I hope to make it easier for people to write proper, maintainable patches. Often these reports are, however, again about doing tasks which make you feel unproductive. But I can assure you that writing such quality assurance tools, like the OpenJavaDocCheck I worked on this weekend, makes you feel even less productive.

Sometimes making a library more maintainable includes reworking the design. Almost always this takes serious effort, and potentially introduces new bugs. At the same time, it always fixes a lot of older bugs and, if redesigned properly, makes it much easier to fix other bugs and allows more functionality to be implemented.

But again, this requires rewriting of downstream patches too. And whoever does the redesign will always get comments that it forces unproductive code updates downstream. I have seen this on several occasions in the CDK, such as with my rewrite of the atom typing functionality. (And don't get any KDE4 developer started on that topic ;) Another fact of life, I guess.

CrossRef writes up RSS usage recommendations

CrossTech announced that a CrossRef working group has written best practices for the use of RSS feeds by publishers. It is a nice introduction for anyone who is creating RSS feeds. The only comment I could make is about the lack of other modules. For example, we proposed a Chemistry module 5 years ago already (DOI:10.1021/ci034244p), about which I blogged on several occasions.

Below is the CMLRSS feed of Chemical blogspace.

Of course, publishers can take advantage of such modules, using the XML Namespaces technology. The best practices use that for a Dublin Core and a PRISM extension. The CML extension discussed here is another one, but the point is that you can basically plug in any module.

Work in Progress: an Open DocCheck replacement

While it is still very much in progress, I have already made more progress than I had hoped for. The JavaDoc Doclet API is actually not too difficult to use, though my use of it will very likely improve later. The CDK has been using Sun's DocCheck utility for testing the library's JavaDoc quality, but the reports never really satisfied me. Moreover, the most recent version is ancient and, because it is closed source, no one can continue that effort. DocCheck is MIA.

In contrast, PMD gives nice overviews of what it believes to be wrong with the CDK, and also provides a decent XML format that allows information to be extracted. SuperNightly, for example, uses this, as shown yesterday in "PMD 2.4.5 installed in the CDK 1.2.x branch".
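
To illustrate what extracting information from that XML can look like, here is a minimal shell sketch that counts the violation elements in a report. The helper name count_violations and the tiny sample report are made up for this example, and the real PMD report carries more attributes than shown; this is not how SuperNightly actually does it.

```shell
# Hypothetical helper: count <violation> elements in a PMD XML report.
count_violations() {
    # grep -o prints each match on its own line, so wc -l counts elements
    # even when several of them share one line of the report.
    grep -o '<violation[ >]' "$1" | wc -l | tr -d '[:space:]'
}

# Simplified, made-up report used only to demonstrate the helper.
report=$(mktemp)
cat > "$report" <<'EOF'
<pmd>
  <file name="Example.java">
    <violation rule="UnusedPrivateField">Avoid unused private fields.</violation>
    <violation rule="EmptyCatchBlock">Avoid empty catch blocks.</violation>
  </file>
</pmd>
EOF

echo "violations: $(count_violations "$report")"
rm -f "$report"
```

Comparing that count between two nightly runs gives the kind of before/after number that makes regressions visible.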

I have been pondering this for a long time now, but writing a JavaDoc checking library is hardly core cheminformatics research; at least, you would not get funding for it, despite everyone always complaining about the lack of good documentation. Alas.

Last week, I was reviewing some more code, and again saw the very common error of the missing period at the end of the first sentence in JavaDoc. This one is sort of important for proper JavaDoc documentation generation, but because of the complexity of the current DocCheck reporting, people are not familiar enough with it. Being tired of having to repeat myself, I decided to address the problem by creating better Nightly error reporting for the CDK JavaDoc.
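
The missing-period check itself is simple enough to sketch in a few lines of shell. This is only an approximation under one assumption: that the first sentence ends at the first period followed by whitespace or the end of a line. The function name has_first_sentence_period is hypothetical, and this is not the ojdcheck implementation, which works through the Doclet API.

```shell
# Hypothetical helper: succeed when the description contains a period that is
# followed by whitespace or by the end of a line.
has_first_sentence_period() {
    printf '%s\n' "$1" | grep -Eq '\.([[:space:]]|$)'
}

# Two made-up JavaDoc descriptions, one with and one without the period.
for doc in "Returns the number of atoms." "Returns the number of atoms"; do
    if has_first_sentence_period "$doc"; then
        echo "ok:      $doc"
    else
        echo "missing: $doc"
    fi
done
```

A period inside a token such as "1.2" does not count here, because it is not followed by whitespace.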

So, I started OpenJavaDocCheck, or ojdcheck. As mentioned, I have made quite promising progress, and the current version provides the ability to write custom tests (which I plan to use for validating the content of CDK taglets), and to create XML as well as XHTML which can be saved to any file. To give you a glimpse of where things are going, here's a screenshot of the current XHTML output:

The current list of tests is really small, and consists of a single test:
  • test if each class and method has JavaDoc

PMD 2.4.5 installed in the CDK 1.2.x branch

Today I installed PMD 4.2.5 in the CDK 1.2.x branch, which contains mostly bug fixes compared to the 4.2.2 version we had earlier. Several of these fixes concern false positives: warnings which were not really problems, but tests going bad.

The number of these false positives seems to be significant, as the number of PMD violations for the CDK 1.2.x branch seems to have dropped by about 1500 warnings! :)

SPARQL end points, Jena and bif:contains

I have been having fun with SPARQL in Bioclipse for a while now, and blogged about it on several occasions.

One thing I had not been able to work out is that Virtuoso uses a (rather nice) bif:contains extension that supports indexing. However, Jena would complain with:
com.hp.hpl.jena.query.QueryParseException: Line 1, column 31: Unresolved
prefixed name: bif:contains
Defining the prefix did not solve the problem either, but Ivan Mikhailov just replied to my post to the virtuoso-user mailing list providing the solution.

The solution lies in the fact that bif: is in its own namespace, which makes it possible to replace bif:contains by its full reference <bif:contains>. I directly gave that a try in Bioclipse, and just successfully ran this Bioclipse script snippet:
  "SELECT * WHERE {?s ?p ?o . ?o <bif:contains> \"aspirin\" .};"

Thanx, Ivan!
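
For anyone composing such queries outside Bioclipse, the same trick can be sketched in shell. The function name bif_contains_query is made up for this example, and the sketch assumes the search term contains no double quotes or backslashes; it only builds the query text and does not contact any end point.

```shell
# Hypothetical helper: build a full-text query using the <bif:contains> form,
# which avoids the unresolved bif: prefix.
bif_contains_query() {
    # $1: search term (assumed to contain no double quotes or backslashes)
    printf 'SELECT * WHERE { ?s ?p ?o . ?o <bif:contains> "%s" . }\n' "$1"
}

bif_contains_query aspirin
```

Writing the predicate between angle brackets is what keeps the query parser from looking for a bif: prefix declaration.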

NMRShiftDB RDF #3: Bio2RDF

You might have seen my efforts to convert the NMRShiftDB data into RDF.

Shortly after that, Peter Ansell copied the data into Bio2RDF, but I had not blogged about that yet. So, here goes. If you have not looked at Bio2RDF yet, this is a good time to do that. The structure of the exposed triples is not perfect, and I just realized I made a beginner's mistake: using a domain name I have no control over in a namespace (bad me). The Virtuoso6 faceted browser allows you to navigate the data in Bio2RDF by molecule (e.g. molecule 234):

And by spectrum too (e.g. spectrum 4735):

Where are the CDK 1.3.1 and 1.2.4 releases ?!?

You might be wondering what is keeping the CDK 1.3.1 and 1.2.4 releases. And right you are. When we look at Supernightly, we get a clue (BTW, I hope the EBI nodes will join soon too):

Studying this table shows the reasons: there are too many regressions, too many failing unit tests. For example, 1.2.4 (while not yet released, called 1.2.3.git) has 50 new failing tests. Now, fair enough, this is mostly because ioformats was not tested in 1.2.3, and most of the failures are caused by a bug in the test, not in the code. But that still leaves 20 other failing tests. These are mostly related to known bugs, and for some problems patches are actually available.

These last 22 we also see in the differences between 1.3.0 and 1.3.1 (while not yet released, called 1.3.0.git). That's because the ioformats module is not tested in that branch either, pending a new merge with the cdk-1.2.x branch.

Wednesday, October 07, 2009 funded research to be OA as of 2010

Happy news from the Swedish Vetenskapsrådet (via Coturnix): as of 2010 all peer-reviewed journal papers must be Open Access. I am not yet VR funded, but I am involved in a few VR grant applications. Not that that really matters, as I am happily publishing OA already.

Keeping my Bioclipse repositories in sync with upstream

Bioclipse is now split up over several Git repositories (and some additional stuff in even more repositories). This has all to do with each repository now having one person acting as point-of-access. This means that I have several repositories checked out, which I need to keep synchronized. Now, I am pretty sure there are many solutions (and suggestions very welcome!), but this is the Bash script I have just written to give me an overview of the state of my repositories, hoping it may be useful to others too:

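# Fetch and report the status of every repository in the current directory.
PLUGINS=`ls -1`

for PLUGIN in $PLUGINS; do
        echo "***************************************************************** $PLUGIN"
        # Run in a subshell so a failed cd cannot leave us in the wrong directory.
        (cd "$PLUGIN" && git fetch origin && git status)
done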

CDK Molecules in RDF

Yesterday, I finally got around to starting a branch on adding RDF support to the CDK; in particular, writing the CDK data model ontology in OWL and serializing to and from RDF using that ontology. The framework is now set up, but I have yet to formalize all bits and pieces of the CDK data model in classes and properties. Just as a preview, here is what a very basic bit of CDK model in RDF looks like (N3 format):
@prefix cdk:     <> .

      a       cdk:Atom ;
      cdk:symbol "C" .

      a       cdk:Molecule ;
      cdk:hasAtom  .
Still rather verbose, but very flexible. I have even been thinking of an XHTML+RDFa writer...

Google Wave Invite: but you need to work on the CDK and the CDKitty robot

I just posted the email below to the cdk-user mailing list. Next Monday, I'll decide.
Hi all,

unless you have not read any news in the last two days, you will have
seen that Google is rolling out a second batch of Google Wave
accounts... I have one invite for someone who wants to co-develop the
CDKitty robot, which adds CDK-based functionality to Google Wave...

The code is at:

If you are interested in the account, please email me offline with:

* how you think you can contribute to the robot
* why you want to do that
* how much time you will have for it

The position is open to anyone, and consider your email an application
to the position :) (and, if you are a student, we could even try to
arrange Uppsala University credit points, if you can work 20 weeks
full time on it).

BTW, existing Google Wave users can invite the robot by adding

Processing the ChEBI MDL SD file with the CDK

Bioclipse has a bug report about browsing the ChEBI SD file in its moltable editor. Some entries make Bioclipse crash (as reported), or just make it very sluggish, as on my Dell superlapcomputer :)

So, I processed the file with a pure CDK 1.2.3 with this small piece of Groovy script:
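import org.openscience.cdk.*;
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.tools.manipulator.*;

// Assumption: the second constructor argument is the default builder.
iterator = new IteratingMDLReader(
  new File("ChEBI_complete.sdf").newReader(),
  DefaultChemObjectBuilder.getInstance()
);
int i = 0;
while (true) {
  // time both hasNext() and next(), stopping cleanly at the end of the file
  long startTime = System.currentTimeMillis();
  if (!iterator.hasNext()) break;
  IMolecule mol = (IMolecule) iterator.next();
  long endTime = System.currentTimeMillis();
  i++;
  formula = MolecularFormulaManipulator.getMolecularFormula(mol)
  long time = endTime - startTime;
  // report only the entries that took 100 ms or more to read
  if (time > 99)
    println i + ": " + MolecularFormulaManipulator.getString(formula) +
            " (" + endTime + "-" + startTime + "=" + time + " ms)"
}
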
This script times the reading of all entries and reports every entry that takes 100 ms or more to read (in the scripting environment). There are surprising results: H2O takes 50 seconds, phosphate 100 seconds. So, I am quite certain it must be the reading of the metadata, and not the connection table. But this I will explore in more detail now, hoping to come up with a patch for the CDK to speed up reading of such entries.

The full list of timings:
