Sunday, November 30, 2008

Parallel building the CDK

Some time ago, I added parallel building targets to CDK's Ant build.xml. Now that I am setting up a Nightly for the jchempaint-primary branch, and really only want to report on the CDK modules control and render, I need the build system to use a properties file to define which modules should be compiled.

So, I hacked a bit on the build system, and made use of two ant-contrib tasks, if and foreach which in the first place reduce the size of the build.xml, but also provide means for parallelization. Earlier, it was using the parallel task of Ant itself for this (see CDK Module dependencies #2).

The build dependencies between CDK modules are fairly complex, and typically this complexity increases with bug fixing etc. Ideally, the build dependencies would be calculated at runtime, instead of being hard-coded as they are right now, and I will explore this in the near future.

These dependencies allow some of the modules to be built in parallel, but not all. As a result, the compilation speed-up does not scale linearly with the number of threads or cores. The build times below are calculated from three replicates, on a four-core machine:

Going from one to two threads certainly pays off, but going to four shows only a three-second speed-up. The four processor cores were not 100% utilized, so I also tried two threads per core, but that showed zero improvement.
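To illustrate why extra threads stop helping, here is a minimal sketch that groups modules into "waves" that can be compiled together. The module names and their dependencies are made up for the example and are not the real CDK dependency graph, and this is not how the build.xml does it.

```javascript
// Hypothetical module graph: each module lists the modules it must wait for.
// The names are illustrative and do not reflect the real CDK dependencies.
const deps = {
  interfaces: [],
  core: ["interfaces"],
  data: ["interfaces"],
  standard: ["core", "data"],
  render: ["standard"],
  control: ["render"],
};

// Group modules into waves: every module in a wave depends only on
// modules from earlier waves, so one wave can be compiled in parallel.
function buildWaves(graph) {
  const done = new Set();
  const waves = [];
  let remaining = Object.keys(graph);
  while (remaining.length > 0) {
    const ready = remaining.filter((m) => graph[m].every((d) => done.has(d)));
    if (ready.length === 0) {
      throw new Error("cyclic or unknown dependency among: " + remaining.join(", "));
    }
    waves.push(ready);
    ready.forEach((m) => done.add(m));
    remaining = remaining.filter((m) => !done.has(m));
  }
  return waves;
}

const waves = buildWaves(deps);
waves.forEach((w, i) => console.log("wave " + (i + 1) + ": " + w.join(", ")));
```

In this toy graph no wave is wider than two modules, so a third or fourth thread would have nothing to do. The number of waves sets a lower bound on the build time, however many cores are available.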

Monday, November 24, 2008

Software is a Method (Meme)

  1. it provides a recipe to approach (scientific) questions

  2. lets you cook up a (scientific) answer

  3. you can use it as a black box (like an orbitrap)

  4. you can refine existing methods (well, some can, others don't)

  5. it has an error (but I do not believe it is normally distributed)

Now, to me it's trivial to work out how Open Source supports this.

Thursday, November 20, 2008

Scripting JChemPaint

Today and tomorrow, Stefan, Gilleain, Arvid and I are having a JChemPaint Developers Workshop in Uppsala, to sprint the development of JChemPaint3, for which Niels laid the foundation a long time ago.

Gilleain and Arvid are merging their branches into a single code base, while Stefan is working on the Swing application and applet. The Bioclipse SWT-based widget is being developed for Bioclipse2.

The new design separates widget/graphics toolkit specifics from the chemical drawing and editing logic. Regarding the editing functionality, this basically comes down to having a semantically meaningful edit API. This allows us to convert both Swing and SWT mouse events into things like addAtom("C", atom), which would add a carbon to an already existing atom. It also does not take much imagination to see that it allows adding a scripting language. This is what I have been working on. Right now, the following API is available from the Bioclipse2 JavaScript console (via the jcp namespace, in random order):
  • ICDKMolecule jcp.getModel()
  • IAtom getClosestAtom(Point2d)
  • setModel(ICDKMolecule) (for really fancy things)
  • removeAtom(IAtom)
  • IBond getClosestBond(Point2d)
  • updateView() (all edit commands issue this automatically)
  • addAtom(String,Point2d)
  • addAtom(String,IAtom) (which works out coordinates automatically)
  • Point2d newPoint2d(double,double)
  • updateImplicitHydrogenCounts()
  • moveTo(IAtom, Point2d)
  • setSymbol(IAtom,String)
  • setCharge(IAtom,int)
  • setMassNumber(IAtom,int)
  • addBond(IAtom,IAtom)
  • moveTo(IBond,Point2d)
  • setOrder(IBond,IBond.Order)
  • setWedgeType(IBond,int)
  • IBond.Order getOrder(int)
  • zap() (sort of sudo rm -Rf /*)
  • cleanup() (calculate 2D coordinates from scratch)
  • addRing(IAtom,int)
  • addPhenyl(IAtom)
This API (many more methods will follow) is not really aimed at the end user, who will simply point and click. The goal of this scripting language is, at least at this moment, to test the underlying implementation using Bioclipse. Future applications, however, may include simple scripts which use some logic to convert the editor content, for example replacing a t-butyl fragment with a pseudo atom "t-Bu". The key thing to remember is that this will allow Bioclipse to have non-CDK-based programs act on the JChemPaint editor content (e.g. using getModel() and setModel(ICDKMolecule)). More on that later.
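To give a feel for how these calls chain together, here is a rough sketch that runs in plain JavaScript outside Bioclipse. The jcp object below is only a stand-in written for this example: the method names come from the list above, but the return values and the naive coordinate placement are assumptions, not the real implementation.

```javascript
// Minimal stand-in for the Bioclipse `jcp` object, so the sketch runs in
// plain JavaScript. Only the method names come from the list above; the
// return values and behaviour here are assumptions.
const jcp = {
  atoms: [],
  bonds: [],
  newPoint2d(x, y) { return { x, y }; },
  addAtom(symbol, target) {
    // target is either a point or an existing atom to attach to
    const attachTo = this.atoms.includes(target) ? target : null;
    const point = attachTo
      ? { x: attachTo.point.x + 1.5, y: attachTo.point.y } // naive placement
      : target;
    const atom = { symbol, point, charge: 0 };
    this.atoms.push(atom);
    if (attachTo) this.bonds.push({ from: attachTo, to: atom });
  },
  getClosestAtom(p) {
    const dist = (a) => Math.hypot(a.point.x - p.x, a.point.y - p.y);
    return this.atoms.reduce((best, a) => (dist(a) < dist(best) ? a : best));
  },
  setSymbol(atom, symbol) { atom.symbol = symbol; },
  setCharge(atom, charge) { atom.charge = charge; },
};

// The script itself: draw a carbon, attach a second atom, then edit it.
const origin = jcp.newPoint2d(0.0, 0.0);
jcp.addAtom("C", origin);
const carbon = jcp.getClosestAtom(origin);
jcp.addAtom("C", carbon);
const second = jcp.getClosestAtom(jcp.newPoint2d(1.5, 0.0));
jcp.setSymbol(second, "O");
jcp.setCharge(second, -1);

console.log(jcp.atoms.map((a) => a.symbol + "(" + a.charge + ")").join(" "));
```

Only the last block is the kind of thing one would type into the console; the stub above it exists purely so the example can be run and checked without Bioclipse.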

A simple script could look like: Or, as screenshot:

Tuesday, November 18, 2008

Solubility Data in Bioclipse #1

I am working on converting Jean-Claude's Solubility data to RDF (after Pierre's model, see here, here, and here, here for first data exploration), so that I can integrate it with data from DBPedia, Freebase, etc. Bioclipse will be the workbench in which this will be visualized, and it just got graph depiction working using Zest. The screenshot does not show the RDF yet, but that will follow soon:

Next stops:
  1. create an Eclipse package for Jena
  2. read the Solubility data (does anyone know a Java library to read from Google Docs?)
  3. create a virtual database of Solubility compounds (possibly StructureDB-based)
  4. Use the CDK to autoextract chemical triples

Wednesday, November 12, 2008

Re: Open Source != peer review

Andrew has an interesting thread on the content of a slide of a recent presentation. In the comments you can read the back and forth on things; indeed, there are very many aspects to things and he did ask a very complex question, of which he assumed that I understood what he was asking, and I indeed assumed too that I understood what he was asking:
    Some argue that doing good computational-based science requires open source. The argument is that scientists need to review the source code in order to verify that it works correctly. How, they argue, can you review someone else's paper if you can't review the source code used to make that paper?

    I like open source. (My talk goes into the philosophical differences between "open source" and "free software.") I think there should be support for peer review. But I don't understand why the ability to see the source code, in order to review it for scientific quality, requires the right to redistribute the source code to others.
So, I assumed he was interested in hearing why people think open source benefits open source. Misinterpreting the last two words, I thought access to the code and the ability to redistribute code I find bad in my peer review. There was another incorrect assumption on my side: I had open peer review in mind, which I like so much about open source projects, instead of peer review as in paper peer review, prior to the preprint server age. Another thing I understood incorrectly was that he was only referring to computational packages, not cheminformatics in general. My mistake. Being from a GCC meeting, I assumed the latter.

Therefore, a lot of miscommunication. I agree to a large extent with Andrew's analysis: peer review is certainly possible without Open Source. Actually, this matches closely with the discussion between Cathedral versus Bazaar opensource projects (see my post earlier this week). He argues that current opensource (cheminformatics) projects do not have enough eyeballs, and indicates that money buys eyeballs. Indeed it does.

However, the original argument I wanted to make, but failed, is that Open Source (any kind of access to the source code) is a strict requirement for reviewing the implementation. We do not want black boxes.

How you organize this access to the source code is another thing, and the topic of much of the discussion in Andrew's blog. There are many solutions, but all include some sort of access to the source code. Redistribution is not a requirement, though, if the review is only sent upstream, as is common in reviewing papers.

I feel that Open Source is a solution worth fighting for, but I do understand the argument that funding of this approach remains a problem. Open Source cheminformatics is the equivalent of a preprint server; one solution to peer review, a good one, I think, but not the only one. The parallels are seemingly even stronger: you cannot review a paper by just reading the abstract and the conclusion: a paper is not a black box either.

Anyway... just a tip of the iceberg touched in the discussion. Feel free to join in.

Monday, November 10, 2008

Finding the commit that causes the regressions...

CDK 1.1.x releases are well in progress, but a recent commit broke a number of unit tests. Here comes git-bisect.
$ git checkout -b my-local-1.2 cdk1.2.x
$ git bisect start
$ git bisect bad
$ git bisect good 8219139e9236ab8036e9d08c13fcd0482d500c79
These lines indicate that the current version (HEAD) is broken, and that revision 8219139e9236ab8036e9d08c13fcd0482d500c79 was OK. Now, git-bisect does the proper thing, and starts in the middle, allowing me to run my tests, and issue a git bisect bad or git bisect good depending on whether my test fails or not. The test I am running is:
$ ant clean dist-all test-dist-all jarTestdata
$ ant -Dmodule=smarts test-module
$ git bisect [good|bad]
So, if I had to inspect 1024 commits, I would have found the bad commit after running this test suite 10 times. For the culprit I was after, it took 6 runs. The outcome was this commit, which I had already suspected and emailed the cdk-devel mailing list about:
[fa49ac603c36908f341b25d52a78435cdb8ca4d3] atomicNumber set as default (Integer) CDKConstants.UNSET
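The count of 10 runs for 1024 commits follows from halving the search range each time. Here is a small sketch of that halving; it is not what git does internally, the commit history is just a list of numbers, and the position of the bad commit (700) is made up.

```javascript
// Find the first bad commit in an ordered history with a binary search,
// counting how many times the (expensive) test has to be run.
// commits[0] is known good, commits[commits.length - 1] is known bad.
function bisect(commits, isBad) {
  let good = 0;
  let bad = commits.length - 1;
  let runs = 0;
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    runs += 1; // one full test run per inspected commit
    if (isBad(commits[mid])) {
      bad = mid;
    } else {
      good = mid;
    }
  }
  return { firstBad: commits[bad], runs };
}

// One known-good commit followed by 1024 candidates; the regression is
// (hypothetically) introduced at commit number 700.
const history = Array.from({ length: 1025 }, (_, i) => i);
const outcome = bisect(history, (commit) => commit >= 700);
console.log("first bad commit: " + outcome.firstBad + ", test runs: " + outcome.runs);
```

Each run halves the number of candidates, so the number of runs grows with the base-2 logarithm of the number of commits to inspect.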

Friday, November 07, 2008

Open{Data|Source|Standards} is not enough: we need Open Projects

The Blue Obelisk mantra ODOSOS, Open Data, Open Source, Open Standards, is well known, and much cited too. Jean-Claude Bradley popularized Open Notebook Science (ONS). This has always been nagging me a bit, because the CDK, Jmol, JChemPaint and other chemistry projects have done that for much longer, though we did not use notebooks as much, and so just called it an open source project. It really is no different, IMO, though surely, there are differences.

Anyway, the key thing which ONS and CDK and Jmol share is that they use an Open Notebook. Not every Open Source or Open Data project does. Actually, many scientific Open Source projects are not Open Projects! They are more like the Cathedral than the wished-for Bazaar (see The Cathedral and the Bazaar). So, Open Source (science) projects are certainly not ONS projects by default!

Now, the CDK actually is ONS, it is a Bazaar. The notebooks we use include: What more would you wish for? That's not a rhetorical question. Remember that every reader of this blog is in my advisory board!

Unfortunately, I do not create work at a workbench myself, so I do not produce new knowledge myself, other than extracted from existing data. That's really a shame, and I really do hope that Jean-Claude or Cameron will send me a box to measure solubilities (see here, here, and here, here for first data exploration), even though I cannot participate in the challenge. (hint, hint :)

From Cathedral to Bazaar in Life Sciences
One Cathedral we ran into with Bioclipse was BioCatalogue, which will serve as a website where people can annotate and categorize (web) services. While the project has been around for a while, the website was rather uninformative. Fortunately, the project is going to open up, and be more Bazaar-like. For example, they have now started a wiki and a mailing list. I hope these efforts will continue, so that I can contribute from my point of view!

The EMBRACE Registry is a project with similar goals and a rather nice outcome (which I learned about on Monday). It is actually anticipated to be replaced by or merged with BioCatalogue. So, all data I entered, cheminformatics workflows (look, no 'o'), will later be available from BioCatalogue too. That is already my first contribution to BioCatalogue. One enormously interesting feature of the Registry is that it allows uploading of code to test the service. This means the Registry will not only poll whether the service is still online (by checking the WSDL file), it will also test whether the service behaves properly. Now, immediate thoughts are mashups with MyExperiment. Each WSDL entry in the Registry would point to the MyExperiment workflows that use it, and the workflow page would indicate the status of all used WSDL services. This integration was already anticipated long before I thought about it, as the involved Cathedrals were nicely located on the same floor in Manchester.

Below is a screenshot from the EMBRACE Registry for the ChemSpider WSDL entry for a workspace I uploaded about a year ago to MyExperiment:

BTW, ChemSpider has an Advisory Board of which I am member, but it is also a classical (and intentional) Cathedral project. We do share common interests though, which makes us collaborate.

Why Important?
One recurrent theme in Open Source is "given enough eyeballs, all bugs are shallow". This surely applies to science as well. The difference between the two is that in current science the eyes only inspect with a delay of at least 6 months. Current practice is that research is finished (delay), and, when deemed publishable, written up as a paper (delay, and losing valuable information in the process, as you can read in my blog all the time), and published (even more delay). ONS changes that, and so do Bazaar-like open source projects, such as the CDK, Jmol and Bioclipse. The bugs are present, whether we like it or not, not just in source code, but in science too. Theories get overthrown, but why should we like the long delays of current scientific good practice? Hate it! Work around it. Use the Bazaar. Use ONS!

Now, ONS actually needs Open Source, allowing them to deal effectively with the data they produce; to allow extraction of new scientific knowledge from the measurements. If Rajarshi and Pierre had not made their efforts, others could not easily join in, leading to those much hated delays. Bugs should be shallow, and openness allows us to make those bugs visible. We can prove that there is a bug, without having to reproduce data ourselves, which would lead to those nasty delays again. Just copy the data, compare it to your own, do your analysis.

One recent project in open source chemistry dealing with making bugs visible, is the web page set up by Andreas Tille for the DebiChem project. His page summarizes the bugs listed for the chemistry in Debian (which includes the Blue Obelisk projects Avogadro, BODR, CDK, Chemical MIME Data, Kalzium and OpenBabel):

This data analysis helps the projects being analyzed.

This brings me to a last topic for this blog: packaging using Open Standards. In order to allow those eyeballs to spot bugs, it is of the utmost importance to package your results in Open Standards, and not just one, but likely many. For Open Source projects this ultimately means Distribution Packages (deb or rpm). If that goal has been achieved, you know your results can be read by anyone. Software should be installable (make, ant, cmake, etc), and Data should be readable (no PDF, but RDF, XML, JSON, or whatever standard). Preferably not Excel, as this is too free format (as Rajarshi also indicated), but with some added conventions it may do well. Blue Obelisk projects are generally doing well in terms of packaging.

For the CDK, which already is reasonably well packaged, I am currently working on Eclipse and Maven2 packages. The former is already being used by Bioclipse, while the latter aims at Jumbo (which has just seen a new release. Jim, I'm happy to see the CMLDOM/Jumbo split!), CDK-Taverna, and possibly a third (Paula, what do you plan to use it for?). The POM export is not fully working yet, but with four research sites involved in this Open Project, I'm sure we'll work it out.

The bottom line is, scientific progress would benefit so much from a Bazaar approach. And the key thing is not collaboration; that's something you can do in a Cathedral-like fashion too. No, the key thing is to be Open and allow anyone, even your worst nightmare, to comment on what you do. Let him prove you wrong, openly, that is.

OK, there it is. My open notebook entry for this week. Now you know what I have been up to this week.

Tuesday, November 04, 2008

Next generation asynchronous webservices #2

Getting back to some webservice stuff (see part #1 of this series)... actually, I'll use cloud service from now on, since web service is reserved for SOAP/WSDL (see my EMBRACE presentation). Let me present this bit of JavaScript I just ran in Bioclipse2:
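// connect to the cloud service
service = xws.getService("");
// pick one of the functions the service offers
f = service.getFunction("calculateMass");
// ask the function for its input/output schemata
ios = f.getIoSchemataSync(9000);
// build a data model for this function from those schemata
iof = xws.getIoFactory(ios);
smiDoc = iof.createSmilesDocument();
// invoke the function synchronously and unpack the result
result = f.invokeSync(smiDoc.toString(), 9000);
obj = iof.getOutputObject(result);
print("Mass: " + obj.getStringValue());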
At first, it might look a bit verbose to just calculate the mass of a molecule, and it is, and it is not even written in XML. Hahahaha

Anyway, the code rocks, thanks to Johannes' great work on his xws4j library! I'll explain the script. The first line gets Bioclipse online using a Jabber account, which you set via Bioclipse's preference pages. The next few lines allow you to connect to a cloud service, this one running on and called cdk. With the getFunctions() method we query which functions are available (called ports in WSDL, if I am not mistaken), from which we pick the calculateMass one.

And then the action starts. One nice feature of the IO-DATA proposal is that the function itself defines the XML Schema it uses for input and output, and does not rely on WSDL to do that (or maybe recent SOAP specs allow that too). So, we query the function for its schemata, and then something funky happens: we order the xws4j library to create a data model on the fly for this service! From this we get a Java data model for the service. This allows us to use createSmilesDocument() and setSmiles(). That's function-specific stuff!
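To make the "data model on the fly" idea a bit more concrete, here is a toy sketch in plain JavaScript. It is not how xws4j works: the schema is a simple made-up object rather than real XML Schema, makeIoFactory is a hypothetical helper, and the generated XML is not escaped or namespaced.

```javascript
// Hypothetical schema description: element name -> list of field names.
// A real service would publish XML Schema; this plain object is a stand-in.
const ioSchema = { Smiles: ["smiles"] };

const capitalize = (s) => s.charAt(0).toUpperCase() + s.slice(1);

// Build a factory whose create<Element>Document() methods return documents
// with one set<Field>() per schema field and a toString() that emits XML.
function makeIoFactory(schema) {
  const factory = {};
  for (const [element, fields] of Object.entries(schema)) {
    factory["create" + element + "Document"] = () => {
      const values = {};
      const doc = {
        toString() {
          const body = fields
            .map((f) => "<" + f + ">" + (f in values ? values[f] : "") + "</" + f + ">")
            .join("");
          return "<" + element + ">" + body + "</" + element + ">";
        },
      };
      for (const f of fields) {
        doc["set" + capitalize(f)] = (v) => { values[f] = v; };
      }
      return doc;
    };
  }
  return factory;
}

const iof = makeIoFactory(ioSchema);
const smiDoc = iof.createSmilesDocument();
smiDoc.setSmiles("CCO");
console.log(smiDoc.toString());
```

The point is only that the method names (createSmilesDocument, setSmiles) are derived from what the function publishes, rather than being written by hand for each service.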

Of course, we do not have to do that. For example, the second function I wrote (generate3Dcoordinates) eats and spits CML, and I'd rather rely on CMLDOM or CDK as data model then. But more on that later...

The Bioclipse xws4j plugin actually puts the data model in my workspace, so that I can easily introspect the API:

The last three lines invoke the function (synchronously, as it's cheap), and get the mass from the function output. BTW, I should stress that a function does not require any specific implementation regarding synchronous or asynchronous calls. You write one function, and can call it in either way you like. The library hides all IO-DATA details around that.

Monday, November 03, 2008

EMBRACE workshop in Uppsala

This Monday and Tuesday I will attend the EMBRACE workshop Understanding, creating and deploying EMBRACE compliant WebServices. I will present there the ongoing work in Bioclipse to support services and web services in particular. The sheets of the presentation will look like: