## Wednesday, September 23, 2009

### Extension point for running JUnit tests in an RCP Application instance?

One thing that has been on my wishlist is to be able to run the unit tests we have for Bioclipse from inside a running Bioclipse instance. That is, we would have Bioclipse Test Suite features on the update site, matching the functional features we have there. Each such test suite would run all JUnit tests we have for that feature. This has two advantages:

1. users can verify that their installation is working as intended
2. the development team can easily run the test suite on foreign systems, without the need to install a fully operational Eclipse with Bioclipse development workspace

Now, the tricky thing is likely the following. How do we get to run all test suites? That is, I don't want to have to run the suites for each feature separately. Of course, this is exactly what extension points are for.

So, my question is: has anyone set up a system like this? And is there an extension point that allows features to plug additional JUnit test suites into a larger test suite dynamically?

## Friday, September 18, 2009

### JChemPaint update: merging of patches and CDK statistics

With the JChemPaint workshop just passed, there is much work from UU and the EBI to be integrated. Moreover, Rajarshi just merged a lot of fixes from CDK 1.2.x into the master branch, which will be a big rebase too. That said, I need to do this to recalculate source code statistics for the CDK.

The current set of JChemPaint patches looks like:

The two topmost branches (bioclipse-2.1.x and 12-ebiStage) are actually staging branches: they hold patches that have not yet been integrated into the JChemPaint-Primary branch. Likewise, the 0-other branch is a staging branch for patches that are in or up for the review process for CDK master itself.

This will mean that I am now going to rebase all these branches once more.

### My HTML+RDFa homepage

Finally got around to adding a few more bits to my new science homepage: egonw.github.com. Cool thing about this new page is that it is HTML+RDFa, so, my new FOAF profile is embedded in the HTML:

At the bottom is a link to extract the RDF triples:

Next is to write a piece of code that creates HTML+RDFa+BIBO from a BibTeX file, and to write a plugin for Bioclipse to extract triples from HTML+RDFa.

## Thursday, September 17, 2009

### Really free chemistry books

With pleasure I read Analogue or Digital? - Both, Please. Funnily, I just created MP3s (or, preferably, Ogg Vorbis, which is superior but hardly supported by commercial companies, who seem to rather pay license fees) directly from the CD.

Anyway... the blog wanders off to Google introducing searchable books, many of them out of copyright. I was wondering how many chemistry books the pre-1923 book set included, and that actually adds up to about 41 thousand books, just for the chemistry search term.

There is quite cool stuff there, like the English translation of the works of Lavoisier.

This is really cool! I can just download this onto my eReader (which I don't have yet anyway, but my Dell laptop will do fine; if only the PDF wasn't broken), and this actually allows me to read all the stuff I read about when doing History of Chemistry in the last year it was given in Nijmegen, back in 1993. Which was funny in itself, as the course was for second-year students, but one of my introduction tutors suggested I take it, which I did. It was a great course, by a great teacher, btw! It is a shame that the course was lost from the curriculum, much like I hated to see electrochemistry and cheminformatics lost in Nijmegen. A severe and very regrettable loss of diversity in the education there.

Anyways, I'm going to need hours to browse all the goodies there. Did you spot the 1913 copy of the CRC Handbook of Chemistry and Physics yet?

I am looking forward to seeing people start text mining on these books... anyone?

## Sunday, September 13, 2009

### VRMS meme: how much non-free software are you running?

Over at Planet Ubuntu there is a meme running around VRMS (Virtual Richard M. Stallman, brilliant name!) which finds non-free software on your desktop. I uninstalled Sun's Java6, for which there is the OpenJDK6 alternative.

These are my current results:

    Non-free packages installed on egonw-laptop

    fglrx-modaliases          Identifiers supported by the ATI graphics driver
    linux-generic             Complete Generic Linux kernel
    linux-restricted-modules- Non-free Linux 2.6.28 modules helper script
    linux-restricted-modules- Restricted Linux modules for generic kernels
    skype                     Skype - Take a deep breath
    sun-java5-bin             Sun Java(TM) Runtime Environment (JRE) 5.0 (architectu
    sun-java5-demo            Sun Java(TM) Development Kit (JDK) 5.0 demos and examp
    sun-java5-jdk             Sun Java(TM) Development Kit (JDK) 5.0
    sun-java5-jre             Sun Java(TM) Runtime Environment (JRE) 5.0 (architectu

    Contrib packages installed on egonw-laptop

    flashplugin-installer     Adobe Flash Player plugin installer
    flashplugin-nonfree       Adobe Flash Player plugin installer (transitional pack
    msttcorefonts             transitional dummy package
    ttf-mscorefonts-installer Installer for Microsoft TrueType core fonts

    9 non-free packages, 0.4% of 2050 installed packages.
    4 contrib packages, 0.2% of 2050 installed packages.


## Friday, September 11, 2009

### Bioclipse, RDF and defeasible reasoning

Well, I have yet to read the paper in detail, but my new student Samuel is going to work for 20 weeks on defeasible reasoning with DrProlog in Bioclipse.

## Thursday, September 10, 2009

### Open Chemical Data #1: NMRShiftDB

As I reported earlier, progress is only possible if you can modify and redistribute. This is why Open Data, Open Source, and Open Standards are so important to us Blue Obelisk members. For data, proper licensing makes these two requirements possible, but more importantly, makes those rights explicit. Rich is running the nice Zusammen blog, but most of his entries are not Open Data. Even larger chemistry data repositories can be vague and have seemingly contradictory statements.

One project which did it right was the NMRShiftDB. They were ahead of their time and did pick a proper Open license. By current standards it is not the best data license (the GNU FDL), but it was the best at the time. To push real Open Chemical Data a bit more, I will create a series much like Rich's series, but with the restriction that the sources are clear about what rights they give users, and that those include the rights to modify and redistribute the data without unreasonable restrictions.

I will not say much about the database itself, and even less now, as I think the NMRShiftDB is well-known amongst my readers.

Moreover, I have set up a FriendFeed room, Open Chemical Data, where I will aggregate feeds of new molecules in these databases:

Now, the only problem is, I need candidates for this series, and cannot actually think of a third entry (the second being the Open Notebook Science Solubility data)... Want to help me out? Please let me know which chemical databases are using an Open Data license.

## Wednesday, September 09, 2009

### The Art of Programming: do not leak implementation details

At the JChemPaint workshop here in Uppsala, where we have Mark from Chris' group as our guest, we encountered an inconsistency in CDK 1.2: the bond stereochemistry did not yet follow the recently adopted pattern of using Class fields, which allows null to carry the semantics of undefined. Previously, the defaults for native values were confounded with set values. For example, an unset formal charge and a formal charge of 0 would both have the field value int = 0.
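
To make the confusion concrete, here is a minimal sketch in plain Java. The two holder classes are hypothetical and are not CDK types; they only show that a primitive int field cannot distinguish "never set" from "set to 0", while an Integer field can use null for undefined.

```java
final class FormalChargeDemo {

    // Hypothetical holder with a primitive field: Java defaults it to 0.
    static final class PrimitiveAtom {
        int formalCharge;
    }

    // Hypothetical holder with an object field: Java defaults it to null.
    static final class NullableAtom {
        Integer formalCharge;
    }

    // With an Integer, "undefined" and "zero" are two different states.
    static String describe(Integer charge) {
        return charge == null ? "undefined" : "charge " + charge;
    }

    public static void main(String[] args) {
        PrimitiveAtom primitive = new PrimitiveAtom();
        NullableAtom nullable = new NullableAtom();

        // The primitive field reads 0 although nobody set it.
        System.out.println("primitive default: " + primitive.formalCharge);

        // The nullable field reports that no value was set yet.
        System.out.println("nullable default: " + describe(nullable.formalCharge));

        // After explicitly setting 0, the two states are distinguishable.
        nullable.formalCharge = 0;
        System.out.println("nullable after set: " + describe(nullable.formalCharge));
    }
}
```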

So, I am now writing a patch which replaces the use of int in IBond.getStereo(). But instead of going for Integer, the patch is actually going to use an enumeration.

Now, getting to the Art of Programming... while writing patches in the CDK, you run into those lovely bits of code where intention is mixed with implementation details. They should not be, and often do not need to be, but they typically are. This is actually one reason why we now have stricter peer review in place. Below are two nice examples where intention is mixed with implementation detail.

Example 1

    int stereo = container.getBond(chiralNeighbours.get(i), atom).getStereo();
    if (stereo == 0) {
      // do something
    }

This code is bad because we have no clue what it is supposed to do. When should the if clause kick in? Be reminded that int = 0 has the confounded meanings of no stereochemistry and perhaps has stereochemistry, but no one ever bothered telling me. So, which of the two situations does the if clause apply to? My patch can only assume that both were applicable (following the actual implementation), though I don't think that makes sense on an algorithmic level. Had the author used CDKConstants.STEREO_BOND_NONE (which is the implementation of int = 0 for no stereochemistry), then I would have known what the code was intended to do. Instead, the author chose to reuse implementation details: a hardcoded 0.

Example 2

There is another instance of this problem. Look at this lovely piece of code:

    IBond bond = molecule.getBond(atomA, unplacedAtom);
    if (Math.abs(bond.getStereo()) < 2
        && Math.abs(bond.getStereo()) != 0) {
    }

This example also uses hardcoded values instead of the matching constants. Remember that int = 0 had the meaning of no stereochemistry, so I assume this code is determining whether stereochemistry is defined for the bond, making nice use of the fact that those situations at some point were coded as non-zero values. Moreover, it is only interested in a few stereochemistry definitions, and from the implementation I learn (and that actually makes sense at this location) that it is only interested in those stereochemistry types for which the first bond atom is the stereochemical center. This again is leaking implementation details, instead of using semantically meaningful constants.
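
For comparison, a minimal sketch of how the two checks read once the intention has a name. The enumeration and its value names are illustrative assumptions and not the CDK API. It is also an assumption of this sketch that UP and DOWN are the wedge types that start at the first bond atom.

```java
final class BondStereoDemo {

    /** Illustrative enumeration; the names are assumptions, not CDK constants. */
    enum Stereo { NONE, UP, UP_INVERTED, DOWN, DOWN_INVERTED }

    /**
     * Intention of Example 1, made explicit: the bond was explicitly marked
     * as having no stereochemistry. A null value (never set) does not count.
     */
    static boolean hasNoStereo(Stereo stereo) {
        return stereo == Stereo.NONE;
    }

    /**
     * Intention of Example 2, made explicit: a wedge bond whose first atom
     * is the stereochemical center (assumed here to be UP and DOWN).
     */
    static boolean startsAtStereoCenter(Stereo stereo) {
        return stereo == Stereo.UP || stereo == Stereo.DOWN;
    }

    public static void main(String[] args) {
        for (Stereo stereo : Stereo.values()) {
            System.out.println(stereo
                + " noStereo=" + hasNoStereo(stereo)
                + " startsAtStereoCenter=" + startsAtStereoCenter(stereo));
        }
        // An unset value is neither "no stereo" nor a wedge type.
        System.out.println("unset noStereo=" + hasNoStereo(null));
    }
}
```

The method names now carry the meaning that the hardcoded 0 and the Math.abs() trick left to the reader.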

## Tuesday, September 08, 2009

### Updated Bioclipse SDK: the Eclipse 3.5 version

Last Friday, the Bioclipse 2.1 development series moved to Eclipse 3.5, so I had to update the Bioclipse SDK too, which we developed earlier.

With a new Eclipse version also come new screenshots to talk you through the process of setting up a new Bioclipse manager plugin.

Step 1
Right click in your workspace navigator, and choose New -> Project:

Step 2
And select to create a new Plug-in Project:

Step 3
Give a project name, such as net.bioclipse.xml:

Step 4
Tune the ID, Version, Name, and Provider to your liking:

Step 5
Then select Bioclipse Manager:

Step 6
The next wizard page is specific to the Bioclipse manager, and asks for a manager namespace, which will be used as prefix in the JavaScript Console. For example, if I make the namespace xml, then I will type xml.someMethod() inside the JavaScript Console. The default manager name is typically OK:

Then click Finish and let Eclipse set up the new project.

Step 7
Because I have not figured out yet how to add Import-Package to the MANIFEST.MF programmatically, you will have to do this manually. Add the last line of the next screenshot to the MANIFEST.MF of your new plugin:

## Saturday, September 05, 2009

### NMRShiftDB RDF #2: Some statistics

This morning I had some more fun, and since the statistics view on the NMRShiftDB server is down, I thought I could recalculate the statistics myself. Because the current RDF version of the data does not include all information yet, I cannot reproduce all of them. On the other hand, I can determine some other interesting statistics.

Spectra per spectrum type
One of the statistics given on the aforementioned page is the number of spectra per nucleus. This can be recalculated with the following SPARQL:

The results for the 1.3.3 release are:

| nucleus | count |
|---------|-------|
| 13C     | 21958 |
| 1H      | 3031  |
| 11B     | 326   |
| 17O     | 131   |
| 15N     | 79    |
| 195Pt   | 68    |
| 19F     | 50    |
| 31P     | 38    |
| 73Ge    | 18    |
| 33S     | 8     |
| 29Si    | 5     |

I am a bit surprised by the count for the silicon NMR spectra, as I would have thought I alone had entered more than just five.

Molecules with the most spectra
It turns out that molecules in the 1.3.3 NMRShiftDB release have at most 7 spectra, as I can calculate with:

That is going to change, as the paper I am digitizing now (doi:10.1021/jo971176v) has carbon and hydrogen NMR spectra for 7 solvents for each compound :) It should be possible to summarize the number of molecules for each number of spectra per molecule, but I did not manage to get this SPARQL to work out well.

BTW, did you know you can find reprint PDFs of a paper (if any; this one happens to have a PDF copy) with Google using the title in quotes and filetype:pdf? Try this query. The top hit was molecule 10016314 (RDF), which has 4 13C spectra, one 15N and two proton NMR spectra.

Molecules with the most different nuclei
As we already saw in the first SPARQL query, there are 11 different nuclei in the database, though carbon and hydrogen are by far the most abundant spectra. I like diversity, so one statistic I find interesting is which molecules have spectra for the most different nuclei. This is done with the query:

It shows that molecule 10023801 (RDF) has 5 different NMR types: 13C spectra, one 15N, 29Si spectra, one 17O, and 1H spectra. Unfortunately, the compound also has chlorines, so it does not qualify as a molecule for which NMR spectra are available for all its elements.

### NMRShiftDB RDF #1: Spectra by InChI

Originally, I wanted to include a SPARQL query in yesterday's blog post showing how to retrieve NMRShiftDB spectra based on an InChIKey, but it failed horribly. I have yet to discover why. This morning I discovered that it is specific to that field, and that doing the same thing with the InChI is no problem:

## Friday, September 04, 2009

### NMRShiftDB enters rdf.openmolecules.net #2: SPARQL end point with Virtuoso

About 6 months ago I reported about my efforts to RDF-ize the data from the NMRShiftDB. Since then, time was consumed by many other things, but now that Bioclipse can query SPARQL end points, that I want to contribute the triple set (it is GNU FDL-licensed) to Bio2RDF, that a student started working in my group (now larger than just me :) on reasoning on life sciences data, and that I recently contributed my 1000th NMR spectrum to the database, I thought it was time to finally reinstall Virtuoso.

There are precompiled binaries for Ubuntu and Debian, but Michel encouraged me to use version 6 when he visited us. And so I compiled and installed 6.0.0.TP1 on the public server, while I do have the binary debs for 5.0.12 on my laptop. With some basic Apache magic, I hooked up the SPARQL end point of the server to the web:

    <Proxy /nmrshiftdb/sparql>
      RewriteEngine On
      Allow from all
      ProxyPass        http://localhost:8890/sparql
      ProxyPassReverse http://localhost:8890/sparql
    </Proxy>

The nice thing about this is that I can set up multiple servers, allowing me to keep incompatibly licensed data sets apart (see Open Data: license, rights, aggregation, clean interfaces?), which is the same approach Bio2RDF is taking.

The end point now offers about 278887 triples, but this will soon rise as I make more content from the original SQL database available. The data is from the 1.3.3 release by Chris' team, and does not include my 1000th spectrum.

Getting the data into the database was not trivial either. The documentation suggests WebDAV, and that indeed worked for me once, after using the curl approach suggested here. But upon a second upload, the data again did not enter the store. The ultimate solution was to use the iSQL interface, with the following SQL:

    DB.DBA.RDF_LOAD_RDFXML_MT(
      file_to_string_output('/tmp/nmrshiftdb.rdf'), '',
      'http://pele.farmbio.uu.se/nmrshiftdb'
    );

Scientifically, this progress is not overly interesting, although it makes it very clear that you really should not have to settle for proprietary and non-semantic formats for anything. But, to me, this is mostly a technological success of great importance: I can now share really large sets of RDF data.

Querying this data is simple with SPARQL, and the results are available in various formats, such as JSON, which makes it easy to integrate in third-party applications or Google Wave robots (did I hear someone say NMRShifty?). As I have blogged before, SPARQL is an excellent tool to aggregate scientific data prior to data analysis. And I will demo more interesting queries later this month.
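
As a rough illustration of how a third-party application could ask for JSON, here is a minimal Java sketch that only builds the request URL. The `query` parameter is the standard SPARQL protocol parameter. The `format` parameter is what Virtuoso accepts as far as I know; an HTTP Accept header is the more portable route. The end point URL is an assumption derived from the proxy path and host name mentioned above, and the query is a generic placeholder, not one of the NMRShiftDB queries.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SparqlRequestDemo {

    /** Builds a GET URL with the query and the desired result format URL-encoded. */
    static String buildRequestUrl(String endpoint, String query, String format) {
        return endpoint
            + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
            + "&format=" + URLEncoder.encode(format, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed public address of the end point; adjust to the real one.
        String endpoint = "http://pele.farmbio.uu.se/nmrshiftdb/sparql";

        // Placeholder query: it makes no assumptions about the vocabulary used.
        String query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5";

        String url = buildRequestUrl(endpoint, query, "application/sparql-results+json");
        System.out.println(url);
    }
}
```

The resulting URL can be fetched with any HTTP client, after which the JSON can be handed to whatever parser the application already uses.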

## Wednesday, September 02, 2009

### Open Knowledge: Reproducibility in Cheminformatics with ODOSOS

Below are the slides of my presentation of last Monday (see my earlier blogs):

### Google Wave robot for CDK functionality

I was really happy to hear early last week that I was invited to take part in the Google Wave beta, and received my account details this Monday, while attending (and speaking at) the GDCh Wissenschaftsforum Chemie 2009. Yesterday was a travel day, and while working on course material for the Pharmaceutical Bioinformatics course that uses Bioclipse, I set up an Eclipse environment for development of a wave robot. Documentation was very clear, and deployment on Appspot was one click on the appropriate button. Great work from the people from Google! It was all so easy, I could not resist pushing things a bit further, and looked carefully at other robots, like ChemSpidey by Cameron and Igor by Euan, to see how text replacement is done, and wrote my first functional robot, CDKitty (chemdevelkit@appspot.com):

It seems to be a policy that wave robot names end with -y, so CDKitty sounded somewhat appropriate. Anyways, the robot is not overly functional yet, but it has a profile (which took some extra googling) and one function, mwOf. Add the robot to your wave and prefix a molecular formula with mwOf:, and CDKitty will calculate the molecular weight on the fly. Clearly, this opens up a whole new application world for the CDK, and you can leave feature requests at the issue tracker of the project home at GitHub. Patches are most welcome too! :)

BTW, it seems I messed up the regular expression, which seems not to be including the last digit (filed as issue 1).
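
To show the idea, here is a minimal plain-Java sketch of an mwOf: command. This is not CDKitty's code and it does not use the CDK: the regular expressions, the class name, and the tiny table of rounded atomic weights are all assumptions made for illustration. The formula pattern uses `\d*` after every element symbol, so a trailing digit is part of the match.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MwOfDemo {

    // Rounded atomic weights for a few elements only (illustrative table).
    private static final Map<String, Double> WEIGHTS = Map.of(
        "H", 1.008, "C", 12.011, "N", 14.007, "O", 15.999);

    // "mwOf:" followed by one or more element symbols with optional counts.
    private static final Pattern COMMAND =
        Pattern.compile("mwOf:((?:[A-Z][a-z]?\\d*)+)");

    // One element symbol with an optional count, e.g. "C2" or "O".
    private static final Pattern ELEMENT =
        Pattern.compile("([A-Z][a-z]?)(\\d*)");

    /** Returns the formula following "mwOf:" in the text, or null if absent. */
    static String extractFormula(String text) {
        Matcher matcher = COMMAND.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    /** Sums the atomic weights; rejects unknown elements and stray characters. */
    static double molecularWeight(String formula) {
        Matcher matcher = ELEMENT.matcher(formula);
        double total = 0.0;
        int consumed = 0;
        while (matcher.find()) {
            if (matcher.start() != consumed) {
                break; // gap between tokens: handled by the check below
            }
            Double weight = WEIGHTS.get(matcher.group(1));
            if (weight == null) {
                throw new IllegalArgumentException("Unknown element: " + matcher.group(1));
            }
            int count = matcher.group(2).isEmpty() ? 1 : Integer.parseInt(matcher.group(2));
            total += weight * count;
            consumed = matcher.end();
        }
        if (formula.isEmpty() || consumed != formula.length()) {
            throw new IllegalArgumentException("Cannot parse formula: " + formula);
        }
        return total;
    }

    public static void main(String[] args) {
        String formula = extractFormula("The solvent was mwOf:C2H6O in this run.");
        System.out.println(formula + " -> "
            + String.format(Locale.ROOT, "%.3f", molecularWeight(formula)));
    }
}
```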

Almost forgot to add this: many thanx to Cameron for the insightful discussions we had over apple cider, Weisse and German dinner on Monday evening!