Monday, August 30, 2010

Added a Simile Timeline to my homepage

I finally got around to updating my homepage. It is not entirely done yet, but it was overdue. The changelog includes removal of the sidebar (which did not look nice when the page was embedded in my blog as the Home tab), a foaf:interest RDFa annotation, a ToC line just below the header, and a Simile Timeline widget with my research positions and a selection of peer-reviewed papers:
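As an aside, a foaf:interest annotation in RDFa can be as small as this sketch (the URIs and the surrounding markup are made up for illustration; the actual markup on my homepage may differ):

```html
<div xmlns:foaf="http://xmlns.com/foaf/0.1/" about="#me">
  I am interested in
  <a rel="foaf:interest"
     href="http://en.wikipedia.org/wiki/Cheminformatics">cheminformatics</a>.
</div>
```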

Tuesday, August 17, 2010

#ACS_Boston #acsrdf2010? It's all about Open Standards!

Less than a week left before the three ACS RDF 2010 sessions! I am both nervous and excited! We really got a great team of speakers together. As a final call for people still not sure whether they want to attend, I sent out a few emails. On Monday we are competing with the big JCIM 50th anniversary, but really, that is 50 years of old cheminformatics, whereas our session is about the next 50 years. Your call ;) I was told the meetings are in two adjacent rooms, so you should be able to see a bit of both.

Besides the really interesting list of chemical domains covered in these talks (see below), this meeting, though I have not advertised it as such, is really about Open Standards in cheminformatics. This is what I just wrote to the Blue Obelisk mailing list:
    Dear ACS-Boston 2010 participants,
    with this email I would like to ask your
    attention to the three Semantic Web
    sessions on Sunday morning, and Monday
    morning and afternoon. During these
    sessions 15 presentations will be given
    on how the Resource Description Framework
    is changing cheminformatics.
    It is of special interest to the Blue
    Obelisk community that the family of RDF
    technologies (RDF, OWL, SPARQL) are all
    successful Open Standards, building on
    older and other Open Standards like HTTP,
    URI, XML (or JSON or Notation3 or
    Turtle), etc.
    Moreover, these technologies are not
    restricted to one particular domain, as
    MDL/Symyx molfile and CML(XXX) are; that
    part is covered by the Web Ontology
    Language (OWL)... and each ontology can
    itself be (and of course should be) an
    Open Standard.
    In three sessions, links to computational
    chemistry, knowledge management, and
    general applications are demonstrated,
    with a wide variety of chemical
    disciplines as context, including
    toxicology, chemometrics, eScience,
    databases, wet lab experimentation,
    lipidomics, text mining, drug discovery
    and health care, gene-disease-pathway
    modeling, and chemical patents.
    All people interested in Open
    Cheminformatics should have a special
    interest in these three sessions, and I
    am very much looking forward to seeing
    you all next Sunday and/or Monday!
    The full program can be found at:
    (please retweet and share with friends)
    With kind regards,
    Egon Willighagen

Saturday, August 14, 2010

Specifying unit test dependencies with JExample

A while ago I asked on StackOverflow about options to specify unit test dependencies. An example from the CDK would be that aromaticity detection in some molecule requires that all atom types are correctly perceived. I could test both in separate unit tests, but with JUnit I cannot have the aromaticity test ignored if the atom typing already failed. Being able to specify such dependencies is useful: the failing aromaticity test may well be caused by the faulty atom typing, yet in isolation it seems to imply that something is wrong with the aromaticity detection algorithm itself.

One of the answers was JExample. Consider this code:
@RunWith(JExample.class)
public class MoleculeTest {

  @Test
  public MoleculeFactory testNewMoleculeFactory() {
    Assert.fail("too bad");
    return new MoleculeFactory();
  }

  @Given("#testNewMoleculeFactory")
  public IMolecule testNewMolecule(MoleculeFactory factory) {
    Assert.fail("too bad");
    return factory.getImmutable();
  }
}
Reading this Java code requires some experience with JUnit4, but those who already know JUnit4 will also see that the code is somewhat unusual. First of all, we define a special class to run the tests with: JExample.class. A second thing to note is that the top test method actually returns something. The bottom test method is also different: instead of being annotated with @Test, it is annotated with @Given. This latter annotation is introduced by JExample to define the dependencies between the tests. The parameter of this annotation defines which test method it depends on. Moreover, it is now also clear what happens with the return value of the top test method: it is passed as argument to the second test method.

Just to show that the system actually works, you can find a screenshot of the Eclipse JUnit View (the JExample update site did not work with Eclipse 3.6, so I have the JExample jar defined as a build dependency in the project instead):

Clearly, only one test failed, while the testNewMolecule() method was ignored.

Now, I was wondering whether the first method could be reused, and if so, whether it would construct a new object. And it does, as is clear from the output of this code:

  @Test
  public MoleculeFactory testNewMoleculeFactory() {
    return new MoleculeFactory();
  }

  @Given("#testNewMoleculeFactory")
  public IMolecule testNewMolecule(MoleculeFactory factory) {
    System.out.println("Factory1: " + factory);
    return factory.getImmutable();
  }

  @Given("#testNewMoleculeFactory")
  public IMolecule testNewMolecule2(MoleculeFactory factory) {
    System.out.println("Factory2: " + factory);
    return factory.getImmutable();
  }
which outputs:

Factory2: com.github.egonw.odk.model.MoleculeFactory@7bd63e39
Factory1: com.github.egonw.odk.model.MoleculeFactory@42b988a6

This seems pretty useful. I am not sure yet if it solves all my dependency requirements, though. For example, what I would really like to do is link more complex tests to more than one simpler test, where it is clear that the complex test will fail if one or more of the simpler tests fail. However, the JExample examples suggest that I can only define one dependency, whereas, for example, aromaticity detection depends on correct atom type perception, but also on ring detection. For now, this is a welcome extension of JUnit.

Further reading:

A. Kuhn, B. Van Rompaey, L. Hänsenberger, O. Nierstrasz, S. Demeyer, M. Gaelli & K. Van Leemput (2008). JExample: Exploiting Dependencies Between Tests to Improve Defect Localization. In P. Abrahamsson (Ed.), Extreme Programming and Agile Processes in Software Engineering, 9th International Conference, XP 2008, Lecture Notes in Computer Science, pp. 73-82. doi:10.1007/978-3-540-68255-4_8

The Molecular Chemometrics Principles #3: stand on shoulders

I have blogged about two Molecular Chemometrics principles so far. Peter's post #solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ??? gives me enough basis to write up a third principle:

Molecular Chemometrics Principles #3: We make scientific progress if we build on past achievements.

Sounds logical, right? In practice, the way we share our cheminformatics knowledge makes this standing on shoulders pretty difficult. But there is one particular aspect to which I would like to draw your attention: you can contribute by making clear whose shoulders you would like to stand on. That is, where do you prefer to put your effort, and what message would you like to give to your user community?

In the aforelinked post, Peter asks where he should upload his data, and he suggests BioTorrent (see my review BitTorrents for Science), DSpace, and CKAN. Now that his Green Chain Reaction has been picked up (see these few blog posts), the resulting data should be distributed as widely as possible. The exact location does not really matter...


By picking where you upload, you make a statement to your community: "Look guys, we are distributing our data via Foo, because we believe those guys are doing good work! Perhaps you can support them too."

This principle does not only apply to data; it applies to software too. For example, when iChemLabs and RSC ChemSpider Announce Partnership, they do not just improve the user experience of ChemSpider (to which I certainly won't object), but they also imply "Look dudes, your product is just not good enough and we do not want to help you improve it either". Of course, ChemSpider has every right to do so, and for them to succeed it is crucial to make decisions like this. Fortunately, ChemDoodle is GPL.

Every project with a user base has the opportunity to support shoulders, if only it visibly stands on them. By merely discussing the Green Chain Reaction, I show my support for this social web experiment. You can too. Use these powers wisely. May the McPrinciples be with you.

Thursday, August 12, 2010

The Molecular Chemometrics Principles #2: be clear in what you mean

I noted earlier this week that [d]uring the week [in Oxford], someone (name and address known to the editorial office) commented on the fact that my blog posts are somewhat difficult to follow; that is, it's often not clear why I am posting what I am posting. This triggered the start of a series of principles in the field I coined Molecular Chemometrics, and the promise that I will try to indicate in each blog post which of these principles it relates to. Just to put things in a bit more perspective; to make it a bit clearer why I am blogging about that bit; just to be clear in what I mean.

Now, the first principle was about the need for access to data (McPrinciple #1). This principle goes without saying, one would think, but it is not widely accepted yet. This is why Open Data promotion is still needed. For example, data in papers is still not freely redistributable, as Peter points out once again.

Anyway, this post is not about McPrinciple #1, but about the second principle.

Molecular Chemometrics Principles #2: In order to reproduce cheminformatics studies you need to be able to understand the input data.

Readers of my blog will surely recognize this theme. Clearly this theme explains my past fetish for the Chemical Markup Language, and my more recent work on the Resource Description Framework.

And it is so easy to jump to conclusions. Easy to make mistakes. And this is not just at the receiving side; the sender may have accidentally made a mistake, or accidentally left something unclear, causing incorrect assumptions, and therefore errors in the cheminformatics computation. Now, if the data were semantically (clearly) annotated, and the meaning were clear, it would also be trivial to see when a mistake had sneaked in. Think of it as a check bit.

"Well, isn't this a bit exaggerated," you might say. Perhaps, perhaps not. A simple, recent example. We all know SMILES, right? And we all know that lower case element symbols indicate aromaticity, right? That is, c1ccccc1 is aromatic, right? So, what's the problem then?

Now, consider the SMILES string c1ccc1. Lower case carbon element symbols, so aromatic, right? Oh, wait...
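The reason, of course, is that aromaticity is not a matter of atom case alone: c1ccc1 would be cyclobutadiene, with four pi electrons. As a toy illustration (this is plain Java written for this post, not CDK code), the Hückel 4n+2 rule already separates the two strings:

```java
// Toy illustration: the Hückel 4n+2 rule explains why c1ccccc1 (benzene,
// 6 pi electrons) can be aromatic while c1ccc1 (cyclobutadiene, 4 pi
// electrons) cannot, even though both are written with lower case atoms.
public class HueckelCheck {

    // each lower case carbon in a simple carbocycle contributes one pi electron
    static boolean satisfiesHueckelRule(int piElectrons) {
        return piElectrons >= 2 && (piElectrons - 2) % 4 == 0; // 4n + 2
    }

    public static void main(String[] args) {
        System.out.println("c1ccccc1: " + satisfiesHueckelRule(6)); // true
        System.out.println("c1ccc1:   " + satisfiesHueckelRule(4)); // false
    }
}
```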

Therefore, be clear in what you mean. It saves us from a lot of trouble.

Tuesday, August 10, 2010

XHTML+RDFa: chemical examples

Steffen asked me whether I could also provide a few examples of how to actually put RDF triples in the HTML, as the template I gave yesterday is a mere empty canvas to draw the triples on. There are actually various examples in my blog, which I will summarize here.

Before I start, I would like to put some emphasis on the following RDFa pattern: an RDF resource that serves as subject is always mapped to an HTML element. This can be a div element, but also other elements, as we will see in the examples.

A molecule SMILES
The oldest RDFa example in my blog is from 2006. That was almost two years before the final Recommendation, and is not quite accurate anymore. But here's the correct version:
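A sketch of what that corrected snippet may have looked like (the chem: prefix and its namespace URI are assumptions on my part):

```html
<span xmlns:chem="http://www.blueobelisk.org/chemistryblogs/"
      about="#ethanol">
  <span property="chem:smiles">CCO</span>
</span>
```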

This example shows how to embed the SMILES string CCO semantically. The outermost span element is used to define the subject of the RDF triple, using the @about attribute to define the URI of the resource: #ethanol. Note that this URI is relative to the URI of the HTML page in which it is embedded. Later we will see an example with a full URI.

But I don't want to hack HTML!
Yeah, fair point. Just make a point of it with your publisher when you submit a new paper. It is the duty of the publisher and your software vendor to do this right. In 2008 I wrote a small Ubiquity script to automagically convert an InChI into semantified HTML content. But I am not sure this script still works. If you are interested, let me know, and I will revive the Firefox thingy.

And why would I want to do it anyway??
Because software can more easily understand what you mean. This is why Google is now pushing rich snippets. Chemical blogspace understands them too, allowing you to see blog posts about molecules on other webpages. With a simple bit of JavaScript, you can enrich your HTML sites with semantic chemistry yourself. Bioclipse also has no problem with extracting the RDF from HTML. Even Firefox can understand it. Really, there is no end to it.

Of course, why you should do this comes down basically to Molecular Chemometrics Principle #2, but I have not written that one up yet (see also McPrinciple #1).

Reporting problems with molecular representations
More recently, I reported about using RDFa in a human-readable log file for computations I am doing (see Scripts logs as HTML+RDFa: mix free text reporting with CSV). That code looks like:
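A sketch of what that snippet may have looked like (the um: predicate names and the error message are invented here for illustration):

```html
<div about="#200234" typeof="um:Molecule">
  CID <span property="um:cid">200234</span>:
  <span property="um:formula">Ti1</span>
  <span rel="um:hasError">
    <span about="#error0" typeof="um:Error"
          property="um:message"
          content="could not perceive atom type"/>
  </span>
</div>
```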

This example uses a div element to host the subject resource. Again, the resource URI is relative to the URI of the document, e.g. this one. We can also note a new attribute, @typeof, which is here used to define the rdf:type of the #200234 resource.

This code snippet does not define the um namespace, which was done elsewhere in the HTML. Moreover, this code snippet does not actually reuse existing ontologies, which is highly recommended. The upcoming RDF symposium in Boston will tell you all about chemical ontologies in the RDF world (see this detailed program, which itself is HTML+RDFa!). But, if you would just overlook the ad hoc namespaces used, you might appreciate the nesting: besides the compound (#200234), a second resource is defined (#error0). In total, this example contains six triples.

Meanwhile, the output simply looks like:
CID 200234: Ti1

A molecule table
This third, and for now last, example shows several other features. This HTML snippet shows a one-entry molecule table, very much like those molecular spreadsheets in Excel, but then right here in your web browser. (Can you imagine what happens if we mash this up with JavaScript molecular viewers? Enjoying the idea already :)
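A sketch of what such a one-entry molecule table could look like (the ex: prefix, its namespace URI, and the descriptor predicate names are all assumptions):

```html
<table xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ex="http://example.org/molecules/">
  <tr about="[ex:butane]">
    <td property="dc:title">n-Butane</td>
    <td property="ex:boilingPoint">-0.5</td>
    <td property="ex:descriptor1">10</td>
    <td property="ex:descriptor2">1</td>
    <td property="ex:smiles">CCCC</td>
  </tr>
</table>
```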

First of all, the prefix is used to construct an absolute URI for the molecule. The table then gives some properties of the molecule: its name (using Dublin Core, though perhaps rdfs:label is better), the boiling point (nicely encoded as t0 in this 1947 paper), two cheminformatics descriptors, and the SMILES, using the same approach as the first example in this post.

The output of this table looks like:
n-Butane -0.5 10 1 CCCC

I will shortly blog about the source of the above code snippet, but you are invited to go ahead and check out my GitHub activity (RSS).

Steffen, I think these examples should get you pretty far, but please let me know if you have further questions!

Monday, August 09, 2010

XHTML+RDFa Template

There was some more discussion on machine readability of notebooks again, something I have blogged about for a long time now.

One technical approach to implementing the idea of adding semantics to text in HTML pages is RDFa. But as with any technology, scientists seem to have an in-built deficiency in thinking clearly about it, and anything beyond being able to format your bibliography with the correct bold and italic seems to be a bit much to ask.

Anyway, for future reference, this is a basic HTML framework for embedding RDFa:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    version="XHTML+RDFa 1.0" xml:lang="en">
  <head>
    <title>ACS RDF Symposium, Boston, August 2010</title>
  </head>
  <body>
  </body>
</html>

Perhaps you prefer the raw source.

The Molecular Chemometrics Principles #1: access to data

The meetings in and around Oxford were great! I already wrote that the Predictive Toxicology workshop was brilliant (see Oxford... #1 and Oxford... #2), but I also very, very much enjoyed meeting up with Dan and Nico! During the week, someone (name and address known to the editorial office) commented on the fact that my blog posts are somewhat difficult to follow; that is, it's often not clear why I am posting what I am posting.

Indeed, I am not particularly one of those bloggers who spend tree after tree explaining in great detail what is going on. I do make a lot of use of hyperlinking; much more than the average blogger. I actually assume that readers follow links, to read about the perspective of a blog post. But we all know that scientists do not read the papers cited in a paper they are reading, so who am I to assume blog readers would start doing that with blogs :)

Well, since principles seem popular, it might be a good start for the grand scheme that is behind this blog: the Molecular Chemometrics Principles. Hence, this first post about the why. The why is simply to provide a reference frame for what I am blogging about. In the next few posts on these McPrinciples (is that a catchy name, or what?) that will appear over the next two weeks, I will outline the code of chem-bla-ics. Moreover, from now on I will tag all my posts with the reason why I wrote that post. I am sure that will not be too helpful for the occasional reader, but for anyone who is serious about chem-bla-ics, this will be a genuine gold mine for pattern recognition and data mining.

So, here goes.

Molecular Chemometrics Principles #1: In order to reproduce cheminformatics studies you need access to the input data.

The reason for this is that statistical modeling very much depends on the data on which the modeling was done, the patterns were recognized, etc. Therefore, without the input data, it is practically impossible to accurately reproduce results. Fortunately, the acceptance of the importance of access to data (e.g. as Open Data) is slowly gaining momentum in science.

Further reading: Molecular Chemometrics, 2006 (doi:10.1080/10408340600969601)

Cleaner CDK Code #8: the Java Naming Conventions and Camel Casing

Another simple approach to making your code more readable is to adhere to the Java naming conventions. They prescribe that variables start with a lower case character, as do method names. Classes and interfaces, however, start with upper case characters. By all using the same conventions, we need to learn only one scheme and can more easily recognize what are variables, methods, and classes. Have a look at these naming conventions by Oracle and the concept of Camel Casing, heavily used in Java.

BTW, note that the CDK conventions deviate from the Oracle conventions just linked to with respect to variable names. We have not found a way to have PMD differentiate by where variables are used, and therefore require at least three characters per variable name, to enforce some meaningful naming, which Oracle's conventions require too.
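A small hypothetical example (class and member names invented, not actual CDK code) showing these conventions at work:

```java
// Classes and interfaces: UpperCamelCase.
// Methods and variables: lowerCamelCase.
// CDK extra: variable names of at least three characters.
public class AtomContainerExample {

    private int atomCount;                   // variable: lowerCamelCase, >= 3 chars

    public int getAtomCount() {              // method: lowerCamelCase
        return this.atomCount;
    }

    public void setAtomCount(int atomCount) {
        this.atomCount = atomCount;
    }
}
```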

Previous topics

Friday, August 06, 2010

Oxford... #2

The Predictive Toxicology meeting is over. It was a great meeting, by any standard. Very much recommended, and many thanx to Barry for the organization! The meeting was a true workshop, with a mix of presentations and getting work done. I participated in a group that looked at mutagenicity of potential anti-malaria drugs from the datasets of GSK and Novartis recently released as Open Data. We used various tools to predict properties, and plan to make all our results freely available soon. Otherwise, it was also great to meet Nina again (with whom I talked about OpenTox), and to meet other CDK users, including Patrik (SMARTCyp, doi:10.1021/ml100016x) and David (Inkspot).

In the afternoon I walked around a bit more in Oxford, did some more shopping... and visited the Apple shop and played with an iPad. It's indeed a great piece of hardware. Looking forward to the first Android versions :)

Thursday, August 05, 2010

Cleaner CDK Code #7: understand what the code is supposed to do

It has been a while since I posted a blog post in this series (see below for a full list so far), but I was fixing a problem for Nina (OpenTox) and found some code I did not understand. So, here's another useful tip (IMHO) for writing CDK code. Well, in particular, for patching CDK code in this case.

Understand what the code is doing
Understanding what CDK code is doing is not always obvious. Theoretically, the mix of clean code, JavaDoc, and code comments explains enough, but sometimes it does not. Then it is useful to see how the code has evolved: who wrote the code; what was the commit message around that code (further documentation); what did the code look like before that change. This is where git blame comes in (see also how GitHub uses this).

Now, to see the history of a source file, we do, for example:
git blame -L 840 -- src/main/org/openscience/cdk/smiles/

Which will output something like:
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 840)      *
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 841)      * @param interrupted true, if the calculation should be canceled
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 842)      */
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 843)     @TestMethod("testInterruption")
7c5c872c (egonw             2008-02-22 19:13:45 +0000 844)      public void setInterrupted(boolean interrupted) {
7c5c872c (egonw             2008-02-22 19:13:45 +0000 845)              this.interrupted = interrupted;
7c5c872c (egonw             2008-02-22 19:13:45 +0000 846)      }
7c5c872c (egonw             2008-02-22 19:13:45 +0000 847) 
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 848)     @TestMethod("testInterruption")
7c5c872c (egonw             2008-02-22 19:13:45 +0000 849)      public boolean isInterrupted() {
7c5c872c (egonw             2008-02-22 19:13:45 +0000 850)              return this.interrupted;
7c5c872c (egonw             2008-02-22 19:13:45 +0000 851)      }
7c5c872c (egonw             2008-02-22 19:13:45 +0000 852) 
7c5c872c (egonw             2008-02-22 19:13:45 +0000 853) }

The uncommitted parts show the patch I am working on. You see that these methods were last touched by me. So, it is interesting to see why I added those, for which we look up the commit message (using the commit hash in the first column). We use git show --shortstat 7c5c872c:

commit 7c5c872c6be0e6a24a519eb227e7ed0b76ac37d6
Author: egonw 
Date:   Fri Feb 22 19:13:45 2008 +0000

    Merged the branch egonw/maintest: sets up src/main and src/test for splitting main library from unit tests
    git-svn-id: eb4e18e3-b210-0410-a6ab-dec725e4b171

 2340 files changed, 398392 insertions(+), 92370 deletions(-)

We now actually see an important commit in history. This was the commit where the source code was split up into the two folders we have right now.

Dealing with moved files
Normally, we could see what the code was like before this commit by adding the revision hash with a ^ appended, to indicate the last commit before that hash:
git blame -L 840 7c5c872c^ -- src/main/org/openscience/cdk/smiles/

However, this will not work here, because the source code originally was elsewhere. In particular, the code used to be all in src/. So, the original location was src/org/openscience/cdk/smiles/ Therefore, we must use instead:
git blame -L 840 7c5c872c^ -- src/org/openscience/cdk/smiles/

And then we can continue again.

Previous topics

Wednesday, August 04, 2010

Using Bioclipse to upload data to an OpenTox server

As part of a continuing mashup of Bioclipse and OpenTox, I sat down with Nina in Oxford to implement uploading molecules from within Bioclipse with JavaScript to OpenTox servers. This opens the route to calculate QSAR descriptors using the OpenTox API.

As a result, you can (with the code on my laptop) now do (see this BSL script at MyExperiment):

// requires an unspecified Bioclipse development version

ds = opentox.createDataset("");

opentox.addMolecule(ds, cdk.fromSMILES("CCCCC[N+](C)(C)C"))
opentox.addMolecule(ds, cdk.fromSMILES("ClC(I)Br"))


Make sure to check out the other stuff I have been doing with respect to OpenTox.

Is there an Open Specification for structure normalization?

Over at the Blue Obelisk eXchange, I just posted this question:
    Normalization is an important step in many cheminformatics workflows. Picking the right representation for a nitro-group, for example.
    Are there best practices here? Should we initiate an Open Specification for normalization steps that should be performed? This would greatly increase the reproducibility in cheminformatics…
Please post your ideas, comments, etc.
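To make the normalization problem concrete: the nitro group alone can be drawn with a charge-separated or with a hypervalent nitrogen, giving two different SMILES for the same nitrobenzene (strings given here purely for illustration):

```
[O-][N+](=O)c1ccccc1    charge-separated representation
O=N(=O)c1ccccc1         hypervalent-nitrogen representation
```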

Sunday, August 01, 2010


Yesterday I arrived in Oxford, after a 3.5 hour bus transfer from London Stansted. Long, boring ride (though I might have seen a few red kites; but given that they were near extinction, I am wondering what other large bird of prey has a strongly split tail like a swallow). It showed once more that the UK infrastructure has hardly changed since the 19th century. I am enjoying an undergraduate room at one of the colleges. Pretty basic, but it makes me feel more like a human than a tourist. Yes, undergraduate students are human too! One of the advantages is that you get an excellent internet connection :)

Anyways, I am going to the Predictive Toxicology workshop, thanx to the bursary award I received from echeminfo (see Oxford, August 2010: eCheminfo Predictive ADME & Toxicology 2010 Workshop).

This afternoon I walked around a bit, watching all the old buildings. But I guess being here without anyone to share it with, and the fact that it just looks like Cambridge, leaves me not so much impressed. Moreover, it's too busy with tourists and people randomly wearing Oxford University sweatshirts. Small and nice was the Museum of the History of Science, with some nice chemical pieces, like this one:

Buildings like the Radcliffe Camera are nice on the outside, but closed. Seems I have to become a fellow first. This is what it looked like today:

Quite interesting too was the Oxford University Press shop. I'm a sucker for books. Apparently, you can just write a book and publish it. For example, an extensive list of dictionaries on about anything... and since I am writing several book chapters right now, perhaps this is actually an interesting route...

But the question is, of course, how long we will keep reading books... they're the hamburgers of educational material... Kindles and the like will soon drop in price, to some €30. But e-book prices will have to drop too, and I still do not get why an e-book is more expensive than a paperback... (see Amazon, the Kindle edition is more expensive than the paperback??). But then again... they are rich, and I am not.

There was some recent talk about the fact that no one can be Open to the full. You either do Open Data or Open Source, and make a living from the rest. That's where I nicely show I know bollocks about economics. I do BODR, CDK, ... all Open, all for free.

OK. That's a plus for Oxford... it makes you think about things. Perhaps there is something to morphogenetic fields...