Monday, August 30, 2010
Added a Simile Timeline to my homepage
Wednesday, August 25, 2010
Third #acs_boston talk: the Orbital Development Kit
The source code is available under LGPL 2.1 at my GitHub account.
Monday, August 23, 2010
First #acs_boston talk: Linking RDF to cheminformatics and proteochemometrics
Sunday, August 22, 2010
Tuesday, August 17, 2010
#ACS_Boston #acsrdf2010 ? It's all about Open Standards!
Besides the really interesting list of chemical domains covered in these talks (see below), this meeting, though I have not advertised it as such, is really about Open Standards in cheminformatics. This is what I just wrote to the Blue Obelisk mailing list:
Dear ACS-Boston 2010 participants,
with this email I would like to draw your attention to the three Semantic Web sessions on Sunday morning, and Monday morning and afternoon. During these sessions 15 presentations will be given on how the Resource Description Framework is changing cheminformatics.
It is of special interest to the Blue Obelisk community that the family of RDF technologies (RDF, OWL, SPARQL) are all successful Open Standards, building on older and other Open Standards like HTTP, URI, XML (or JSON or Notation3 or Turtle), etc. Moreover, these technologies are not restricted to one particular domain, as MDL/Symyx molfile and CML(XXX) are, because the domain-specific part is covered by the Web Ontology Language (OWL)... each ontology in itself can be (and of course should be) an Open Standard.
In three sessions links to computational chemistry, knowledge management, and general applications are demonstrated, with a wide variety of chemical disciplines as context, including toxicology, chemometrics, eScience, databases, wet lab experimentation, lipidomics, text mining, drug discovery and health care, gene-disease-pathway modeling, and chemical patents.
All people interested in Open Cheminformatics should have a special interest in these three sessions, and I am very much looking forward to seeing you all next Sunday and/or Monday!
The full program can be found at: http://egonw.github.com/acsrdf2010/ (please retweet and share with friends otherwise)
with kind regards,
Egon Willighagen
Saturday, August 14, 2010
Specifying unit test dependencies with JExample
A while ago I asked on StackOverflow about options to specify unit test dependencies. An example from the CDK would be that aromaticity detection in some molecule requires that all atom types are correctly perceived. I could test both in separate unit tests, but with JUnit I cannot have the aromaticity test ignored if the atom typing already failed. Now, being able to specify such dependencies is useful: the failing aromaticity test may be caused by the faulty atom typing, yet on its own it seems to imply that something is wrong with the aromaticity detection algorithm.
One of the answers was JExample. Consider this code:
@RunWith(JExample.class)
public class MoleculeTest {

    @Test
    public MoleculeFactory testNewMoleculeFactory() {
        Assert.fail("too bad");
        return new MoleculeFactory();
    }

    @Given("#testNewMoleculeFactory()")
    public IMolecule testNewMolecule(MoleculeFactory factory) {
        Assert.fail("too bad");
        return factory.getImmutable();
    }
}
Reading this Java code requires some experience with JUnit4, but those who already know JUnit4 will also see that the code is somewhat unusual. First of all, we define a special class to run the tests with: JExample.class. A second thing to note is that the top test method actually returns something. The bottom test method is also different: instead of being annotated with @Test, it is annotated with @Given. This latter annotation is introduced by JExample to define the dependencies between the tests. The parameter of this annotation defines which test method it depends on. Moreover, it is now also clear what happens with the return value of the top test method: it is passed as argument to the second test method.
Just to show that the system actually works, you can find a screenshot of the Eclipse JUnit View (the JExample update site did not work with Eclipse 3.6, so I have the JExample jar defined as build dependency in the project instead):
Clearly, only one test failed, while the testNewMolecule() method was ignored.
Now, I was wondering if the first method could be reused, and if so, if it would construct a new object. And it does, as is clear from the output of this code:
@Test
public MoleculeFactory testNewMoleculeFactory() {
    return new MoleculeFactory();
}

@Given("#testNewMoleculeFactory()")
public IMolecule testNewMolecule(MoleculeFactory factory) {
    System.out.println("Factory1: " + factory);
    return factory.getImmutable();
}

@Given("#testNewMoleculeFactory()")
public IMolecule testNewMolecule2(MoleculeFactory factory) {
    System.out.println("Factory2: " + factory);
    return factory.getImmutable();
}
which outputs:
Factory2: com.github.egonw.odk.model.MoleculeFactory@7bd63e39
Factory1: com.github.egonw.odk.model.MoleculeFactory@42b988a6
This seems pretty useful. I am not sure yet if it solves all my dependency requirements, though. For example, what I would really like to do is link a more complex test to more than one simpler test, so that it is clear that the more complex test will fail if one or more of the simpler tests fail. However, the JExample examples suggest that I can only define one dependency, whereas, for example, aromaticity detection depends on correct atom type perception, but also on ring detection. For now, this is a welcome extension of JUnit.
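To make the wish for more than one dependency concrete, here is a minimal sketch of the idea in plain Java. It is a toy runner and not JExample's API: the class DependentTestSketch, its Check type, and the check names are all hypothetical, and it simply assumes the checks are listed in an order where dependencies come first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: a toy runner, not JExample's API. All names are hypothetical. */
final class DependentTestSketch {

    enum Result { PASSED, FAILED, SKIPPED }

    /** A named check plus the names of the checks it depends on. */
    static final class Check {
        final String name;
        final List<String> dependsOn;
        final Runnable body;

        Check(String name, List<String> dependsOn, Runnable body) {
            this.name = name;
            this.dependsOn = dependsOn;
            this.body = body;
        }
    }

    /**
     * Runs the checks in the given order. A check is skipped unless every
     * dependency has already run and passed (unknown names count as not passed).
     */
    static Map<String, Result> run(List<Check> checks) {
        Map<String, Result> results = new LinkedHashMap<>();
        for (Check check : checks) {
            boolean allDependenciesPassed = true;
            for (String dependency : check.dependsOn) {
                if (results.get(dependency) != Result.PASSED) {
                    allDependenciesPassed = false;
                }
            }
            if (!allDependenciesPassed) {
                results.put(check.name, Result.SKIPPED);
                continue;
            }
            try {
                check.body.run();
                results.put(check.name, Result.PASSED);
            } catch (AssertionError | RuntimeException failure) {
                results.put(check.name, Result.FAILED);
            }
        }
        return results;
    }

    /** Atom typing fails and ring detection passes, so aromaticity is skipped. */
    static Map<String, Result> example() {
        List<Check> checks = new ArrayList<>();
        checks.add(new Check("atomTyping", List.of(),
                () -> { throw new AssertionError("too bad"); }));
        checks.add(new Check("ringDetection", List.of(), () -> { }));
        checks.add(new Check("aromaticity",
                List.of("atomTyping", "ringDetection"), () -> { }));
        return run(checks);
    }

    public static void main(String[] args) {
        example().forEach((name, result) -> System.out.println(name + ": " + result));
    }
}
```

Running main reports atomTyping as FAILED, ringDetection as PASSED, and aromaticity as SKIPPED, which is the behaviour I am after: only the real culprit shows up as a failure.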
Further reading:
A. Kuhn, B. Van Rompaey, L. Hänsenberger, O. Nierstrasz, S. Demeyer, M. Gaelli, & K. Van Leemput (2008). JExample: Exploiting Dependencies Between Tests to Improve Defect Localization. In P. Abrahamsson (Ed.), Extreme Programming and Agile Processes in Software Engineering, 9th International Conference, XP 2008, Lecture Notes in Computer Science, 73-82 (doi:10.1007/978-3-540-68255-4_8)
The Molecular Chemometrics Principles #3: stand on shoulders
Peter's post "#solo10: Green Chain Reaction; where to store the data? DSR? IR? BioTorrent, OKF or ???" gives me enough basis to write up a third principle:
Molecular Chemometrics Principles #3: We make scientific progress if we build on past achievements.
Sounds logical, right? Practically, the way we share our cheminformatics knowledge makes this standing on shoulders pretty difficult. But there is one particular aspect I would like to draw your attention to: you can contribute by making clear what shoulders you would like to stand on. That is, where do you prefer to put your effort, and what message would you like to give to your user community?
In the aforelinked post, Peter asks where he should upload his data, and he suggests BioTorrent (see my review BitTorrents for Science), DSpace, and CKAN. Now, his Green Chain Reaction is picked up (see these few blog posts), and the resulting data should be distributed as much as possible. The exact location does not really matter...
But...
By picking where you upload, you make a statement to your community: "Look guys, we are distributing our data via Foo, because we believe those guys are doing good work! Perhaps you can support them too."
This principle does not only apply to data, it applies to things too. For example, when "iChemLabs and RSC ChemSpider Announce Partnership", they do not just improve the user experience of ChemSpider (which I certainly won't object to), but they also imply "Look dudes, your product is just not good enough and we do not want to help you improve it either". Of course, ChemSpider has every right, and for them to succeed it is crucial to make decisions like this. Fortunately, ChemDoodle is GPL.
Every project with a user base has the opportunity to support shoulders, if only by visibly standing on them. By merely discussing the Green Chain Reaction, I show my support for this social web experiment. You can too. Use these powers wisely. May the McPrinciples be with you.
Thursday, August 12, 2010
The Molecular Chemometrics Principles #2: be clear in what you mean
Now, the first principle was about the need for access to data (McPrinciple #1). This principle goes without saying, one would think, but is not widely accepted yet. This is why Open Data promotion is still needed. For example, data in papers still is not freely redistributable, as Peter points out once again.
Anyway, this post is not about McPrinciple #1, but about the second principle.
Molecular Chemometrics Principles #2: In order to reproduce cheminformatics studies you need to be able to understand the input data.
Readers of my blog will surely recognize this theme. Clearly this theme explains my past fetish for the Chemical Markup Language, and my more recent work on the Resource Description Framework.
And it is so easy to jump to conclusions. Easy to make mistakes. And this is not just at the receiving side; the sending person may have accidentally made a mistake, or left something accidentally unclear, causing incorrect assumptions, and therefore errors in the cheminformatics computation. Now, if the data were semantically (clearly) annotated, and the meaning were clear, it would also be trivial to see when a mistake had sneaked in. Think of it as a check bit.
"Well, isn't this a bit exaggerated," you might say. Perhaps, perhaps not. A simple, recent example. We all know SMILES, right? And we all know that lower case element symbols indicate aromaticity, right? That is, c1ccccc1 is aromatic, right? So, what's the problem then?
Now, consider the SMILES string c1ccc1. Lower case carbon element symbols, so aromatic, right? Oh, wait...
Therefore, be clear in what you mean. It saves us from a lot of trouble.
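One way to read the "Oh, wait..." is via Hückel's 4n+2 rule. The following is a minimal sketch under that assumption, and under the further simplification that every lower case c in these two strings contributes exactly one pi electron. The class HueckelSketch and its methods are made up for this illustration; this is not a SMILES parser and not how the CDK perceives aromaticity.

```java
/** Sketch only: counts ring carbons in a toy SMILES string; not a SMILES parser. */
final class HueckelSketch {

    /** True if the electron count has the form 4n + 2 for some integer n >= 0. */
    static boolean satisfiesHueckel(int piElectrons) {
        return piElectrons >= 2 && (piElectrons - 2) % 4 == 0;
    }

    /** Counts lower case 'c' symbols, assuming one pi electron per such carbon. */
    static int piElectronsOfSimpleCarbonRing(String smiles) {
        int count = 0;
        for (char symbol : smiles.toCharArray()) {
            if (symbol == 'c') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (String smiles : new String[] {"c1ccccc1", "c1ccc1"}) {
            int electrons = piElectronsOfSimpleCarbonRing(smiles);
            System.out.println(smiles + ": " + electrons
                    + " pi electrons, satisfies 4n+2: " + satisfiesHueckel(electrons));
        }
    }
}
```

With this counting, c1ccccc1 has six pi electrons and satisfies the rule, while c1ccc1 has four and does not. So the lower case symbols alone do not tell you what the sender meant.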
Further reading:
- The Molecular Chemometrics Principles #1: access to data
- Molecular Chemometrics, 2006 (doi:10.1080/10408340600969601)
Tuesday, August 10, 2010
XHTML+RDFa: chemical examples
Before I start, I would like to put some emphasis on the following RDFa pattern. An RDF resource that serves as subject is always mapped to an HTML element. This can be a div element, but also other elements, as we will see in the example.
A molecule SMILES
The oldest RDFa example in my blog is from 2006. That was almost two years before the final Recommendation, and is not quite accurate anymore. But here's the correct version:
This example shows how to embed the SMILES string CCO semantically. The outermost span element is used to define the subject of the RDF triple, using the @about attribute to define the URI of the resource: #ethanol. Note that this URI is relative to the URI of the HTML page in which it is embedded. Later we will see an example with a full URI.
But I don't want to hack HTML!
Yeah, fair point. Just make a point with your publisher when you submit a new paper. It is the duty of the publisher and your software vendor to do this right. In 2008 I wrote a small Ubiquity script to automagically convert an InChI into semantified HTML content. But I am not sure this script still works. If you are interested, let me know, and I will revive the Firefox thingy.
And why would I want to do it anyway??
Because software can more easily understand what you mean. This is why Google is now pushing rich snippets. Chemical blogspace understands them too, allowing you to see blog posts about molecules on other webpages. With a simple bit of JavaScript that you can link from your webpages, you can enrich your HTML sites with semantic chemistry yourself. Bioclipse also has no problem with extracting the RDF from HTML. Even Firefox can understand it. Really, there is no end to it.
Of course, why you should do this comes basically down to Molecular Chemometrics Principle #2, but I have not written that one up yet (see also McPrinciple #1).
Reporting problems with molecular representations
More recently, I reported on using RDFa in human-readable log files for computations I am doing (see Scripts logs as HTML+RDFa: mix free text reporting with CSV). That code looks like:
This example uses a div element to host the subject resource. Again, the resource URI is relative to the URI of the document, e.g. this one. We can also note a new attribute, @typeof, which is here used to define the rdf:type of the #200234 resource.
This code snippet does not define the um namespace, which was done elsewhere in the HTML. Moreover, this code snippet does not actually reuse existing ontologies, which is highly recommended. The upcoming RDF symposium in Boston will tell you all about chemical ontologies in the RDF world (see this detailed program, which itself is HTML+RDFa!). But, if you would just overlook the ad hoc namespaces used, you might appreciate the nesting: besides the compound (#200234), a second resource is defined (#error0). In total, this example contains six triples.
Meanwhile, the output simply looks like:
A molecule table
This third, and for now last, example shows several other features. This HTML snippet shows a one-entry molecule table, very much like those molecular spreadsheets in Excel, but then right here in your webbrowser. (Can you imagine what happens if we mash this up with JavaScript molecular viewers? Enjoying the idea already :)
First of all, the rdf.openmolecules.net project is used to construct an absolute URI for the molecule. The table then gives some properties of the molecule: its name (using Dublin Core, though perhaps rdfs:label is better), the boiling point (nicely encoded as t0 in this 1947 paper), two cheminformatics descriptors, and the SMILES, using the same approach as the first example in this post.
The output of this table looks like:
| n-Butane | -0.5 | 10 | 1 | CCCC |
I will shortly blog about the source of the above code snippet, but you are invited to go ahead and check out my GitHub activity (RSS).
Steffen, I think these examples should get you pretty far, but please let me know if you have further questions!
Monday, August 09, 2010
XHTML+RDFa Template
One technical approach to implement the idea of adding semantics to text in HTML pages is RDFa. But as with any technology, scientists seem to have an in-built deficiency of thinking clearly, and anything beyond being able to format your bibliography with the correct bold and italic seems to be a bit much asked.
Anyway, for future reference, this is a basic HTML framework for embedding RDFa:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
version="XHTML+RDFa 1.0" xml:lang="en">
<head>
<title>ACS RDF Symposium, Boston, August 2010</title>
</head>
<body>
</body>
</html>
Perhaps you prefer the raw source.
The Molecular Chemometrics Principles #1: access to data
Indeed, I am not particularly one of those bloggers who spends trees after trees, in great detail explaining what is going on. I do make a lot of use of hyperlinking; much more than the average blogger. I actually assume that readers follow links, to read about the perspective of a blog post. But we all know that scientists do not read the cited papers in a paper they are reading, so who am I to assume blog readers would start doing that with blogs :)
Well, since principles seem popular, it might be a good start of my grand scheme that is behind this blog: the Molecular Chemometrics Principles. Hence, this first post about the why. The why is simply to provide a reference frame to what I am blogging about. In the next few posts on these McPrinciples (is that a catchy name, or what?) that will appear over the next two weeks, I will outline the code of chem-bla-ics. And, moreover, from now on, I will tag all my posts with the reasons why I make that post. I am sure that will not be too helpful for the occasional reader, but for anyone who is serious about chem-bla-ics, this will be a genuine gold mine of data for pattern recognition and data mining otherwise.
So, here goes.
Molecular Chemometrics Principles #1: In order to reproduce cheminformatics studies you need access to the input data.
The reason for this is that statistical modeling very much depends on the data on which modeling was done, patterns were recognized, etc. Therefore, without the input data, it is practically impossible to accurately reproduce results. Fortunately, the acceptance of the importance of access to data (e.g. as Open Data) is slowly gaining momentum in science.
Further reading: Molecular Chemometrics, 2006 (doi:10.1080/10408340600969601)
Cleaner CDK Code #8: the Java Naming Conventions and Camel Casing
names. Classes and interfaces, however, start with upper case characters. By all using the same conventions, we need to learn only one scheme and can more easily recognize what are variables, methods, and classes. Have a look at these naming conventions by Oracle and the concept of Camel Casing, heavily used in Java.
BTW, note that the CDK conventions deviate from the Oracle conventions just linked to with respect to variable names. We have not found a way to have PMD differentiate by where variables are used, and therefore require at least three characters per variable name, to enforce some meaningful naming, which Oracle's conventions require too.
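As a small illustration of these conventions, consider the sketch below. All names in it (NamingConventionSketch, NameChecker, MIN_NAME_LENGTH, isLongEnough) are made up for this example and are not CDK or PMD API; only the minimum length of three characters is taken from the CDK rule just described.

```java
/** Sketch only: hypothetical names that illustrate the casing conventions. */
final class NamingConventionSketch {              // class: starts with upper case

    // Constant: all upper case, words separated by underscores.
    static final int MIN_NAME_LENGTH = 3;

    interface NameChecker {                       // interface: starts with upper case
        boolean isLongEnough(String variableName); // method, parameter: lower camel case
    }

    /** True if the name has at least the three characters the CDK asks for. */
    static boolean isLongEnough(String variableName) {
        return variableName.length() >= MIN_NAME_LENGTH;
    }

    public static void main(String[] args) {
        NameChecker lengthChecker = NamingConventionSketch::isLongEnough;
        for (String candidateName : new String[] {"atomCount", "i"}) {
            System.out.println(candidateName + " long enough: "
                    + lengthChecker.isLongEnough(candidateName));
        }
    }
}
```

Here atomCount passes the length check and i does not, which is exactly the kind of variable name the three-character rule is meant to catch.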
Previous topics
- Cleaner CDK Code #7: understand what the code is supposed to do
- Cleaner CDK Code #6: set the CDKException's cause Exception
- Cleaner CDK Code #5: develop against interfaces
- Cleaner CDK Code #4: inheriting JavaDoc from super classes and interfaces
- Cleaner CDK Code #3: run the PMD tests
- Cleaner CDK Code #2: String.contains() and logger messages
- Cleaner CDK Code #1: List and the for-each loop
Friday, August 06, 2010
Oxford... #2
In the afternoon I walked around a bit more in Oxford, did some more shopping... and visited the Apple shop and played with an iPad. It's indeed a great piece of hardware. Looking forward to the first Android versions :)
Thursday, August 05, 2010
Cleaner CDK Code #7: understand what the code is supposed to do
Understand what the code is doing
Understanding what CDK code is doing is not always easy. Theoretically, the mix of clean code, JavaDoc, and code comments should explain enough, but sometimes it does not. Then, it is useful to see how the code has evolved: who wrote the code; what was the commit message around that code (further documentation); what did the code look like before that change. This is where git blame comes in (see also how GitHub used this).
Now, to see the history of a source file, we do, for example:
git pickaxe -L 840 -- src/main/org/openscience/cdk/smiles/DeduceBondSystemTool.java
Which will output something like:
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 840) *
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 841) * @param interrupted true, if the calculation should be canceled
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 842) */
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 843) @TestMethod("testInterruption")
7c5c872c (egonw 2008-02-22 19:13:45 +0000 844) public void setInterrupted(boolean interrupted) {
7c5c872c (egonw 2008-02-22 19:13:45 +0000 845) this.interrupted = interrupted;
7c5c872c (egonw 2008-02-22 19:13:45 +0000 846) }
7c5c872c (egonw 2008-02-22 19:13:45 +0000 847)
00000000 (Not Committed Yet 2010-08-05 16:35:29 +0100 848) @TestMethod("testInterruption")
7c5c872c (egonw 2008-02-22 19:13:45 +0000 849) public boolean isInterrupted() {
7c5c872c (egonw 2008-02-22 19:13:45 +0000 850) return this.interrupted;
7c5c872c (egonw 2008-02-22 19:13:45 +0000 851) }
7c5c872c (egonw 2008-02-22 19:13:45 +0000 852)
7c5c872c (egonw 2008-02-22 19:13:45 +0000 853) }
The uncommitted parts show a patch I am working on. You see that these methods have last been touched by me. So, it is interesting to see why I added those, for which we look up the commit message (using the commit hash in the first column). We use git show --shortstat 7c5c872c:
commit 7c5c872c6be0e6a24a519eb227e7ed0b76ac37d6
Author: egonw
Date:   Fri Feb 22 19:13:45 2008 +0000
    Merged the branch egonw/maintest: sets up src/main and src/test for splitting main library from unit tests
    git-svn-id: https://cdk.svn.sourceforge.net/svnroot/cdk/trunk/cdk@10219 eb4e18e3-b210-0410-a6ab-dec725e4b171
 2340 files changed, 398392 insertions(+), 92370 deletions(-)
We now actually see an important commit in history. This was the commit where the source code was split up into the two folders we have right now.
Dealing with moved files
Normally, we could see what the code was like before this commit by adding the revision hash followed by a ^, which indicates the last commit before that hash:
git pickaxe -L 840 7c5c872c^ -- src/main/org/openscience/cdk/smiles/DeduceBondSystemTool.java
However, this will not work here, because the source code originally was elsewhere. In particular, the code used to be all in src/. So, the original location was src/org/openscience/cdk/smiles/DeduceBondSystemTool.java. Therefore, we must use instead:
git pickaxe -L 840 7c5c872c^ -- src/org/openscience/cdk/smiles/DeduceBondSystemTool.java
And then we can continue again.
Previous topics
- Cleaner CDK Code #6: set the CDKException's cause Exception
- Cleaner CDK Code #5: develop against interfaces
- Cleaner CDK Code #4: inheriting JavaDoc from super classes and interfaces
- Cleaner CDK Code #3: run the PMD tests
- Cleaner CDK Code #2: String.contains() and logger messages
- Cleaner CDK Code #1: List and the for-each loop
Wednesday, August 04, 2010
Using Bioclipse to upload data to an OpenTox server
As a result, you can (with the code on my laptop) now do (see this BSL script at MyExperiment):
// requires an unspecified Bioclipse development version
ds = opentox.createDataset("http://apps.ideaconsult.net:8080/ambit2/");
opentox.addMolecule(ds, cdk.fromSMILES("CCCCC[N+](C)(C)C"));
opentox.addMolecule(ds, cdk.fromSMILES("ClC(I)Br"));
opentox.deleteDataset(ds);
Make sure to check out the other stuff I have been doing with respect to OpenTox.
Is there an Open Specification for structure normalization?
Normalization is an important step in many cheminformatics workflows: picking the right representation for a nitro group, for example.
Are there best practices here? Should we initiate an Open Specification for normalization steps that should be performed? This would greatly increase the reproducibility in cheminformatics…
Sunday, August 01, 2010
Oxford...
Anyways, going to the Predictive Toxicology workshop, thanx to the bursary award I received from echeminfo (see Oxford, August 2010: eCheminfo Predictive ADME & Toxicology 2010 Workshop).
This afternoon I walked around a bit, watching all the old buildings. But I guess being here without anyone to share it with, and that it just looks like Cambridge, makes me not-so-much impressed. Moreover, it's too busy with tourists and people randomly wearing Oxford University sweatshirts. Small and nice was the Museum of the History of Science, with some nice chemical pieces, like this one:
Buildings like the Radcliffe Camera are nice on the outside, but closed. Seems I have to become a fellow first. This is what it looked like today:
But the question is, of course, how long will we keep reading books... they're the hamburgers of educational material... The Kindle and the like will soon drop in price, and cost some €30. But e-book prices will have to drop too, and I still do not get why an e-book is more expensive than a paperback... (see Amazon, the Kindle edition is more expensive than the paperback??). But then again... they are rich, and I am not.
There was some recent talk about the fact that no one can be Open to the full. You either do Open Data or Open Source, and make a living from the rest. That's where I nicely show I know bollocks about economics. I do BODR, CDK, ... all Open, all for free.
OK. That's a plus for Oxford... it makes you think about things. Perhaps there is something to morphogenetic fields...

