Monday, March 30, 2009

StARlite talks in Uppsala; Helena's Open Chemogenomics thesis

John was in Uppsala last Friday, and our group had the pleasure of talking with him before he acted as opponent at Helena's defense of her thesis, Chemogenomics: Models of Protein-Ligand Interaction Space (ISBN:978-91-554-7430-0). Since we believe we can do tons of really interesting science on John's StARlite data, I was excited to talk to him in person. He gave three talks that day, and managed to keep the overlap minimal (yes, not quite an absolute measure, but you get the point). We showed him the efforts of Arvid, Carl, Jonathan and me on converting the StARlite data to RDF, on which I will write shortly.

BTW, Helena's thesis is partly Open (see Peter's “should theses be Open?”). Partly, because it contains only those parts of the thesis that have not been published in journals before. I find this an interesting intermediate solution. Of course, all the really interesting bits are missing (the peer-reviewed papers), but at least it puts the wrapping material in the Open (introduction, discussion, conclusion). I think I will do the same with my thesis (unless someone funds my papers to become Open Choice).

Monday, March 23, 2009

Highlighting Console output in Eclipse with Grep Console

I ran into an Eclipse Grep Console plugin (EPL license) today that takes regular expressions to color the output in the Console. Given the amount of output Bioclipse and the CDK give when in DEBUG mode, this allows me to highlight those bits I am interested in. For example, I use it to highlight comments on the Bioclipse managers.
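
To give an idea of what such regular-expression rules do, here is a minimal Python sketch that colors console lines with ANSI escape codes. The patterns and the sample log lines are made up for illustration; this is neither the plugin's configuration format nor actual Bioclipse output.

```python
import re

# Hypothetical rules: (regular expression, ANSI color code).
# Illustrative only; not the Grep Console plugin's own format.
RULES = [
    (re.compile(r"\bERROR\b"), "31"),   # red
    (re.compile(r"Manager\b"), "32"),   # green: lines mentioning a manager
]


def colorize(line):
    """Wrap the line in the color of the first matching rule, if any."""
    for pattern, color in RULES:
        if pattern.search(line):
            return f"\033[{color}m{line}\033[0m"
    return line  # no rule matched: leave the line untouched


if __name__ == "__main__":
    sample = [
        "DEBUG CDKManager: loading molecule",  # made-up log lines
        "DEBUG something unrelated",
        "ERROR could not parse file",
    ]
    for line in sample:
        print(colorize(line))
```

The first matching rule wins, so the order of the rules decides what happens to a line that matches more than one expression.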

Sunday, March 22, 2009

Journal of Cheminformatics: I hope the Instructions to the Authors improve

Besides Nature Chemistry, another journal was launched last week (see here and here): the Journal of Cheminformatics. First of all, congratulations to Chris and David for their efforts! While the journal has published only one research paper so far, it has already found its place on Chemical blogspace. I have two things I want to blog about: data rich publishing, and starting the scientific communication.

Data Rich Publishing
Peter had a detailed blog about why he joined the editorial board:
    I take this position with some trepidation as I have grave reservations about the current practice of cheminformatics. It suffers from closed data, closed source and closed standards, and thereby generally poor experimental design, poor metrics and almost always irreproducible results and conclusions which are based on subjective opinions.
I strongly agree with this observation, and have discussed my view on this in my thesis (send me an email if you want a copy).

So, what has the journal to say about this (see Instructions to the Author, emphasis mine):
    Journal of Cheminformatics recommends, but does not require, that the source code of the software should be made available under a suitable open-source license that will entitle other researchers to further develop and extend the software if they wish to do so.
Regarding data, they are even less revolutionary; the recommended figure formats (EPS, PDF, PNG) focus on nice graphics instead of reuse of data. I also note that I cannot upload data in the Open Document Format, or, hey, let's really push things, in RDF. Well, not according to the Instructions. And surely, I can put the [O|R]DF in the supplementary information, anyway. It would also be nice if I could use Jmol as an applet to enrich the graphics, and improve data reusability of the paper, like the RSC recently started to allow.

Regarding the supplementary information, there is a section on additional files, which, inconveniently, are capped at 20 MB in size. No mention of chemical formats at all, nor any recommendation on semantic formats like CML (I wonder when this was discussed with the Editorial Board, and where Peter was at the time). How am I going to put my CML file with 500 molecular structures online now? (Though it's good to know it is virus scanned ;)

So, why do I vent my concerns about these limitations? I had not blogged about the launch of the journal earlier, because I had not made up my mind about it. On the one hand, I am happy to see a journal that promotes (scientific) use of papers, and a journal that allows me to keep copyright on the material. On the other hand, going by the current Instructions, the data I could use from the papers is available only in an old-fashioned way. That's a lost opportunity, and doing better could have killed the competition for sure. Instead, the unique selling point is now restricted to using an open access license. Nature Chemistry, on the other hand, chose data rich publishing as a selling point (though in competition with things done at the RSC).

The other thing I want to mention about the journal is the following. Rajarshi blogged about Bachrach's paper on Chemistry publication - making the revolution (DOI:10.1186/1758-2946-1-2). Firstly, by adding a link like that for the DOI I just gave, Chemical blogspace can pick it up; we need this later. Secondly, the paper actually suggests that "[b]y publishing lots of data, available for ready re-use by all scientists, we can radically change the way science is communicated and ultimately performed"; this is in strong contrast to what I have seen in the Instructions so far.

Starting the Scientific Communication
Rich replied to Rajarshi about the requirement to log in before someone could make a comment, which he did not like. He suggested alternative ways to prevent spam and the like. The choice for this commenting approach may also originate from wanting an Open discussion, where everyone takes responsibility for what he says. The use of OpenID, as Rich suggests, would only partially address that; on the other hand, setting up a fake email address is quite common in the blogosphere too.

If Rajarshi had used the DOI to link to Steven's paper, as said, Chemical blogspace would have recognized it. Instead, he chose to link directly to the PDF. This is a typical case of hamburgers in action. However, others did use the DOI when they discussed the first research paper in the journal (DOI:10.1186/1758-2946-1-3). These blogs were picked up by Cb and are listed on this page.

Now, I only need to remind you of Userscripts for the Life Sciences (DOI:10.1186/1471-2105-8-487) to show that we have the methods to link these comments back to the journal website. The Quotes from Chemical Blogspace and Postgenomic script in particular does the hard work (it needs GreaseMonkey; the script can be downloaded here; see also Noel's original post). This way, we can read the comments when we visit the paper's homepage:

Now, the script has not yet been updated for the new journal (Noel, can you please upload the revision?), so you need to edit the source right now and add http://** to the list of websites the script acts on:

Friday, March 20, 2009

Preferential positions of phosphate counter ions

A long time ago ('96 or so?), as a student with the no longer existing CAOS/CAMM (Google shows some traces, like this chapter describing the centre), I did a short internship with Hilbert Bruijn-Slot (I hope I remember his name correctly), where he asked me to look at data in the CSD, and in particular the preferred position of phosphate counter ions. It was a fun research project, and it almost made it into a paper, had we not been beaten by a few months by a group of Russians who had just published the same.

Today, Neil asked me to look at another Nature Chemistry paper (DOI:10.1038/nchem.100), and in particular its Chemical Compounds table. I could not directly spot the thing not in the table I discussed, but did notice the phosphate salts in the table. Not uncommonly, the counter ions are not near the phosphate in this diagram and I wondered how they did this in 3D.

Well, bringing back good memories of that internship I mentioned, the 3D model shown by Jmol actually does show the salt, and with the two sodiums near the phosphate; even better, they sit at very recognisable positions :)

Thursday, March 19, 2009

Nature Chemistry improves publishing chemistry: a detailed analysis

Nature Chemistry just released the first issue with a few free papers, like Asymmetric total syntheses of (+)- and (-)-versicolamide B and biosynthetic implications by Miller et al. (DOI:10.1038/nchem.110).

Now, we've seen the Royal Society of Chemistry's Project Prospect (see RSC: the first publisher to go semantic!) and ChemSpider's recent ChemMantis system, which enrich papers with machine readable representations of the molecules discussed in those papers. The new Nature publication has been in the works for a while, and they asked the community before what a Nature Chemistry paper should look like, and I replied in Re: What should a Nature Chemistry paper look like?.

The verdict
So, have they been listening? Is the HTML they produce semantic? Is it data rich? Or is it just another hamburger? Well, I am very happy to see some of the suggestions I made picked up (though I do not fool myself into believing I am the only one who suggested those features). A tour of good things, and points for improvement.

The first impression is not shocking; it looks like any other interface, with molecules drawn as images in the paper:

All structures that are numbered and linked (as in C6-epi-stephacidin A (Compound 13)) have a hover-over function to pop up a drawing of the structure:

The popup image is a nice gimmick, but not really semantically useful. The link, however, is! It points to a separate supplementary page with further information, which includes an image of the 2D structure and, following a link, the 3D structure in Jmol. Moreover, it comes with the machine readable representations:

This is indeed interesting, and a big step forward, though please do note my comments later. For convenience, all molecules with such supplementary information are available from the special Chemical Compounds section of the paper:

Excellent! This really is a step forward towards a data-rich paper! Indeed, I will shortly write up a Bioclipse plugin for Nature Chemistry, which will download all molecular structures based on the DOI! Anyway, more on that later... For this article, that table looks like:

By now, you likely also noted the links to PubChem, and indeed, upon publication of a paper, all structures are deposited in the public domain:

Last but not least, each molecule is available in the Chemical Markup Language (with 2D coordinates)! And you know I have been a very happy CML user for a long time (see e.g. Peter's recent blog Egon Willighagen and CML). BTW, one comment on the CML: the namespace used is the outdated namespace, not the current one (see There can be only one (namespace)). (But the CDK and Bioclipse will read it anyway.)

Details matter
So, while the first impression was not shocking, it was a bit deceptive. Nature Chemistry really changes publishing of chemistry. But I have bad news too. They need to improve the HTML they produce.

But before pointing out some missed chances, let me reply inter alia to Peter's recent work on the Open Source plugin for including semantic chemistry in MS-Word documents (see How can we publish semantic chemical documents?): Nature Chemistry seems to have done a great job with existing tools. Nevertheless, I fully back up Peter's comment that while the plugin is useless without Word, the results produced with the plugin are extremely Open Standard, and enormously reusable! Indeed, while the Word file format is only formally a true Open Standard, the file format is plain XML, and extracting content bearing the CML namespace is trivial.
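
To illustrate how little is needed for namespace-based extraction, here is a minimal Python sketch using only the standard library. The host document is made up (it is not the real Word schema), and I assume http://www.xml-cml.org/schema as the CML namespace; adjust it if your files use another one.

```python
import xml.etree.ElementTree as ET

# Assumption: the CML namespace used by the embedded chemistry.
CML_NS = "http://www.xml-cml.org/schema"

# A made-up host document: some arbitrary XML vocabulary with embedded CML.
DOCUMENT = """\
<doc xmlns="urn:example:host-format">
  <para>Some text about a molecule.</para>
  <customXml>
    <molecule xmlns="http://www.xml-cml.org/schema" id="m1">
      <atomArray>
        <atom id="a1" elementType="C"/>
        <atom id="a2" elementType="O"/>
      </atomArray>
    </molecule>
  </customXml>
</doc>
"""


def cml_molecules(xml_text):
    """Return all CML <molecule> elements, however deeply they are nested."""
    root = ET.fromstring(xml_text)
    return list(root.iter(f"{{{CML_NS}}}molecule"))


def element_types(molecule):
    """List the elementType attribute of every CML atom in a molecule."""
    return [atom.get("elementType")
            for atom in molecule.iter(f"{{{CML_NS}}}atom")]


if __name__ == "__main__":
    for mol in cml_molecules(DOCUMENT):
        print(mol.get("id"), element_types(mol))
```

Because the lookup is by qualified name, the surrounding vocabulary does not matter: anything outside the CML namespace is simply ignored.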

Which reminds me, if someone from the Nature Chemistry team is reading this, please point me to a blog post describing which tools are actually involved in publishing a Nature Chemistry paper! I think we would all like to know.

Now, the HTML has room for improvement. First of all, a look at the metadata defined for the web page of the article shows a description and keywords about the journal, not the article, and the same goes for the web pages for the molecules:

Additionally, the compound details web page has no special markup for the machine readable information:

Or, if it does, it's still mixed with markup for visually pleasing output:

Still, the HTML is clean enough to have some regular expressions extract a good deal of information, and there is also still the PubChem deposition.

Beyond connection tables
Like many other chemistry journals, Nature Chemistry does not consider properties of the molecule interesting, and NMR spectra are hidden in the Supplementary Information. This paper in particular disregards a lot of machine readable facts by putting all experimental section bits in a PDF document. So, the next challenge for Nature Chemistry will be to get the authors of papers to contribute the original spectra (JCAMP-DX, CMLSpect, etc.) in the supplementary information section. Better still, have the raw data or even the NMR peak-atom annotations deposited in public repositories (see Open NMR data: raw curves and annotated peak lists).

All in all, I am rather positive about the first Nature Chemistry issue, and would like to thank the editors and paper authors for their efforts on improving the publishing of chemistry!

Wednesday, March 18, 2009

NMRShiftDB enters

This morning I finished setting up an RDF interface to the NMRShiftDB data (see nmr:234):

And I made links from the new frontend, making the Linked Open Chemistry Data (LOCD) network grow (naming following Linked Open Drug Data). In comparison with the previous depiction, I added arrows to indicate the direction of the linking. Green nodes still indicate sources with an RDF interface; therefore, the LOCD network really consists only of those green nodes:

The link with DBPedia is discussed in DBPedia enters. The source code for the NMRShiftDB-RDF frontend can be found at GitHub.

Saturday, March 14, 2009

Autogenerating CML bindings for XMPP services with XMLBeans

I blogged earlier about our efforts to create a better SOAP service architecture, based on XMPP. So, I set up XMPP services for QSAR descriptor calculation, 2D diagram and 3D geometry calculations, and a few more, using the CDK. Chemical Markup Language has been my primary choice for some 10 years now (see Peter's blog) as it allows me to do things I cannot do in other formats.

Now, our XMPP services themselves publish which data types they allow as input and what they output in return. They do this by publishing XML Schema to describe the input and output types. My CDK services use CML, so they return the CML schema. Johannes' xws4j implementation of the IO-DATA specification has an add-on that can build bindings to the schema on the fly. Now, CML comes with a good XOM-based binding (called CMLXOM) so this is not strictly necessary, but for less common schemata it is worthwhile: you can always create bindings for brand new schemata, for older versions, for whatever. Services can even create their own local schemata, and people will still be able to easily use them. This is to me a big plus for this architecture.

Anyway, while CMLXOM exists, we wanted to show that the on-the-fly creation of bindings works, even for large schemata, such as CML. However, one of the older flavours had a small error in a regular expression in a data type CML defines. Johannes therefore asked me to test building bindings for the CML schema version used in my services. He advised me to use scomp for this, which is a command line utility around the XMLBeans library used for the binding generation.

As I am running Ubuntu, I preferred installing the packaged version instead of installing the binary provided by XMLBeans. Now, after I did this, I noticed that this .deb did not install the scomp utility, so I filed a wishlist bug report. Earlier this week I already encountered another bug, but this package being Java, I had a good idea on how to fix the bug.

And so I implemented my own wishlist. I'm sure there is room for improvement, as my .deb packaging skills are a bit rusty (a very long time ago I was in the Debian New Maintainers queue, but by the time they solved the long queue delays, I was too occupied with other things. Yes, this was a long time ago already :). Anyway, Ubuntu's LaunchPad has a nice feature, called the Personal Package Archives. After I have finished hacking on the packaging specs in the famous debian/ folder and tested the .debs built from it, this service will rebuild the package and put the result up for download.

Conclusion: a perfect opportunity to finally give this a try. The learning curve was surprisingly shallow, and the result can be seen in my personal package archive:

Now, you can easily imagine that I will soon work on packaging stuff I did in the past too, such as updating libcdk-java and, now that OpenJDK in main can run Jmol reasonably, finally packaging Jmol for main. I just hope I remember my Alioth account, so that I can properly contribute to the debichem project.

Getting back to running scomp on the CML schema, it works with one minor problem:
$ scomp -src . -d .  cml.xsd
/home/egonw/tmp/cml/cml.xsd:10098:9: warning: p-props-correct.2.2: maxOccurs must be greater than or equal to 1.
Time to build schema type system: 1.792 seconds
Time to generate code: 3.297 seconds
Time to compile code: 9.658 seconds
The problem is reflected by line 10098 which goes like:
<xsd:sequence minOccurs="0" maxOccurs="0">
which can be traced down to line 23 in schema2/trunk/elements/tableHeaderCell.xsd. I filed a bug report about this.
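
For those who want to check a schema for this kind of problem without running scomp, here is a minimal Python sketch that lists every schema particle with maxOccurs="0". The schema fragment below is made up for the example (the real cml.xsd is far larger), and the helper name is my own.

```python
import xml.etree.ElementTree as ET

# A tiny made-up schema fragment that reproduces the reported pattern.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="tableHeaderCell">
    <xsd:complexType>
      <xsd:sequence minOccurs="0" maxOccurs="0"/>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="other">
    <xsd:complexType>
      <xsd:sequence minOccurs="0" maxOccurs="unbounded"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
"""


def zero_max_occurs(xsd_text):
    """Return (local tag name, nearest named ancestor) for each maxOccurs="0"."""
    hits = []

    def walk(node, owner):
        # Remember the closest enclosing node that carries a name attribute.
        owner = node.get("name", owner)
        if node.get("maxOccurs") == "0":
            local_name = node.tag.split("}", 1)[1]  # strip the {namespace} part
            hits.append((local_name, owner))
        for child in node:
            walk(child, owner)

    walk(ET.fromstring(xsd_text), None)
    return hits


if __name__ == "__main__":
    for tag, owner in zero_max_occurs(SCHEMA):
        print(f'{tag} with maxOccurs="0" inside {owner}')
```

Reporting the nearest named ancestor makes it easier to find the offending definition when the schema is assembled from many smaller files.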

Thursday, March 12, 2009

Bioclipse: a powerful Jmol application

While Bioclipse is much more, it could be an interesting alternative to the Jmol application. It offers:
  • a scripting console
  • a file browser (the Eclipse way)
  • an outline of the file content which allows selections
  • a script editor
The underlying RCP toolkit has many other interesting features for a Jmol application, but the above is up and running:

Wednesday, March 04, 2009

Open NMR data: raw curves and annotated peak lists

Games are known to trigger technical innovation. But recently they also triggered innovation on open chemical databases. Jean-Claude reported:
    We are very excited by what we have put together so far. There are currently 457 H NMR, 389 C NMR, 11 IR and 29 NIR spectra. This is only possible because of people who submitted their spectra to ChemSpider as Open Data - please keep uploading!
Now, the NMRShiftDB also hosts quite a number of NMR spectra, and I have a hobby of submitting spectra, particularly for rare nuclei. In particular, I think it is fun to have as many structures as possible which have spectra for all the nuclei in that structure. Benzene is a simple example for which NMR spectra are available for all nuclei (see this entry).

Now, the main difference between the NMRShiftDB and ChemSpider spectral data is that the former are annotated peak lists (each shift is assigned to an atom), and the latter are full, but unannotated, spectral curves. So, there are quite a few things you could do here. For example, see which structures with NMR curves are not yet annotated in NMRShiftDB. Antony pointed me to this page, which is an overview of all spectral data in ChemSpider, but that page is difficult to process by machine. Partly, because it is a mix of Open and Proprietary data, and partly because it uses JavaScript to navigate the table. (BTW, RDF interfaces to both resources would be much more helpful, and simply allow me to query all molecules which have a spectrum which is Open, and which is not found in the NMRShiftDB. I am working on an RDF interface to NMRShiftDB.)

Antony also asked me to advertise the option to upload Open spectral curves to ChemSpider. So, hereby. However, I really do hope ChemSpider will make it easier for others to reuse all the Open Data, as having to machine-browse the linked HTML interface is a waste of ChemSpider computing resources.

Update: the game is now available from

Tuesday, March 03, 2009

Open Data versus Capitalism?

Ian Davis was recently quoted saying open data is more important than open source, which was pulled (out of context) from this presentation. The context was (a slide earlier): Data outlasts code.

As far as I can see, this is utter nonsense, even within the context of the slide (see also this discussion on FriendFeed). Obviously, within the context of Ian it does make sense, and I hope he will respond in his blog and explain why he thinks Open Data is more special.

Without code, you have no way of accessing the data. Ask anyone to recover from a hard disk failure. In ODOSOS (Open Standards, Open Data, Open Source) they are all equal. You need them all for progress. You cannot single out one as being more important than another. Why would you anyway? Politics is all I can think of... All three combine and ensure our science is more efficient.

Fishy Perspective (what's in a name) comments on this in Data Vendetta, and I will take one quote out of context:
    Organizations are spending lot of money do generate proprietary data to safeguard its competitive edge, why you are convinced that they need to disclose that, no one is here for charity. Most the companies have their proprietary data policies, and they release the data in public only when there is sufficient overlap from publicly available databases.
Open Data versus Capitalism?
Companies are about money making, and there is nothing wrong with that. Others work to make the world a better place.

If Rosalind had not shared her data (following Data Vendetta, and not going into whether she did willingly or knowingly), all current pharmaceutical research would have been delayed by half a year(?), more(?)... who knows. Even that half year would have meant quite a lot of dead people. A lot of medicine would not have been discovered or have hit the market at the same time. Capitalism is one thing, not good, not bad, orthogonal really. Capitalism as ideology does not contradict Open Data. But sharing knowledge as Open Data always has a positive effect on mankind.

If you want to make money, please do, as much as you can. But please pick carefully what you want to make money on. Be creative! Do some innovation! Be bold! Go where no one has gone before!

Sunday, March 01, 2009

Solubility Data in Bioclipse #4: Finding ChEBI IDs (Again, but better)

Those who carefully analyzed the second SPARQL query in Solubility Data in Bioclipse #3: Finding ChEBI IDs will have noticed the use of owl:sameAs. Those who did not, here's the query again:
PREFIX owl: <>
PREFIX ons: <>
PREFIX rdfonm: <>
PREFIX dc: <>

?solvent a ons:Solvent .
?solvent dc:title ?title .
?solvent owl:sameAs ?same .
?same rdfonm:chebiid ?chebi
This syntax is a bit clumsy, considering we said ?solvent and ?same are the same thing. Fortunately, there are tools that do take this into account. One such tool for Jena (which I use in Bioclipse) is Pellet. I just committed code for Bioclipse to use Pellet, which simplifies the above query to:
PREFIX ons: <>
PREFIX rdfonm: <>
PREFIX dc: <>

?solvent a ons:Solvent .
?solvent dc:title ?title .
?solvent rdfonm:chebiid ?chebi
The key thing here to understand, and I know this is rather abstract, is that the RDF document we build for the ONS Solubility data does not define the relation between Solvent and ChEBI identifiers, but using RDF we know this to be true. Only because the system now understands the owl:sameAs relation.
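
To make this a bit less abstract, here is a toy Python sketch of the effect of owl:sameAs reasoning on a handful of triples. It is not how Pellet works internally, and the resources, the title and the ChEBI value are made-up placeholders; it only shows why the simplified query finds an answer once the sameAs relation is taken into account.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"

# Made-up triples with hypothetical identifiers (not the real ONS data).
TRIPLES = {
    ("ex:methanol", "rdf:type", "ons:Solvent"),
    ("ex:methanol", "dc:title", "methanol"),
    ("ex:methanol", SAME_AS, "other:methanol"),
    ("other:methanol", "rdfonm:chebiid", "CHEBI:placeholder"),
}


def same_as_closure(triples):
    """Return the triples plus the statements implied by owl:sameAs."""
    parent = {}

    def find(x):
        # Union-find lookup: sameAs is symmetric and transitive.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)

    def equivalents(x):
        return groups[find(x)] if x in parent else {x}

    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        # Copy every statement to all resources declared to be the same.
        for s2 in equivalents(s):
            for o2 in equivalents(o):
                inferred.add((s2, p, o2))
    return inferred


def solvents_with_chebi(triples):
    """Mimic the simplified query: type, title and ChEBI id on one subject."""
    rows = []
    for s, p, o in triples:
        if (p, o) == ("rdf:type", "ons:Solvent"):
            titles = [t for (s2, p2, t) in triples
                      if s2 == s and p2 == "dc:title"]
            chebis = [c for (s2, p2, c) in triples
                      if s2 == s and p2 == "rdfonm:chebiid"]
            rows += [(s, t, c) for t in titles for c in chebis]
    return sorted(rows)


if __name__ == "__main__":
    print(solvents_with_chebi(TRIPLES))                   # no match yet
    print(solvents_with_chebi(same_as_closure(TRIPLES)))  # match after inference
```

Without the closure the solvent and the ChEBI identifier sit on two different resources, so the query matches nothing; after the closure both resources carry all statements, and the query no longer needs the explicit owl:sameAs step.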

Now, Pellet does not stop there, and there are many more statements we can make. Even better, anyone can plug in such relations. Any database can define owl:sameAs and other relations, so that we can transparently browse the internet for chemistry in a semantically meaningful way.

I also know that the above is rather technical. For those chemists who have not stopped reading yet, what I would like to hear from you is what data you would like to see linked. It does not really matter what, because we can do it all (given Open Data).