Saturday, November 27, 2010

Uppsala Status Report

As you know, my post-doc in Uppsala ended. It was a good time, and it was great collaborating on Bioclipse with Ola, Jonathan, Arvid, and Carl. I would have loved tighter integration with the work of Maris and Martin, but that was limited to one joint paper (in press). I thank Professors Jarl Wikberg and Eva Brittebo for allowing me to continue my research at their department, and hope this is not the end of the collaboration yet.

As with the new year, the end of a contract is a good time to reflect on one's accomplishments. It's been a bit delayed, but as you know, I am already in my next project in Cambridge, and will start in January with a yet longer-term position in predictive toxicology (more on that soon). This makes this a really crowded period, on top of birthdays, Sinterklaas, x-mas, and the like.

My Research
As you might know, my research interest lies in understanding molecular properties and their applications in larger molecular systems. This can be how small molecules pack in crystals, finding patterns in properties (QSAR-like work), etc. Because the underlying methods are useful in many domains, you see applications in various fields too, including drug discovery, metabolomics, etc. These methods involve statistics and cheminformatics, primarily, which is clear from my publications on method development in chemometrics and cheminformatics. You will also have seen that visualization is a very important tool here, as our numerical validation can easily mislead even a trained scientist.

How Uppsala fits in
About 30 months ago, I got an offer to join the Bioclipse team to work on the cheminformatics features of the workbench. It was already using the CDK, so the project was a tight match with what I did in the past. Additionally, there were plans to integrate R, and while the latter is partially implemented, that part was unfortunately not completed by the group yet. I believe this is a crucial aspect, and without it the large-scale impact of Bioclipse will be severely reduced.

Bioclipse is positioned as a workbench to use third-party libraries, web services, databases, etc, and has done so very successfully (doi:10.1186/1471-2105-8-59). It speaks many Open Standards, and already incorporates various important Open Source libraries for life sciences research, including the aforementioned CDK (doi:10.2174/138161206777585274, doi:10.1021/ci025584y), but also Jmol, JChemPaint, BioJava (doi:10.1093/bioinformatics/btn397), and others. Using these libraries it has rich visualization means for life sciences data, including molecules and protein sequences. The latter, of course, is directly related to the proteochemometrics research done in Wikberg's group. Recently, Bioclipse adopted scripting functionality, making it a perfect tool to share life sciences computation, just like Taverna (doi:10.1093/bioinformatics/bth361) and KNIME.

While I had hoped to do some research in proteochemometrics, events led me into different areas, which I explained in Why you have not heard me much about chemometrics recently....

But, Bioclipse provides me with the tools I need to take molecular chemometrics (doi:10.1080/10408340600969601) forward.

So, what has this resulted in, besides a number of unsuccessful grant applications? We're still counting, but two book chapters, a book on pharmaceutical bioinformatics, one proceedings paper, five research papers, seven oral presentations at international meetings, and an ACS conference on RDF in chemistry. Oh, and tons of Open Source code, of course. (I'm at the edge of collapsing; I did that as a student, lost a year, but learned a lot about myself ... these results I have worked very hard for; I am not a miracle worker. And I have to disappoint people occasionally, as things do not work out the way I expected them to. My apologies for that.)

I will not describe all in detail now, but focus on a few things around what I made my research in Uppsala: semantic cheminformatics, which I believe to be a key concept of where cheminformatics must be going. The first paper resulted from a collaboration with Johannes, a medical researcher at the Ludwig-Maximilians-Universität in Germany (full reference at the end of the post). This work provides an alternative to SOAP, one with a better solution to asynchronous computing than the polling approaches now commonly used. An XMPP-based service just reports back when it is done, so that you do not have to ask all the time. Makes sense to me. We made the platform available to Bioclipse and Taverna, and demonstrated the technology with applications in life sciences, including (QSAR) descriptor calculation, and susceptibility for seven known HIV protease inhibitors.
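To make the difference between polling and "report back when done" concrete, here is a minimal sketch in Python. It does not use XMPP or SOAP at all: plain threads stand in for the remote service, and all names (slow_job, PollingService, NotifyingService) are hypothetical and made up for this illustration.

```python
import threading
import time


def slow_job(x):
    """Stand-in for a long-running remote computation (hypothetical)."""
    time.sleep(0.05)
    return x * x


class PollingService:
    """The client has to keep asking whether the job has finished."""

    def __init__(self):
        self._results = {}

    def submit(self, job_id, x):
        def run():
            self._results[job_id] = slow_job(x)
        threading.Thread(target=run).start()

    def is_done(self, job_id):
        return job_id in self._results

    def result(self, job_id):
        return self._results[job_id]


class NotifyingService:
    """The service calls the client back once, when the job has finished."""

    def submit(self, x, on_done):
        def run():
            on_done(slow_job(x))
        threading.Thread(target=run).start()


def run_polling(x):
    service = PollingService()
    service.submit("job-1", x)
    polls = 0
    while not service.is_done("job-1"):
        polls += 1          # every iteration is a wasted "are you done yet?"
        time.sleep(0.01)
    return service.result("job-1"), polls


def run_notified(x):
    done = threading.Event()
    box = {}

    def on_done(value):
        box["value"] = value
        done.set()

    NotifyingService().submit(x, on_done)
    done.wait()             # no repeated requests; we are told when it is ready
    return box["value"]
```

Both run_polling(7) and run_notified(7) end up with the same result; the polling variant just spends a number of extra round trips getting there, which is the overhead the asynchronous approach avoids.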

This work stresses that if we really want to, we can significantly improve scientific computing. It's very much like what Peter concluded this week: "None of this is rocket science - it’s purely a question of will". This is what I have been trying to show in the past few years. The disuse of accurate scientific computing is a deliberate choice. Making your cheminformatics research irreproducible is a choice, and a bad one too. There can be acceptable reasons, but the choice would be bad nevertheless (I hope that distinction is clear: you can have valid reasons to do something intrinsically wrong. You will be forgiven, and be encouraged to change your behavior.) Many people in the Blue Obelisk community are laying out the foundations and showcases, hoping to make it easier for others to change behavior. I think we have been quite successful there.

Anyways... on to the second paper. As said, Bioclipse is the platform that can bring these new cheminformatics methods to the desktop. The new and improved Bioclipse 2 (see citation below) adds one important new feature: scripting. My work in this paper focused on making sure the cheminformatics library was properly integrated, continued development of JChemPaint (yet unpublished, and in collaboration with the EBI, but see for example this blog post), and helping Ola and others to properly use the CDK in their applications (MetaPrint2D (doi:10.1186/1471-2105-11-362), etc.). The impact of this work goes far beyond the papers on which I am author, though not every reviewer will understand that, unfortunately. This work is really the plumbing: it's the development of the measuring machines to do the job, the development of an STM device to actually get going.

The third paper, also listed at the end, is about defining a standard for detailed exchange of QSAR data. It defines what information is needed to reproduce a set of QSAR descriptors, including the input, and using a descriptor ontology which we published about before (doi:10.1021/ci050400b). This project can be the seed of a public repository of QSAR data, where it will be clear what is meant, and how the data can be used. If you are interested in setting up such a public repository, please contact me or Ola.

That leaves me with the work that I have initiated in the group: the use of RDF technologies (I do hope all VR reviewers are listening). RDF provides a lingua franca for data exchange in life sciences, and the meaning of words is provided by sharing dictionaries (ontologies). Bioclipse has been extended to speak RDF, and we developed various applications based on it. A proceedings paper previews the effort, while the paper is in print in the new Open Access Journal of Biomedical Semantics. Of course, you can also read much about this topic in this blog.
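As a rough illustration of why a shared vocabulary helps, here is a small Python sketch that treats RDF statements as plain (subject, predicate, object) tuples. This is not how Bioclipse implements it, and every URI and value below is a hypothetical placeholder under example.org, invented for the example.

```python
# All URIs and values are made-up placeholders, not real data.
EX = "http://example.org/"

# Statements from two independent sources that use the same ontology terms.
lab_a = {
    (EX + "mol/1", EX + "onto/label", "compound A"),
    (EX + "mol/1", EX + "onto/descriptorValue", "1.2"),
}
lab_b = {
    (EX + "mol/1", EX + "onto/testedAgainst", EX + "target/X"),
    (EX + "mol/2", EX + "onto/label", "compound B"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }


# Because both sources share identifiers and predicates, merging is a set union.
merged = lab_a | lab_b

about_mol1 = match(merged, subject=EX + "mol/1")
labels = match(merged, predicate=EX + "onto/label")
```

The point of the sketch is the merge step: no format conversion or schema mapping is needed, because the meaning of each predicate is fixed by the shared ontology rather than by the file layout.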

RDF is going to change bio- and cheminformatics in ways that XML has been unable to. Various papers are currently in preparation to provide detailed use cases and related research. I am very excited about this technology, which further improves interoperability and reproducibility in cheminformatics. Should you care about that? Yes, because by using these good practices, research will be easier to interpret, conclusions easier to judge, and as such, we can focus on the underlying chemistry in much more detail, instead of looking at noise, as much of the current cheminformatics literature is doing. (Ouch, that's a bold statement indeed. True? Well, without reproducibility it is hard to tell. Let's all work towards less magic, less black box, and more science in this field; we will all benefit from that. Who knows, we might even convince the bench chemist that we are doing something right ;)

So, where is the understanding of underlying patterns, you may wonder? That is a fair question, and I have no problem admitting that after my PhD that part has been underrepresented. That will change soon enough, though. Now I can only hope it is in time to get me a Nature or Science paper, required to get tenure (see this discussion).

That's not all I did. I have not discussed the book chapters, the book, the other publications to which I contributed in various ways (doi:10.1186/1471-2105-11-159, doi:10.1093/bioinformatics/btq476). That will come in a more detailed report later.

Finally, I would like to thank Uppsala University for the KoF 07 grant which funded my work in Uppsala.

Wagener, J., Spjuth, O., Willighagen, E., & Wikberg, J. (2009). XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services. BMC Bioinformatics, 10 (1). DOI: 10.1186/1471-2105-10-279

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E., Steinbeck, C., & Wikberg, J. (2009). Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics, 10 (1). DOI: 10.1186/1471-2105-10-397

Spjuth, O., Willighagen, E., Guha, R., Eklund, M., & Wikberg, J. (2010). Towards interoperable and reproducible QSAR analyses: Exchange of datasets. Journal of Cheminformatics, 2 (1). DOI: 10.1186/1758-2946-2-5


  1. Hi Egon, thanks a lot for your blog updates - as always very much worth reading.

    I have a question regarding your paper 3 - "defining a standard for detailed exchange of QSAR data": The group of Igor Tetko just set up something very related in Munich, the OCHEM environment, which allows for storing bioactivity (and other) data as well as applying various modeling methods to the datasets.

    Would that fulfill your criteria on exchanging sufficient information or would more be needed here? I would like to disclose for completeness that I am also co-authoring a (hopefully) upcoming publication on the environment with them; in this sense I am even more interested in whether OCHEM would need to be expanded in your opinion when it comes to the data availability part. (As Igor said ... OCHEM is never finished, it's always work in progress... so probably even he would be open if changes were required here.)

    It would be great if you find the time to comment on this - thanks a lot! All the best, Andreas

  2. Hi Andreas,

    it might. The second thing that happens on the front page is that it asks me to create an account. The demo account also asks me to agree with the ToS, which pretty much locks down the source code and the data behind proprietary walls.

    It is currently not clear to me from the website what standard they are developing, and whether it would fall under the data or under the software aspect of the ToS. But neither is very promising to make it match my criteria, unfortunately.

    Do you have further pointers around what OCHEM wants to do to improve interoperability of various QSAR modeling tools? It would be great to be able to upload a QSAR-ML data set directly into OCHEM, or to download an OCHEM data set as QSAR-ML!