Pages

Sunday, June 07, 2015

Dr. J. Alvarsson: Bioclipse 2, signature fingerprints, and chemometrics

Last Friday I attended the PhD defense of, now, Dr. Jonathan Alvarsson (Dept. Pharmaceutical Biosciences, Uppsala University), who defended his thesis Ligand-based Methods for Data Management and Modelling (PDF). Key papers resulting from his work include (see the list below) one about Bioclipse 2, particularly covering his work on plug-able managers that enrich scripting languages (JavaScript, Python, Groovy) with domain specific functionality, which I make frequent use of (doi:10.1186/1471-2105-10-397), a paper about Brunn, a LIMS system for microplates, which is based on Bioclipse 2 (doi:10.1186/1471-2105-12-179), and a set of chemometrics papers, looking at scaling up pattern recognition via QSAR model buildings (e.g. doi:10.1021/ci500361u). He is also author on several other papers and we collaborated on several of them, so you will find his name in several more papers. Check his Google Scholar profile.

In Sweden there is one key opponent, though further questions can be asked by a few other committee members. John Overington (formerly of ChEMBL) was the opponent and he asked Jonathan questions for at least an hour, going through the thesis. Of course, I don't remember most of it, but there were a few that I remember and want to bring up. One issue was about the uptake of Bioclipse by the community, and, for example, how large the community is. The answer is that this is hard to answer; there are download statistics and there is actual use.

Download statistics of the Bioclipse 2.6.1 release.
Like citation statistics (the Bioclipse 1 paper was cited close to 100 times, Bioclipse 2 is approaching 40 citations), download statistics reflect this uptake but are hardly direct measurements. When I first learned about Bioclipse, I realized that it could be a game changer. But it did not. I still don't quite understand why not. It looks good, is very powerful, very easy to extend (which I still make a lot of use of), it is fairly easy to install (download 2.6.1 or the 2.6.2 beta), etc. And it resulted in a large set of applications, just check the list of papers.

One argument could be, it is yet another tool to install, and developers are turning to web-based solutions. Moreover, the cheminformatics community has many alternatives and users seem to prefer smaller, more dedicated tools, like a file format converter, like Open Babel, or a dedicated descriptor calculator, like PaDEL. Simpler messages seem more successful; this is expected for politics, but I guess science is more like politics that we like to believe.

A second question I remember was about what Jonathan would like to see changed in ChEMBL, the database Overington has worked so long on. As a data analyst you are in a different mind set: rather than thinking about single molecules, you think about classes of compounds, and rather than thinking about the specific interaction of a drug with a protein, you think about the general underlying chemical phenomenon. A question like this one requires a different kind of thinking: it needs one to think like an analytical chemist, that worries about the underlying experiments. Obvious, but easy to return too once thinking at a higher (different) level. That experimental error information in ChEMBL can actually support modelling, is something we showed using Bayesian statistics (well, Martin Eklund particularly) in Linking the Resource Description Framework to cheminformatics and proteochemometrics (doi:10.1186/2041-1480-2-S1-S6) by including the assay confidence assigned by the ChEMBL curation team. If John would have asked me, I would have said I wanted ChEMBL to capture as much of the experimental details as possible.

Integration of RDF technologies in Bioclipse. Alvarsson worked on the integration of the RDF editor in Bioclipse.
The screenshot shows that if you click a RDF resource reflecting a molecule, it will show the structure (if there is a
predicte providing the SMILES) and information by predicates in general.
The last question I want to discuss was about the number of rotable bonds in paracetamol. If you look at this structure, you would identify four purely σ bonds (BTW, can you have π bonds without sigma bonds?). So, four could be the expected answer. You can argue that the peptide bond should not be considered rotatable, and should be excluded, and thus the answer would be two. Now, the CDK answers two, as shown in an example of descriptor calculation in the thesis. I raised my eyebrows, and thought: "I surely hope this is not a bug!". (Well, my thoughts used some more words, which I will not repeat here.)

But thinking about that, I valued the idea of Open Source: I could just checked, and took my tablet from my bag, opened up a browser, went to GitHub, and looked up the source code. It turned out it was not a bug! Phew. No, in fact, it turned out that the default parameters of this descriptor excludes the terminal rotatable bonds:


So, problem solved. Two follow up questions, though: 1. can you look up source code during a thesis defense? Jonathan had his laptop right in front of him. I only thought of that yesterday, when I was back home, having dinner with the family. 2. I wonder if I should discuss the idea of parameterizable descriptors more; what do you think? There is a lot of confusion about this design decision in the CDK. For example, it is not uncommon that the CDK only calculates some two hundred descriptor values, whereas tool X calculates more than a thousand. Mmmm, that always makes me question the quality of that paper in general, but, oh well...

There also was a nice discussion about chemometrics. Jonathan argues in his thesis that a fast modeling method may be a better way forward at this moment, and more powerful statistical methods. He presented results with LIBLINEAR and signature fingerprints, comparing it to other approaches. The latter was compared with industry standards, like ECFP (which Clark and Ekins implemented for the CDK and have been using in Bayesian statistics on the mobile phone), and for the first Jonathan showed that LINLINEAR can handle more data than regular SVM libraries, and that the using more training data still improves the model more than a "better" statistical method (which quite matches my own experiences). And with SVMs, finding the right parameters typically is an issue. Using a RBF kernel only adds one, and since Jonathan also indicated that the Tanimoto distance measure for fingerprints is still a more than sufficient approach, which makes me wonder if the chemometrics models should not be using a Tanimoto kernel instead of a RBF kernel (though doi:10.1021/ci800441c suggests RBF may really do better for some tasks, but at the expense of more parameter optimization needed).

To wrap up, I really enjoyed working with Jonathan a lot and I think he did excellent multidisciplinary work. I am also happy that I was able to attend his defense and the events around that. In no way does this post do justice or reflect the defense; it merely reflects that how relevant his research is in my opinion, and just highlights some of my thoughts during (and after) the defense.

Jonathan, congratulations!

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E. L., Steinbeck, C., Wikberg, J. E., Dec. 2009. Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics 10 (1), 397+.
Alvarsson, J., Andersson, C., Spjuth, O., Larsson, R., Wikberg, J. E. S., May 2011. Brunn: An open source laboratory information system for microplates with a graphical plate layout design process. BMC Bioinformatics 12 (1), 179+.
Alvarsson, J., Eklund, M., Engkvist, O., Spjuth, O., Carlsson, L., Wikberg, J. E. S., Noeske, T., Oct. 2014. Ligand-Based target prediction with signature fingerprints. J. Chem. Inf. Model. 54 (10), 2647-2653.