Saturday, April 28, 2012

UML diagrams from the command line

I think I found a nice tool for my Groovy Cheminformatics book! And for Java documentation in general :) There are various tools to draw UML diagrams, such as Umbrello, but the XMI format it uses is not easily edited from the command line, nor easily included in books as source...

But I recently ran into UMLGraph on StackOverflow. I gave it a quick try but could not figure out how to get interfaces to show up, until Bob Cross pointed me to the right example:

Tuesday, April 24, 2012

#NBIC2012: the Chemistry Development Kit in NL

The demos of Bioclipse-OpenTox went well today (there is more opportunity to drop by tomorrow). Yesterday I installed Ambit2 and the ontology server, and created a plugin with DS-OpenTox configuration for the localhost. Nina told me how to add local (ToxTree) algorithms as 'models' and register them with the local ontology server, allowing Bioclipse-DS to find them (this assumes that the first command creates the first model):

    curl -X POST http://yourhost:8080/ambit2/algorithm/toxtreecramer
    curl -X POST http://localhost:8080/ontology -d \ 

But tomorrow I will also talk about CDK development in the Netherlands. It is important to know that the CDK stands on the shoulders of libraries developed earlier by Christoph Steinbeck, like his CompChem libraries, which go back to 1997! I cannot stress enough that his and Dan Gezelter's early Open Source work from those days made it possible for me, as a student, to join the scene. It was a huge honor to found the CDK together with them, back in 2000!

Also many thanx to Miguel, who is co-author of the abstract and not present at NBIC2012, but who has contributed a lot of work to the CDK, both from within Christoph's group and from Leiden!

The Chemistry Development Kit (CDK) is an Open Source (Lesser GPL) library for cheminformatics, providing key functionality to handle and compute properties of chemical structures. The CDK was founded in 2000 by an international group with a basis in earlier open source tools, like Jmol and JChemPaint. The CDK is an international, open project, which has had contributions from The Netherlands as early as 1999. It is used in cheminformatics fields like drug discovery, and bioinformatics fields such as metabolomics and toxicology.

This presentation will detail the current functionality of the CDK and how that was used to support methods in published studies, as well as providing an overview of tools that use the CDK as underlying library. Software packages that provide cheminformatics functionality using the CDK include (plugins for) the workflow environments Taverna and KNIME, the graph network analysis software Cytoscape, general tools like Bioclipse and rCDK, and many dedicated tools, including MEF and MZmine2, tools used in metabolomics.

Currently, the CDK is an internationally developed project with contributions from many academic and industrial institutes, with core developers at Leiden University, Maastricht University, the European Bioinformatics Institute, in the United States, and elsewhere.

Monday, April 23, 2012

#NBIC2012: Bioclipse-OpenTox

Tomorrow and Wednesday at the #NBIC2012 meeting (add your Twitter account if you're speaking or attending), I will demo Bioclipse-OpenTox during the Application Showcase sessions (all abstracts) (Tue: 10:00-11:15 and 14:50-15:50, Wed: 10:30-11:30 and 14:50-15:50). If you cannot attend (or just do not like Lunteren), watch these screencasts.


Computational predictive toxicology draws knowledge from many independent sources, providing a rich support tool to assess a wide variety of toxicological properties. A key example would be for it to complement alternative testing methods. The integration of Bioclipse and OpenTox permits toxicity prediction based on the analysis of chemical structures, and visualization of the substructure contributions to the toxicity prediction. This demo will focus on the application of Bioclipse in predictive toxicology, and show the interactive, visualization, and cloud computing features. There is an additional option for attendees to learn about how they can add their own predictive models.

Willighagen, E.L., Jeliazkova, N., Hardy, B., Grafström, R.C., Nov. 2011. Computational toxicology using the OpenTox application programming interface and bioclipse. BMC Research Notes 4 (487).

Thursday, April 19, 2012

Working against CDK interfaces instead of implementations

For quite some time now, we have been working on factoring out the data module implementation of the interfaces. These classes were originally all we had, even before interfaces were introduced. Since then, the CDK adopted other implementations, like silent and datadebug. In time, more and more code became independent of implementations, and I just finished another set of eight patches so that even qsarmolecular is independent from the implementations. Thus, now you can even calculate descriptors with the silent implementation.
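
As a minimal sketch of what "independent of implementations" means in practice: the toy descriptor below only sees an interface, so it works unchanged with either implementation. All names here (`Atom`, `NotifyingAtom`, `QuietAtom`, `countCarbons`) are hypothetical stand-ins for illustration and are not the CDK classes; the "quiet" variant merely mimics the idea of an implementation that skips change notifications.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface, for illustration only (NOT a CDK interface).
interface Atom {
    String getSymbol();
    void setSymbol(String symbol);
}

// One implementation: records a notification on every change.
class NotifyingAtom implements Atom {
    private String symbol;
    final List<String> events = new ArrayList<>();

    NotifyingAtom(String symbol) { this.symbol = symbol; }

    public String getSymbol() { return symbol; }

    public void setSymbol(String symbol) {
        this.symbol = symbol;
        events.add("symbol changed to " + symbol);
    }
}

// Another implementation: same interface, but no notifications.
class QuietAtom implements Atom {
    private String symbol;

    QuietAtom(String symbol) { this.symbol = symbol; }

    public String getSymbol() { return symbol; }

    public void setSymbol(String symbol) { this.symbol = symbol; }
}

class InterfaceDemo {
    // A toy "descriptor" that depends only on the interface,
    // never on a concrete implementation class.
    static int countCarbons(List<? extends Atom> atoms) {
        int count = 0;
        for (Atom atom : atoms) {
            if ("C".equals(atom.getSymbol())) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Atom> notifying = Arrays.asList(new NotifyingAtom("C"), new NotifyingAtom("O"));
        List<Atom> quiet = Arrays.asList(new QuietAtom("C"), new QuietAtom("C"));
        System.out.println(countCarbons(notifying));
        System.out.println(countCarbons(quiet));
    }
}
```

The design point is that only the code that creates the objects needs to know which implementation is in use.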

This is the dependency graph I created from this patch:

The data module is found in the middle right, and still a few modules depend on it. These modules, like pdb and libiomd, often have classes extending classes from the data module. These are much harder to make independent.

The patches do a few things, including:

  • make Elements use a custom (immutable) IElement
  • introduce a new fragment module allowing removal of some last dependencies on extra

Sunday, April 15, 2012

Dereferencable InChIs: OpenMolecules RDF

About four and a half years ago, I started OpenMolecules RDF, a spin-off from Chemical blogspace (Cb, which is still up and running thanks to Peter Maas!) where I started using InChIs in URIs. My interest came from the dereferencability: the ability to take an InChI and find information about the chemical structure represented by it. Information about anything is scattered around the internet, so we need something decentralized. Moreover, at the time searching for InChIs with search engines like Google did not work well at all: InChIs were tokenized in inconvenient ways.

Originally, these URIs for InChIs were provided (and still are) by Cb, five years ago this July:

for which soon after a separate domain was instantiated (thanx to Geoff!):

Mind you, OpenMolecules RDF is a decent citizen of the Linked Open Data network, though not much linked to. The ChEMBL-RDF data does link to it, and I would love to hear if there are other link sets pointing there. On the outlinking side, it points to ChEBI (via Bio2RDF), DBPedia, ChemSpider (for 10k structures), the NMRShiftDB, and Cb itself. This post describes the adding of the link to DBPedia.

In the past few years, I have written up bits on OpenMolecules RDF. The main reference is our chapter in Beautiful Data [1], where I used the URIs for the solubility data. It was later also described in the Linking the Resource Description Framework to cheminformatics and proteochemometrics paper [2] and another book chapter [3].

This blog features a few more use cases, such as the ability to use these URIs to bookmark molecules or to annotate them with tags with Connotea (which resulted in a nice lunch with the Nature people at the time). The link to Connotea is disabled at the moment, though.

At this moment the system still holds, though there is a problem in that browsers can put practical limits on URI length, which limits the maximum size of the InChI. Virtuoso does this too.

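
To make the URI length concern concrete, here is a small sketch that puts an InChI into a URI and checks the result against a length limit. The base URI (`http://example.org/inchi/`) and the limit of 2000 characters are assumptions chosen for illustration; they are not the actual OpenMolecules RDF address, nor a documented browser or Virtuoso limit. Whether the real service percent-encodes the InChI this way is also an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class InchiUriDemo {
    // Hypothetical values, for illustration only.
    static final String BASE_URI = "http://example.org/inchi/";
    static final int MAX_URI_LENGTH = 2000;

    // Percent-encodes the InChI and appends it to the base URI.
    static String toUri(String inchi) {
        try {
            return BASE_URI + URLEncoder.encode(inchi, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException("UTF-8 not supported", e);
        }
    }

    // True if the resulting URI stays within the assumed limit.
    static boolean fitsInUri(String inchi) {
        return toUri(inchi).length() <= MAX_URI_LENGTH;
    }

    public static void main(String[] args) {
        String methane = "InChI=1S/CH4/h1H4";
        System.out.println(toUri(methane));
        System.out.println(fitsInUri(methane));
    }
}
```

Note that encoding characters like `/` and `=` makes the URI longer than the InChI itself, so large structures hit any such limit sooner than their raw InChI length suggests.
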
  1. Bradley, J. C.; Guha, R.; Lang, A.; Lindenbaum, P.; Neylon, C.; Williams, A.; Willighagen, E. L. Beautifying Data in the Real World. In Beautiful Data; Segaran, T.; Hammerbacher, J., Eds.; O'Reilly Media, Inc.: Sebastopol, US, 2009; Chapter 16.
  2. Willighagen, E.; Alvarsson, J.; Andersson, A.; Eklund, M.; Lampa, S.; Lapins, M.; Spjuth, O.; Wikberg, J. Linking the Resource Description Framework to cheminformatics and proteochemometrics. Journal of Biomedical Semantics 2011, 2, S6.
  3. Guha, R.; Spjuth, O.; Willighagen, E. Collaborative Cheminformatics Applications. In Collaborative Computational Technologies for Biomedical Research; John Wiley & Sons, Inc.: 2011; Chapter 24, pages 399-422.

Saturday, April 14, 2012

CDK regressions in master

A short note about the state of CDK master. As you may have heard, master has now forked beyond the cdk-1.4.x branch to such an extent that blindly merging no longer works, even in a git world; I now have to do this manually, and am experiencing merge conflicts frequently.

Obviously, this forking is for the better of the CDK and master has a number of API improvements. One of those improvements is the removal of the IMolecule and IMoleculeSet interfaces, which did not contain any methods, and were only about semantics.

However, we cannot recommend the master branch for production use. Not just because the API is not stable yet, but also because we still have too many regressions; thirteen to be precise (79+15 = 94 fails and errors in master, versus 72+9 = 81 in cdk-1.4.x):

(The first column is the CDK version, the second the compile date, the third the total number of unit tests in the suite, followed by the fails and errors in the next two columns. The last, truncated column has PMD warnings.)

Help identifying the source code lines that cause these additional thirteen fails and errors is most welcome!

BTW, this screenshot of SuperNightly also nicely shows work on the smarts module: a few unit tests were fixed (so, some failing unit tests did not in fact reflect bugs), and the 'Sc' parsing bug is fixed. Thanx to Dazhi!

Tuesday, April 10, 2012

"Emerging practices for mapping and linking life sciences data using RDF"

The paper "Emerging practices for mapping and linking life sciences data using RDF" (doi:10.1016/j.websem.2012.02.003) is now available online. I contributed a section on the original workflow for creating ChEMBL triples, and contributed to the section about open licensing, referring to CCZero and the Panton Principles. Happy reading!

(Yes, it is indeed an Elsevier journal...)

Saturday, April 07, 2012

Lördags goodies #1: a Chemical Identifier Resolver plugin for Bioclipse

Saturday is a day on which I am normally quite tired, but my important inbox has gone down sufficiently that I treated myself to some lördags goodies. So, I hacked up a Bioclipse plugin to interact with Markus Sitzmann's Chemical Identifier Resolver (here are his recent ACS slides). I'll post the code to my GitHub account shortly, and here's the obligatory screenshot:

The matching BSL script:"morphine"))

Also note the script log just above, where I searched by name. This can return multiple hits, in which case I get an HTML page, which breaks the command. For now, I will inspect the content: if it is not HTML, I read it into a structure, and otherwise I fail with a BioclipseException.
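
The content check described above could look roughly like the sketch below. This is not the plugin code: `ResolverException` is a hypothetical stand-in for BioclipseException (made unchecked here to keep the sketch short), and the "starts with a doctype or `<html>` tag" test is an assumed heuristic, not something the resolver documents.

```java
import java.util.Locale;

class ResolverResponseDemo {

    // Hypothetical stand-in for BioclipseException, so the sketch is self-contained.
    static class ResolverException extends RuntimeException {
        ResolverException(String message) { super(message); }
    }

    // Heuristic: treat the body as HTML if it starts with a doctype or an <html> tag.
    static boolean looksLikeHtml(String body) {
        String start = body.trim().toLowerCase(Locale.ROOT);
        return start.startsWith("<!doctype html") || start.startsWith("<html");
    }

    // Returns the body unchanged when it can be handed on to a structure reader,
    // and fails when the response is empty or an HTML page.
    static String requireStructureData(String body) {
        if (body == null || body.trim().isEmpty()) {
            throw new ResolverException("Empty response");
        }
        if (looksLikeHtml(body)) {
            throw new ResolverException("Got an HTML page instead of a structure");
        }
        return body;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("<html><body>hits</body></html>"));
        System.out.println(looksLikeHtml("InChI=1S/CH4/h1H4"));
    }
}
```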

A typical QSAR study (cito:citesAsAuthority)

I use CiTO to keep track of how the CDK is cited and used, and just looked at a typical QSAR paper. Here are my comments on "Study of indole derivative inhibitors of Cytosolic phospholipase A2α based on Quantitative Structure Activity Relationship", by Lu et al (doi:10.1016/j.chemolab.2011.11.011). Normally, I am fairly short in these reviews, which I publish via the CDK Google+ page, briefly describing what CDK functionality is being used. But this time the post became a more substantial review, so I decided to put it here too, and use ResearchBlogging, which I haven't done in a while.

The paper by Lu et al is a typical QSAR paper, with less than 50 compounds, hundreds of descriptors, and some machine learning. They cite the CDK as a free tool to calculate descriptors, but use something else. The article compares PLS, ANN, and SVM, in the typical bad way, by not splitting out the effect of the kernel (RBF) from the regression model, making the comparison pretty uninformative.

If I scanned the paper correctly, they use a single test set, with LOO cross-validation for modeling method parameter estimation. The test set compounds are picked at the outer sides of the end point range, and no information is given on the variance in R2 and Q2 statistics. BTW, these two statistics are surprisingly close to each other (for each method separately). I wonder if that applies to all possible test sets, and some bootstrapping seems in order here.
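
To illustrate what "some bootstrapping" could mean here, the sketch below resamples a test set with replacement and reports the mean and standard deviation of R2 over the resamples. It is a generic bootstrap under stated assumptions, not the procedure from the paper: it only covers the test-set R2 (not the LOO-based Q2), the class and method names are made up, and the numbers in `main` are invented for demonstration.

```java
import java.util.Random;

class BootstrapR2Demo {

    // R^2 = 1 - SS_res / SS_tot, computed over the observations selected by idx.
    static double r2(double[] observed, double[] predicted, int[] idx) {
        double mean = 0.0;
        for (int i : idx) mean += observed[i];
        mean /= idx.length;
        double ssRes = 0.0;
        double ssTot = 0.0;
        for (int i : idx) {
            ssRes += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
            ssTot += (observed[i] - mean) * (observed[i] - mean);
        }
        return 1.0 - ssRes / ssTot;
    }

    // Returns {mean, standard deviation} of R^2 over bootstrap resamples.
    static double[] bootstrap(double[] observed, double[] predicted, int rounds, long seed) {
        Random random = new Random(seed);
        int n = observed.length;
        double sum = 0.0;
        double sumSq = 0.0;
        int used = 0;
        // Cap the attempts so degenerate input cannot loop forever.
        for (int attempt = 0; attempt < rounds * 10 && used < rounds; attempt++) {
            int[] idx = new int[n];
            for (int i = 0; i < n; i++) idx[i] = random.nextInt(n);
            double value = r2(observed, predicted, idx);
            // Skip resamples in which all observed values are identical (SS_tot = 0).
            if (Double.isNaN(value) || Double.isInfinite(value)) continue;
            sum += value;
            sumSq += value * value;
            used++;
        }
        if (used == 0) throw new IllegalArgumentException("No usable resamples");
        double mean = sum / used;
        double variance = Math.max(0.0, sumSq / used - mean * mean);
        return new double[] { mean, Math.sqrt(variance) };
    }

    public static void main(String[] args) {
        // Invented example values, not data from the paper.
        double[] observed = { 4.1, 5.0, 5.6, 6.2, 6.9, 7.4, 8.0, 8.8 };
        double[] predicted = { 4.4, 4.8, 5.9, 6.0, 7.3, 7.1, 8.3, 8.5 };
        double[] stats = bootstrap(observed, predicted, 1000, 42L);
        System.out.println("mean R2 = " + stats[0] + ", sd = " + stats[1]);
    }
}
```

With a test set this small, the spread over resamples gives at least a rough idea of how much a single reported R2 can be trusted.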

Also, stepwise MLR was used for descriptor selection, thus prior to statistical modeling, and it seems to me PLS, ANN, and SVR were performed on this subset! Well, that makes the comparison even less relevant, as PLS does not require such prior selection. Moreover, it is known that stepwise MLR easily leads to local minima, not to the optimal combination of descriptors.

Lu, X., Ji, D., Chen, J., Zhou, X., & Shi, H. (2012). Study of indole derivative inhibitors of Cytosolic phospholipase A2α based on Quantitative Structure Activity Relationship. Chemometrics and Intelligent Laboratory Systems. DOI: 10.1016/j.chemolab.2011.11.011

Friday, April 06, 2012

CDK 1.4.9: the changes, the authors, and the reviewers

This quick release does not indicate emergency fixes for CDK 1.4.8, but results more from a timing issue. It addresses some practical problems in our development flow. The earliest (bottom most) three patches have to do with the submission of the renderextra patch, while several others result from development work around the DeduceBondSystemTool. The third part is a fix for when the PubchemFingerprinter is used in a threaded environment. Additionally, there is a small fix in the build system to ensure that the dist-test-large includes all data files and core test classes.

The changes
  • Fixed bond order assignment when using the SilentCOB 1e936d0
  • Double bond orders are not properly assigned when using the SilentCOB fb37eab
  • Added testing of the number of double bonds found (fixes #3514176) a6597e3
  • But the test data should be added too 059da00
  • Added the missing cdk-test.jar (patch by Jonty Lawson) df12e2e
  • Updated PubchemFingerprinter to not use static variables. Allows it to be used in multithreaded scenarios 9bd6b45
  • Removed two unused imports, one of which causes trouble with free Java systems 7c5d18b
  • Added a unit test to check for multithreaded usage in PubchemFingerprint. Tests for bug 3510588 57a9c3f
  • Ensure that the totalBounds object returned is non-null c23740b
  • (renderbasic). Made AverageBondLengthCalculator public 1702f3e
  • (renderbasic). Pass the RendererModel to contained renderers. c59f344
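
The PubchemFingerprinter change in this list (no more static variables) follows a general pattern that can be sketched as below. `ToyFingerprinter` is a hypothetical class, not the CDK code, and its "fingerprint" is a meaningless toy hash; the point is only that working state lives in an instance field, so each thread that uses its own instance cannot overwrite another thread's bits. The sketch shows the fixed pattern and does not try to reproduce the original race.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical fingerprinter, for illustration only (NOT the CDK PubchemFingerprinter).
class ToyFingerprinter {
    // Instance state: a 'static' field here would be shared by all threads,
    // letting concurrent calls clear and set each other's bits.
    private final BitSet bits = new BitSet(64);

    BitSet fingerprint(String text) {
        bits.clear();
        for (char c : text.toCharArray()) bits.set(c % 64);
        return (BitSet) bits.clone(); // hand out a copy, keep the working state private
    }
}

class ThreadedFingerprintDemo {
    // Runs one fingerprinter per task and checks all results against a reference.
    static boolean consistentAcrossThreads(int threads, final String text) {
        final BitSet expected = new ToyFingerprinter().fingerprint(text);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<BitSet>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                results.add(pool.submit(() -> new ToyFingerprinter().fingerprint(text)));
            }
            for (Future<BitSet> result : results) {
                if (!expected.equals(result.get())) return false;
            }
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(consistentAcrossThreads(8, "CCO"));
    }
}
```
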

The authors

7  Egon Willighagen
2  Arvid Berg
2  Rajarshi Guha

The reviewers

4  Egon Willighagen 
3  Nina Jeliazkova 
1  Arvid Berg