Monday, June 27, 2011

AMBIT's SMILES depict service

We all know Daylight's Depict service, right? But did you also know the AMBIT version (doi:10.1186/1758-2946-3-18, by IdeaConsult Ltd), which also uses the CDK (Open Source) and Xemistry's CACTVS toolkit (free for academic use)?

git reflog: so, what did just happen?!

Chris Aniszczyk made me aware of git reflog. I am not sure I grasp the full power yet, but I already like the fact that it shows me what I just did. For example, this is my latest output:
98e4cae HEAD@{0}: commit: Added copyright owner line for previous patch
280381d HEAD@{1}: origin/cdk-1.4.x: updating HEAD
bce19a0 HEAD@{2}: am: Fixed potential NPE.
280381d HEAD@{3}: commit (amend): Fixed potential NPE.
534dc09 HEAD@{4}: am: Fixed potential NPE.
45b67e5 HEAD@{5}: origin/cdk-1.4.x: updating HEAD
1fa4bb9 HEAD@{6}: commit: Fixed potential NPE.
cf399c8 HEAD@{7}: commit: Fixed potential NPE.
45b67e5 HEAD@{8}: HEAD~1: updating HEAD
5f8ecf9 HEAD@{9}: am: Fixed potential NPE.

What I did in these steps was to apply part of a bug fix patch by Dmitry (EPO, The Hague). In fact, the patch actually fixed two separate problems. One part fixed a NullPointerException in code by me, and that code looks fine; the other part is in code not written by me, and I cannot foresee the consequences of that patch. A unit test would help, and so would a review by the original author.

So, the goal was to apply part of the patch. I downloaded Dmitry's patch from GitHub (see GitHub Tip: download commits as patches, and the hash in the bug report). Then I applied it to my local repository (5f8ecf9, see the above list). I undid the patch with 'git reset HEAD~1' (45b67e5) to make the patch unstaged. Then I committed the two patch parts separately (cf399c8 and 1fa4bb9). I then rebased on origin/cdk-1.4.x to ensure my cdk-1.4.x branch is up to date (45b67e5, and 534dc09 I guess). Then I signed off the part of the patch that I can review (280381d, and bce19a0?), and finally I updated from the upstream repository once more (280381d) and then applied a patch to add Dmitry in the Copyright header as co-author of this class (98e4cae, see also Making patches; Attribution; Copyright and License.).
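To see at a glance which commits HEAD pointed at more than once, the reflog output above can be parsed. This is only a minimal sketch: the function names and the regular expression are my own assumptions about the line layout shown in this post, not part of git.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the output above: "<hash> HEAD@{n}: <action>: <message>"
REFLOG_LINE = re.compile(r"^([0-9a-f]+) HEAD@\{(\d+)\}: ([^:]+): (.*)$")

SAMPLE = """\
98e4cae HEAD@{0}: commit: Added copyright owner line for previous patch
280381d HEAD@{1}: origin/cdk-1.4.x: updating HEAD
bce19a0 HEAD@{2}: am: Fixed potential NPE.
280381d HEAD@{3}: commit (amend): Fixed potential NPE.
534dc09 HEAD@{4}: am: Fixed potential NPE.
45b67e5 HEAD@{5}: origin/cdk-1.4.x: updating HEAD
1fa4bb9 HEAD@{6}: commit: Fixed potential NPE.
cf399c8 HEAD@{7}: commit: Fixed potential NPE.
45b67e5 HEAD@{8}: HEAD~1: updating HEAD
5f8ecf9 HEAD@{9}: am: Fixed potential NPE.
"""


def parse_reflog(text):
    """Return (hash, index, action, message) tuples for every line that matches."""
    entries = []
    for line in text.splitlines():
        match = REFLOG_LINE.match(line.strip())
        if match:
            commit, index, action, message = match.groups()
            entries.append((commit, int(index), action, message))
    return entries


def revisited(entries):
    """Map each hash that occurs more than once to its reflog positions."""
    positions = defaultdict(list)
    for commit, index, _action, _message in entries:
        positions[commit].append(index)
    return {commit: idx for commit, idx in positions.items() if len(idx) > 1}


if __name__ == "__main__":
    for commit, idx in revisited(parse_reflog(SAMPLE)).items():
        print(commit, "was HEAD at positions", idx)
```

For the log above this reports 280381d and 45b67e5, the two states HEAD returned to.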

Now, apparently, you can edit this log too... but I am not sure why you would go about doing that, nor what effect that will have on the commit history...

Sunday, June 26, 2011

Recover experimental data from a heat map?

Is there an R package (or something similar) available to recover experimental data from a heat map like the one shown here? Like the digitize package does for scatter plots...

I asked this question on BioStar too and people were kind enough to point out that ideally, I could just email the corresponding author and get the data. Ideally, the data would have been available from the supplementary information anyway. It was also pointed out that text-mining is inaccurate, and that I would not know the units and/or transformations. Obviously, my bad for not adding these things to the question; they felt beside the point.

Well, I guess it is a win for science that people think Open Data is the norm already :)

My machine-readable ELN

Now with embedded PDFs.

Saturday, June 25, 2011

From the archives: my ICCS 2005 poster

Julio and Gert placed their ICCS 2011 work online, and today I was going through old CDs (see From the archives: Chemical Web, and the CDK in 2004 and Chiral Molecules: how cool is the SEM picture?). I also ran into my ICCS 2005 poster, and because that too was before I started blogging, I never posted it online. So, here it is, based on my thesis:

Chiral Molecules: how cool is the SEM picture?

I just found my student thesis in Organic Chemistry from my Nijmegen education. It's in Dutch, but I'll explore if I can upload this to Radboud University's DSpace. But I could not resist sharing this nice scanning electron microscope picture :) Look at how those amphiphiles form a nice chiral ribbon!

This disk also has quite a few raw spectra (as TIFF images). I'll try to figure out what to do with those. Uploading as Open Data to ChemSpider is tempting, but I want to make sure people can easily download the collection too (read: programmatically).

From the archives: Chemical Web, and the CDK in 2004

I am working my way through an enormous pile of CDs and DVDs, a mix of RO and RW disks, selecting those I will throw away, like old GParted, KNOPPIX and Debian install disks, as well as a few legal copies of old Microsoft software (like a Win98 boot disk). I also found a disk with two presentations I gave in 2004. They are fun to read. The one I gave at ExemplarChem is a bit sad, as I presented stuff there I developed even before 2004, which is still not common ground today :/

Also note the mention of DADML, something I did for the Woordenboek Organische Chemie, to standardize access to remote databases... well, let's hope I can find my Qiwi presentation in Washington in 2000 too. Damn... I keep amazing myself. [/sarcasm].

BTW, I also passed a CD with quite a bit of software that was around in the late nineties. Quite a few interesting things. A shame I cannot share this, because it was not Open Source :/

Sunday, June 19, 2011

CDK 1.4.0 release blockers

OK, after an hour or two of happily browsing through our SourceForge bug tracker, and the Nightly reports, I identified 20 release blockers that I would like to see fixed before the CDK 1.4.0 release. Some are easy ones, but I may find one or two more later on. Others may report new problems too. But overall: doable.

CDK 1.3.12: the changes

I have uploaded CDK 1.3.12 to SourceForge. This is an important milestone release, as it contains the last bit of CDK-JChemPaint code to render molecules. Now for real :) It also contains the new volume descriptor.

The milestone bit is in the fact that with the addition of this extra bit of CDK-JChemPaint, the CDK 1.4.x series is now feature complete, moving it into freeze mode. For development, this means the following two things (in short): 1. no API changes are allowed, and 2. new functionality requires double reviewing.

In practice, this freeze means that in the coming weeks we'll mostly see small clean up work, partly in the build system, partly in JavaDoc fixes, etc, and hopefully a few bug fixes for the open list of bug reports. It also means that continued development of the CDK 1.2.x series has come to a stop, and that we will likely see a CDK 1.5.0 release soon too, initiating the next development cycle.

What will CDK 1.6 bring? Hopefully, the CDK-JChemPaint editing functionality, perhaps alternative aromaticity models (see the cdk-devel mailing list), and hopefully more code from CDK-based products like AMBIT, PaDEL, ScaffoldHunter, and Craft. The challenge here is to find patches and then port them back into the main library.

CDK Module dependencies #3

Jonathan is here with me to work on his fingerprint project. He asked about CDK modules, which we use to control dependencies, within the CDK, as well as from the CDK on top of third-party libraries. I previously wrote this about it:
Using this modularization we can control the cleanness of our code, and keep it small and extensible. For example, the full CDK jar without third-party libraries is already 16 MB large. With this modularization we can limit the number of dependencies, and allow people to pick what parts they need. The modular building ensures no unwanted dependencies sneak in.

The last overview of CDK module dependencies is a bit outdated. It is easy to recreate from the source code repository, using BeanShell and Graphviz with something like:

$ export CLASSPATH=jar/jgrapht-0.6.0.jar
$ bsh tools/deptodot.bsh --cdkLibs >
$ dot -Tpng -O

The current master gives this diagram:

(The sinchi module no longer exists, but clearly is still picked up from somewhere :)

It is also worth noting how this modularization is defined. We use JavaDoc for this, and in particular by adding a @cdk.module tag to the class JavaDoc, which is explained in this CDK News paper.
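As an illustration of what such a tag looks like and how it can be read back, here is a minimal sketch. The Java class and the module name "example" are hypothetical, and the regular expression is my own assumption; it is not how the CDK build itself processes the tag.

```python
import re

# Matches a JavaDoc line such as " * @cdk.module example" (pattern is an assumption)
MODULE_TAG = re.compile(r"^\s*\*\s*@cdk\.module\s+(\S+)", re.MULTILINE)

# Hypothetical Java source, only here to show where the tag goes
EXAMPLE = '''\
/**
 * Hypothetical class, not part of the CDK.
 *
 * @cdk.module example
 */
public class ExampleDescriptor {
}
'''


def find_cdk_module(java_source):
    """Return the module named by the first @cdk.module tag, or None if absent."""
    match = MODULE_TAG.search(java_source)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_cdk_module(EXAMPLE))
```

Running something like this over all source files would give a class-to-module mapping, which is the information the dependency diagram is built on.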

Friday, June 17, 2011

Fast Calculation of van der Waals Volume as a Sum of Atomic and Bond Contributions

I was recently asked about a volume descriptor in Bioclipse, which is not yet available. Jmol can calculate surfaces, so that was my first thought. However, I then ran into a paper from 2003 by Zhao, called Fast Calculation of van der Waals Volume as a Sum of Atomic and Bond Contributions and Its Application to Drug Compounds (doi:10.1021/jo034808o).

The paper presents a very simple mathematical model, which approximates the molecular volume by a sum of atomic contributions and three correction terms: one for atom-atom overlap, via the number of bonds, and two based on the number of aromatic and non-aromatic rings. The paper is clearly written, and the mathematics simple.

One problem with the publication, though, is the numbers in the main text. They are wrong. I started off using the coefficients of the equations presented in the paper, but very soon ran into problems when I was writing up unit tests based on the volumes for compounds given as examples. In fact, the numbers in the main text are internally inconsistent. Not good. I believe it is partly caused by rounding, but that does not correct for the differences fully.

Fortunately, the Excel sheet in the supplementary information has the exact numbers, and those are numerically consistent.
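The shape of the model can be sketched as below, assuming (as I read the paper) that all three correction terms are subtracted from the atomic sum. The function name and parameters are my own, and the numbers in the usage part are round placeholders, not the published coefficients; real values should be taken from the supplementary spreadsheet.

```python
def vdw_volume(atom_counts, atom_volumes, n_bonds, n_aromatic_rings,
               n_nonaromatic_rings, bond_term, aromatic_term, nonaromatic_term):
    """Sum of atomic contributions minus bond and ring corrections.

    atom_counts  -- element symbol -> number of atoms, e.g. {"C": 1, "H": 4}
    atom_volumes -- element symbol -> atomic contribution (caller supplied)
    *_term       -- correction coefficients (caller supplied)
    """
    atomic_sum = sum(count * atom_volumes[element]
                     for element, count in atom_counts.items())
    return (atomic_sum
            - bond_term * n_bonds
            - aromatic_term * n_aromatic_rings
            - nonaromatic_term * n_nonaromatic_rings)


if __name__ == "__main__":
    # Placeholder values for illustration only, NOT the values from the paper
    placeholder_volumes = {"C": 20.0, "H": 7.0}
    methane = vdw_volume({"C": 1, "H": 4}, placeholder_volumes,
                         n_bonds=4, n_aromatic_rings=0, n_nonaromatic_rings=0,
                         bond_term=6.0, aromatic_term=15.0, nonaromatic_term=4.0)
    print(methane)
```

Keeping the coefficients as arguments makes it easy to write the unit tests against the spreadsheet values rather than against the rounded numbers in the main text.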

The paper has been cited 46 times now, so, a fast volume descriptor seems relevant indeed. I am not sure how fast it will propagate to Bioclipse, as I do not have time soon to update the CDK version of Bioclipse (the major part of which is to ensure the Bioclipse-JChemPaint editor does not get broken, again).

Another thought about this paper, is that it is using the evil aromaticity concept, where the authors forgot to mention when they consider a ring to be aromatic.

Zhao, Y., Abraham, M., & Zissimos, A. (2003). Fast Calculation of van der Waals Volume as a Sum of Atomic and Bond Contributions and Its Application to Drug Compounds The Journal of Organic Chemistry, 68 (19), 7368-7373 DOI: 10.1021/jo034808o

Tuesday, June 14, 2011

Importing Nanotoxicity Data with SPARQL into R for analysis

Not so long ago I wrote about [i]mporting RDF input in R for analysis. I am collecting nanotoxicology data in a Semantic MediaWiki with the RDFIO extension installed (by Samuel), allowing me to SPARQL that data directly from R. There is nothing much structural to visualize at this moment, so I'm skipping the Bioclipse intermediate. I did show some visualization of the data itself in the wiki, earlier this week.

Anyway, release 1.2 of rrdf is on its way, adding a sparql.remote method for running SPARQL queries at remote repositories. It also has a patch by Ryan Kohl, to support CONSTRUCT-like SPARQL queries.

I haven't aligned my wiki with any ontology yet, so the properties have SMW-like resource form, which makes the SPARQL a bit weird looking. Other than that, the code to pull in nanotoxicology data from my data notebook now looks like:

library(rrdf)  # provides sparql.remote

endpoint = ""

query = paste("PREFIX w: ",
 "SELECT ?min ?max ?zeta WHERE ",
 "{ ?inst a w:Category-3AMetalOxides . ",
 "  OPTIONAL { ?inst w:Property-3AHas_Size_Min ?min . }",
 "  OPTIONAL { ?inst w:Property-3AHas_Size_Max ?max . }",
 "  OPTIONAL { ?inst w:Property-3AHas_Zeta_potential ?zeta . }",
 "}")  # close the WHERE block and the paste() call

data = sparql.remote(endpoint, query)

Which results in a data matrix that looks like (mind you, this matrix is numeric, needing a bit of rrdf 1.3 functionality):
> data
      min max  zeta
 [1,]  15  90    NA
 [2,]  15  90    NA
 [3,]  15  90    NA
 [4,]  15  90    NA
 [5,]  15  90    NA
 [6,]  15  90    NA
 [7,]  15  90    NA
 [8,]  15  90    NA
 [9,]  15  90    NA
[10,]  15  90    NA
[11,]  15  90    NA
[12,]  15  90    NA
[13,]  10 100  34.2
[14,]  30  60 -17.3
[15,]  20  30   1.8
[16,]  15  90    NA

So, now it is time for some PCA.

Sunday, June 12, 2011

CDK-JChemPaint #7: rendering molecules as SVG

A very long time ago, Scalable Vector Graphics promised to revolutionize images on the web. After initial cool work (including CMLSnap: animated chemical reactions by Peter's group!), things cooled down. There was simply a lack of support in browsers. Things have changed. SVG is much better supported now, and people are starting to use SVG again. Like Noel, visualizing 100 molecules in one blog post.

The new CDK-JChemPaint code has been refactored such that the original code for the core functionality is now independent from the drawing toolkit. And we have two well-developed implementations, one for Swing/AWT (used by the JChemPaint applet), and one for SWT (used by Bioclipse). And there is one that generates SVG too, written by Gilleain as a proof of principle.

The code is almost identical to the code for rendering molecules as PNG. We just swap the AWTDrawVisitor for the SVGGenerator:

-renderer.paint(triazole, new AWTDrawVisitor(g2));
+svgGenerator = new SVGGenerator();
+renderer.paint(triazole, svgGenerator);

Additionally, we need to change how we output the results. The code below generates the SVG and the matching HTML snippet:

new File("triazole.svg").append(svgGenerator.getResult())

file = new PrintWriter(new FileWriter(new File("triazole.html")))
file.print("<embed width=\"100\" height=\"100\" src=\"triazole.svg\" />");
file.close() // flush and close, otherwise the HTML may never reach the disk

The result looks like:

I did have to tweak the SVG generated with CDK-JChemPaint 20, so that the HTML ensures the initial scaling, by removing the width and height in the SVG (patch pending). But this output also nicely shows that there are glitches in the generated code. I hope people are interested in contributing to this part of the CDK-JChemPaint patch!
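The manual tweak described above, dropping the width and height from the SVG root so that the embedding HTML decides the size, could be scripted roughly as follows. This is a sketch using Python's standard library; the function name and the example SVG are my own, and it does nothing beyond removing those two attributes.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical stand-in for the generated file, not actual CDK output
EXAMPLE = ('<svg xmlns="http://www.w3.org/2000/svg" width="300" height="300">'
           '<rect width="10" height="10"/></svg>')


def strip_fixed_size(svg_text):
    """Remove width/height from the root <svg> element only."""
    # Keep SVG as the default namespace so no "ns0:" prefixes appear in the output
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    root.attrib.pop("width", None)
    root.attrib.pop("height", None)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_fixed_size(EXAMPLE))
```

Child elements keep their own width and height, so only the overall canvas size is left to the embed tag.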

The full source code of the svgMol.groovy can be found in the Groovy JChemPaint repository.

Saturday, June 11, 2011

CDK 1.3.11: the changes, the authors, and the reviewers

    Hej, wait you sneaky bastard! You just released 1.3.10! You can overdo release often too, you know!
Release 1.3.11 is special. It contains a long list of more than 150 patches, introducing the renderbasic module of the CDK-JChemPaint patch. This new module has an implementation for rendering molecules. Nothing more, nothing less. The JChemPaint applet is based on this code base, and I blogged a few code snippets earlier.
IMPORTANT! While important functionality got included, not everything is there yet. In particular, for Swing/AWT support, we still need to include the renderawt module. That one too, needs some further work. It misses a few unit tests, needs a bit more JavaDoc, and bits of code clean up. In short, you cannot yet use all functionality with CDK 1.3.11 alone. I should have communicated that more clearly. My apologies.

The Authors
This is the result of hard work from Niels Out, Stefan Kuhn, Arvid Berg, Mark Rijnbeek, Gilleain Torrance and me (and as such, a joint project between the groups in Uppsala, at the EBI, and myself). There are also occasional contributions by others that are not to be forgotten!

The Reviewer
Many thanx to Rajarshi for reviewing the patch, giving good comments, and approving it in the end, despite a few remaining shortcomings. (Bug reports welcome! :)

CDK 1.3.10: the changes, the authors, and the reviewers

Release 1.3.10 is not much different from 1.3.9, as we are seriously converging towards CDK 1.4.0 now, with only the big CDK-JChemPaint renderbasic patch waiting. I'm hoping to merge that in this weekend, and to release the first Release Candidate then. The 2nd edition of the Groovy Cheminformatics book, in fact, is already based on 1.3.10, which I released a few days ago. This release has the following changes, mostly contained bug fixes in the 1.2 series:

  • Fixed dependency on specific molecule impl, so now we use IAtomContainer rather than Molecule fbdb989
  • Added a convenience test to see if a parameter has been registered to the model ad04b69
  • Updated JavaDoc checking to OpenJavaDocCheck 0.8 7bfb28b
  • Copy data files into the right folder of the puredist ae69740
  • Added a test to demonstrate the ClassPathException in bug #3305581 d8c4ca6
  • Test if the descriptor results are the same for two implementations (unfortunately, the nonotify IMolecule extends the data IMolecule, so it does not catch bug #3305581) 0f3f147
  • Factored out a method to test whether two descriptor calculations give the same results 861c304
  • Parameterized method to create water to allow alternative implementations 4d7c6b5
  • Updated code to avoid dependency on specific implementation of molecule object. Now use IAtomContainer 7cc943a
  • Added detection of a Te atom type, found in ChEMBL d114114
  • Added unit tests for Te.3 atom type detection aa79a3f
  • Added note about required atom type perception 54d5ff9
  • Added unit tests to calculate tautomers from a handcrafted IAtomContainer 310cd0c
  • Fixed OJDCheck validation 67d76d0

The Authors
13  Egon Willighagen
 2  Rajarshi Guha
 1  Onkar Shinde
The Reviewers
 5  Rajarshi Guha 
 2  Jonathan Alvarsson
 2  Mark Rynbeek
 2  Egon Willighagen 

Groovy Cheminformatics 2nd edition

Update: the fourth edition is out.

OK, I wrapped up the content, mostly finalized last week, after making a few small changes, created a new cover (rather than using a Lulu template), and uploaded things to Lulu:

New content includes:
  • Section 2.3.3: Molecular Formula
  • Section 2.6: IRings
  • Section 7.3: Graph matrices
  • Chapter 10: Molecular properties (mass, TPSA, XLogP)
  • Chapter 11: InChI
  • Section 15.2: CDK 1.0 to 1.2
  • Appendix A: Atom Type Lists
  • Appendix B: CDK Authors

The full Table of Contents is available as 'preview' on this Lulu page.

Friday, June 10, 2011

Assessed if I could recommend the Mendeley plugin for OpenOffice

Our institute started using a Mendeley group for its publications recently. And a lot of my colleagues are using Word and EndNote. I use neither. My personal workflow includes LaTeX, BibTeX, and since recently BibLaTeX, and CiteULike (all content mirrored to Mendeley). After a recent talk by Benjamin, I decided to give the Mendeley plugin for OpenOffice a go (in LibreOffice, in fact). It does what it needs to do. I am not sure yet how to customize the display (the equivalent of, for example, unsrt), but I am not so worried about that personally. This screenshot shows what my test looked like.

Update: Steve explained in the comments that picking the right CSL style does the job. The list of additions shows a download for BMC Bioinformatics which will also work for other BMC journals like the J. Cheminformatics, I assume. The result after making that CSL the default:

Thursday, June 09, 2011

Plotting RDF data with a Semantic Media Wiki

I was not aware of this earlier, but data stored in semantic form in a MediaWiki can be plotted using jqplot.

The wiki source for the plot in this screenshot looks like:

{{#ask: [[Category:Measurements]] [[Has Study::{{PAGENAME}}]] [[Has Endpoint::PercentageNonViableCells]]
| ?Has Endpoint Value
| format=jqplotbar
| sort=Has Endpoint Value
| order=ascending
| height=250
| width=600
}}

I am wondering if there is a {{#sparql: equivalent that I can use.

Monday, June 06, 2011

Groovy Cheminformatics 2nd edition soon

In February I released the first edition of my book about writing cheminformatics software in the Groovy language using the CDK. The booklet was thinner than I expected for 72 pages (thin paper), which made the booklet look relatively expensive. Then again, it's not particularly making me rich. As explained before, I hope this will become a source of funding for continued CDK development. Anyways, today I worked hard to address some flaws in the first edition, and to make some further tweaks.

The overall experience should be improved: I am now using the geometry package which should address the lack of whitespace near the top of each page, thanx to Jason Brownlee. I also noted that the bibliography styles abbrev and unsrt cannot be combined, and because this TeX StackExchange answer mentioned biblatex, which I have read about several times now, I decided it was time to dive in. I haven't gone the full biber route yet, and am still using the CiteULike group for this book. I also hacked up automatic wrapping of output from the Groovy scripts in the book, which should further clean up the design. I doubt it is up to Jonathan's standards yet, but it will have to do for now.

On the content side, there are also interesting changes, and in particular the new sections and chapters. Per request, a list of all people who contributed to the CDK is added, as well as an overview of all CDK atom types. New material includes a short discussion on the IRing interface, a chapter on the InChI, words about generating tautomers (thanx to Mark for this new code!), examples on how to calculate various graph matrices, molecular formula, and how to calculate XLogP and TPSA properties.

All in all, the booklet now sums up to 104 pages, whereas the first version had 72. But, it's much too late already, and the alarm goes way too early in the morning, so the new edition will not appear online today.

Oh, and thanx to all who bought a copy of the first edition!

Thursday, June 02, 2011

Productivity Tool: search bar for any JavaDoc HTML

While searching for a way to hide certain Java packages from the standard HTML JavaDoc output, I ran into a nifty tool: an extension for Chrome and Firefox (this one is in fact a userscript, which we use in life sciences too):

You can type a query in the search field, and that content will be filtered accordingly. Once the userscript or extension is installed, it will work on any JavaDoc HTML.

Wednesday, June 01, 2011

Bringing OpenSource to the public: OpenTox in Africa

Barry is in Africa this week to demo OpenTox, and took along a VirtualBox appliance with OpenTox REST and ontology servers and Bioclipse preinstalled, and installed things on several machines from local participants. Bringing open source software to potential users is becoming easier every day! Well done to Roman and Nina for creating the appliance and to Barry for getting scientists in Africa into predictive toxicology. (Yes, Africa is very big, but I have been too lazy to look up where exactly he went :)