Wednesday, March 30, 2011

Reporting missing atom types...

About a month ago, I blogged about how to report new atom types the CDK has no knowledge about yet. Earlier this week I got a great overview from Julio of a few phosphorus atom types, a PDF looking like this:

(Though the hybridization field is empty, I noticed just now.)

Monday, March 28, 2011

Google Summer of Code 2011: some cool ideas

The Google Summer of Code is now a yearly event where university students (including PhD students) can work on open source projects and get paid fairly well (mentors only get a T-shirt :). A few years ago I was a mentor for the KDE project, supervising work on Strigi-Chemistry. Various open source science projects participate again this year (no, no chemistry projects, I'm afraid), but a few come close. Here are my tips for this year:

CDK 1.2.8: the changes, the authors, and the reviewers

CDK 1.2.8 is one more bug fix release in the 1.2 series.

The changes
The changes include another few new atom types and fixes to the recognition of earlier defined atom types, further widening the scope of chemistry recognized by the 1.2 series. It also has a few other minor fixes, as listed below.
  • fixed pdb_atomtypes.xml errors mentioned in, including the nigly vdwRadius for GLN.CD 3ab7c25
  • whitespace-only: alignment of TYR-section 5b51e15
  • Added two new authors f72be81
  • Fixed a concurrency error, caused by the use of a static field which was not supposed to be static, as the classes are instantiated just to allow customization 9f54afc
  • Fixed the use of the proper Convertor 3ea9180
  • Added the Co(3+) atom type (fixes #3093644) c368c3d
  • Added the unit test phosphine for bug #3190151 851c1a6
  • Added a unit test for a phosphor without explicit or implicit hydrogens b35f3bf
  • Updated OrderQueryBond in SMARTS & isomorphism matching so that we correctly match when faced with aromatic bonds (rather than just looking at the bond order) 15f64ff

The authors
4  Egon Willighagen 
2  Jules Kerssemakers 
1  Julio Peironcely 
1  Rajarshi Guha 

The reviewers
5  Rajarshi Guha 
3  Egon Willighagen 

Thursday, March 24, 2011

Supplementary files, publishing, and standards #2

You must read this previous post first.

Now, it is important to realize there are standards at many levels. Open specifications allow people to implement the specification without having to pay fees, run into patents, etc. To me, an Open Specification is something you can take, modify, and propose to the community as a new standard.

Standards themselves are basically orthogonal to that (IMHO, not uncriticized): whether something is a standard is just the result of the community picking up and using the specification. Something doesn't have to be Open to be a standard, nor does it have to have undergone year-long debates (like HTML5). A standard doesn't even have to be fixed to a version (it can be backwards compatible). There are therefore many kinds of standards, including de facto standards.

These distinctions are crucial. Another crucial point is that standards rarely cover everything. For example, it is ridiculous to talk about Excel as a de facto standard for data exchange in science. Now, the Microsoft formats are 'Open Standards' (they sneaked in while the world was complaining). But they are standards at the wrong level for scientific computation: they define a standard container, not any semantics. That's up to the user: "Hey, why should I waste a column on units... everyone knows we put in temperatures as Kelvin, ummm, Fahrenheit, ummm... Kelcius, or what is it again the eurotrash uses?"

This problem holds for very many fields, and you see it in many different formulations. Or take the whole discussion about ScHTML and PDF, on which is more semantic. Or this one: "Oh, let's use a database so that our statisticians can do their work." Been there, done that.

Now, just to end a bit more positively, here's a group of bright, visionary people using the standards at the right levels in this spreadsheet:

Supplementary files, publishing, and standards

The publishing world is slowly changing. A small community has been screaming for more than a decade now (and possibly before that) that data standards in publishing are inadequate. PDF has not helped (fortunately there are replacement initiatives). Even new journals do not do everything right from the start, but at least there is the effort, such as I discussed in these posts:
This week BioMed Central's Iain asked the community how to put their Open Data initiative into practice. There are some good points in the write up, such as:
    Editors and publishers are acutely aware of the limited pool of peer reviewers who are increasingly called upon to help try and ensure the integrity of the published record. The online availability of research data as a supplementary (additional) files has prompted debate about the role of peer review in this non-written material, and indeed the role of journals in publishing this material.
I can very much relate to this problem; I spent about an hour this morning reviewing the Additional files of a paper I was reviewing for BioMed Central. I had quite a few comments, and overall, the section was inadequate.

I tried to reply in the blog, but my comment was marked as SPAM because it had more than 1000 characters (update: this seems to have been manually fixed now, thanx!). So much for constructive comments :) So, here goes:
    Dear Iain,

    thank you for this interesting and important post! I absolutely agree that some standards need to be set. Scientists have been unable to do this, and publishers can distinguish themselves from the competition by doing it right.

    Without going into detail about what 'right' is (I have very strong opinions on that :), what is important for BioMed Central right now is to put the advantages so closely in front of the scientists that they can no longer ignore them, or say 'whatever' (which they do now).

    BMC must therefore demonstrate what this reuse, reproducibility, etc, practically means. So, Goal 0 must be: do something with the 'additional files': process them yourself and 1) associate every single additional file with facts about that file; 2) index them, and create a search engine to search additional files based on their content, *across* all BMC journals; 3) provide alternative download formats, showing what it means to use Open Standards.

    Open Data is not the goal; it's the means to do science better.

    About 1. Every additional file should have a separate web page (or page section), listing not just its size, but also the exact format (MS-Excel 2000, rather than 'Excel'... versioning matters!), metadata present in that file (author, creation date, does it have macros defined, etc), and statistics about that file (number of sheets in the spreadsheet, number of filled cells, etc).

    About 2. It is of utmost importance that we can discover this supplementary information, and it must be easy to search for stuff using free text (e.g. I want to find all additional files across all BMC journals that have 'tryptamine' somewhere in the additional file, even if that information is stored in Excel files *inside* zip files). Current technology, such as Strigi, makes that very easy.

    About 3. As reuse is the key here, the use of Open Standards is important. This could be stressed by showing that files in Open Standards can easily be interconverted, such as spreadsheets into CSV or HTML tables. Just offering alternative download formats makes the 'Additional files' more useful, and encourages the authors to ensure they provide data in the right formats.
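To illustrate that third point: even a minimal converter goes a long way. Here is a small sketch (plain Java, ignoring CSV quoting rules, no real publisher infrastructure assumed) that turns a CSV table into an HTML table, the kind of alternative download format meant above:

```java
import java.util.*;

public class CsvToHtml {

    // Convert simple comma-separated rows (no quoting rules) into an HTML table.
    public static String convert(List<String> csvLines) {
        StringBuilder html = new StringBuilder("<table>\n");
        for (String line : csvLines) {
            html.append("  <tr>");
            for (String cell : line.split(",", -1)) {
                html.append("<td>").append(cell.trim()).append("</td>");
            }
            html.append("</tr>\n");
        }
        html.append("</table>");
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList(
            "compound,melting point (K)",
            "tryptamine,391"
        );
        System.out.println(convert(csv));
    }
}
```

A real service would of course need a proper CSV parser and format detection, but the point stands: the conversion itself is cheap.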

Wednesday, March 23, 2011

Tutorials around the OpenTox standard

Just a quick forward. Roman posted this announcement for tutorials around the OpenTox standard:
    The first tutorial will be held Wednesday, 30 March, at 15:00 CET, and aiming at Java developers who want to learn how to build an OpenTox API compliant Web service (some acquaintance with Restlet, and basic knowledge of the OpenTox REST API, is necessary). Participation in these online events involves no registration fee. For more information, as well as to register, visit

Why I like Linux, it's fast! Or, a quick typo fix in the CDK.

Last week there was a short git tutorial, resulting in a small test patch (feel free to ping me if you would like git training for the CDK too), giving some commit and review fame to Jules and Gilleain. I ran into another 'Contructs' typo today, and decided to fix it. Linux is fast: writing this blog post took more time than fixing the typo:
$ git checkout -b 266-14x-typo cdk-1.4.x
$ grep -ri Contructs src/main | cut -d':' -f1 | \
  uniq | xargs replace Contructs Constructs --
$ git format-patch -1
Patch report is here.

Tuesday, March 15, 2011

Open Standards (Or: "the long list of things that we weren’t required to do")

Hot on the heels of a presentation I gave today in a ToxBank virtual meeting on (Open) Standards (I'll blog the points as soon as possible), I just ran into this great blog post by the makers of the OWL reasoner Pellet (license: GPL), which we use in Bioclipse. It is spot on about what happens if you use Open Standards: "That's notable not only because it took so little time, but also because of the long list of things that we weren’t required to do in order to make it happen". This is not a new message. For example, the Quixote project is doing the same thing in computational chemistry. But it is important for people to understand: Open Standards (quite like Open Source and Open Data) are a game changer: they simplify collaborating, speed up development, and change the way we do things.

Chemometrics with R

I just heard that my supervisor's book Chemometrics with R was released, and I immediately requested our library to get a copy. Ron introduced me to R at a time when most at our department were still using Matlab. In fact, I had been maintaining the Matlab scripts used in the department's chemometrics courses, and wrote my own phylogenetic tree visualization code for Matlab, using a line notation system for the trees that I no longer remember.

This book is recommended to all bioinformaticians too: many former colleagues from my old department are now in fact working in bioinformatics (see also this BioStar answer).

Sunday, March 13, 2011

Unfortunate (CDK) decisions

Sometimes you make decisions which turn out to be unlucky. Christoph, Dan, and I made one about 10 years ago (see this photo from Dan with his whiteboard, where we wrote down our thoughts). The one I am talking about is not bad in itself, but the implementation is; bad in the sense that it makes it easy for other toolkits to make a difference in size and performance (see also Quo vadis, CDK?).

The reason I bring this up is that I am answering a question on the CCL mailing list on how to convert an XYZ file into an adjacency matrix, and my Groovy Cheminformatics did not have a section on graph matrices yet. And I just ran into the issue that the CDK XYZReader only accepts an IChemFile. Then I realized the book doesn't explain yet why that is, either.

It basically comes down to the fact that chemical files can contain a wide variety of chemistry. MDL molfiles can contain a single molecule, but also a set of molecules. Chemical documents in general (think Jmol) can also contain trajectories for cyclohexane (see this nice animation by Bob), etc, etc. And because the CDK is a general chemistry development toolkit, it has to support them all. So, we have the IChemFile concept, which corresponds to a chemical file containing a sequence of one or more models, which may be of a single molecule (e.g. geometry optimizations), or multiple models (e.g. MDL SD files). Similarly, each model can contain a variety of objects: reactions, sets of molecules, and crystals. Well, you get the point.
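That nesting can be sketched with toy classes (my own, not the actual CDK interfaces, just to show the levels of wrapping and the kind of flattening that ChemFileManipulator does):

```java
import java.util.*;

// Toy model of the nesting: a file holds a sequence of models,
// and each model holds one or more molecules (or reactions, crystals, ...).
class ToyMolecule {
    final String name;
    ToyMolecule(String name) { this.name = name; }
}

class ToyChemModel {
    final List<ToyMolecule> molecules = new ArrayList<>();
}

class ToyChemFile {
    final List<ToyChemModel> models = new ArrayList<>();

    // Flatten the whole hierarchy into a plain list of molecules,
    // which is what a manipulator helper does for you.
    List<ToyMolecule> allMolecules() {
        List<ToyMolecule> all = new ArrayList<>();
        for (ToyChemModel model : models) all.addAll(model.molecules);
        return all;
    }
}

public class Hierarchy {
    public static void main(String[] args) {
        // An SD-file-like document: two models, one molecule each.
        ToyChemFile file = new ToyChemFile();
        for (String name : new String[] {"ethanol", "ethanoic acid"}) {
            ToyChemModel model = new ToyChemModel();
            model.molecules.add(new ToyMolecule(name));
            file.models.add(model);
        }
        System.out.println(file.allMolecules().size()); // 2
    }
}
```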

Now, at some point it was decided that being able to convert file formats from one into another might in fact be interesting (think OpenBabel). That's where it went wrong (and if you start git blaming, it would not surprise me if I am the center of all evil here). Just to keep it simple, it made sense to have all IO classes at least support ChemFile, because whatever the content, it could always be wrapped in a ChemFile.

So, the answer to the question on the CCL mailing list is a bit more involved than I wanted:
import org.openscience.cdk.*;
import org.openscience.cdk.config.*;
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.graph.matrix.*;
import org.openscience.cdk.graph.rebond.*;
import org.openscience.cdk.tools.manipulator.*;

reader = new XYZReader(
  new File("data/ethanoicAcid.xyz").newReader() // hypothetical file name
)
chemFile = reader.read(new ChemFile())
allContent = ChemFileManipulator.getAllAtomContainers(chemFile)
ethanoicAcid = allContent.get(0)

// configure the atoms with covalent radii, which the RebondTool needs
factory = AtomTypeFactory.getInstance(
  "org/openscience/cdk/config/data/jmol_atomtypes.txt",
  ethanoicAcid.getBuilder()
)
for (IAtom atom : ethanoicAcid.atoms()) {
  factory.configure(atom)
}
RebondTool rebonder = new RebondTool(2.0, 0.5, 0.5);
rebonder.rebond(ethanoicAcid)

int[][] matrix = AdjacencyMatrix.getMatrix(ethanoicAcid)

println "The adjacency matrix:"
for (row = 0; row < ethanoicAcid.getAtomCount(); row++) {
  for (col = 0; col < ethanoicAcid.getAtomCount(); col++) {
    print matrix[row][col] + " "
  }
  println ""
}
This code gives this output:
The adjacency matrix:
0 1 1 1 1 0 0 0 
1 0 0 0 0 1 1 0 
1 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 
1 0 0 0 0 0 0 0 
0 1 0 0 0 0 0 0 
0 1 0 0 0 0 0 1 
0 0 0 0 0 0 1 0 

Note that the RebondTool didn't pick up the bond order, but that's not needed for the adjacency matrix anyway.
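For the record, the rebonding trick itself is simple: two atoms are considered bonded when their distance is below the sum of their covalent radii plus some tolerance. A rough, self-contained sketch (hard-coded textbook radii, my own approximation, not the CDK implementation):

```java
public class Rebond {

    // Covalent radii in Ångström (rough textbook values).
    static double radius(String element) {
        switch (element) {
            case "H": return 0.31;
            case "C": return 0.76;
            case "O": return 0.66;
            default:  return 0.75;
        }
    }

    // Build an adjacency matrix from 3D coordinates: atoms are bonded
    // when closer than the sum of their covalent radii plus a tolerance.
    public static int[][] adjacency(String[] elements, double[][] coords,
                                    double tolerance) {
        int n = elements.length;
        int[][] matrix = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = coords[i][0] - coords[j][0];
                double dy = coords[i][1] - coords[j][1];
                double dz = coords[i][2] - coords[j][2];
                double dist = Math.sqrt(dx*dx + dy*dy + dz*dz);
                if (dist < radius(elements[i]) + radius(elements[j]) + tolerance) {
                    matrix[i][j] = matrix[j][i] = 1;
                }
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // A C-H pair at bonding distance, and a far-away O.
        String[] elements = {"C", "H", "O"};
        double[][] coords = {{0, 0, 0}, {1.09, 0, 0}, {5, 0, 0}};
        int[][] m = adjacency(elements, coords, 0.4);
        System.out.println(m[0][1] + " " + m[0][2]); // 1 0
    }
}
```

Distance-based rebonding like this can only give connectivity, which is exactly why it says nothing about bond orders.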

Getting back to the unfortunate design decision: what would things have looked like if we had typed our IChemObject reading and writing classes, allowing us to differentiate an IChemObjectReader<IAtomContainer> from an IChemObjectReader<IChemFile>? It would have greatly reduced the size of simple programs, or of something like the wished-for 50k JChemPaint applet. Right now, pulling in a single reader will automatically pull in all IChemObject classes. Rather, I would just pull in IAtomContainer, which in itself is too large too. But that's for another blog post.
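Such typed readers could have looked, very roughly, like this (hypothetical interfaces sketched in plain Java, not what the CDK actually ships; the string-list reader stands in for an atom-container reader):

```java
import java.io.*;
import java.util.*;

// Hypothetical: a reader parameterized on what it produces, so code that
// only needs molecules never has to pull in the full document hierarchy.
interface TypedReader<T> {
    T read(Reader input);
}

class NameListReader implements TypedReader<List<String>> {
    // Stand-in for a molecule reader: one item per line of input.
    public List<String> read(Reader input) {
        try (BufferedReader br = new BufferedReader(input)) {
            List<String> items = new ArrayList<>();
            String line;
            while ((line = br.readLine()) != null) items.add(line);
            return items;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

public class TypedReaderDemo {
    public static void main(String[] args) {
        TypedReader<List<String>> reader = new NameListReader();
        List<String> items = reader.read(new StringReader("methane\nethane"));
        System.out.println(items.size()); // 2
    }
}
```

The type parameter is the whole point: the dependency graph of a program follows from what T it asks for, not from everything any reader might ever produce.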

The key thing to remember is: don't be afraid to add some complexity if that allows you to make other things optional.

New CDK contributors

The new OpenBabel book acknowledges the contributors to the project (deduced from Jörg's thanx). My Groovy Cheminformatics book is about using the CDK, not really about the project itself, but shows many similarities to the OpenBabel book. Edition 1.3.8-0 of my book does not contain a list of contributors to the CDK library. This is being fixed now.

In one go I also updated the AUTHORS list distributed with the source code. The list of individuals who have contributed source code to the project has grown to 64! Here are a few of the new names:
  • Jonathan Alvarsson
  • Saravanaraj N Ayyampalayam
  • Stephan Beisken
  • Dmitry Katsubo
  • Jules Kerssemakers
  • Uli Köhler
  • Scooter Morris
  • Carl Mäsak
  • Peter Odéus
  • Julio Peironcely
  • Syed Asad Rahman
  • Andreas Truszkowski
  • Paul Turner

Friday, March 11, 2011

Pharmaceutical Bioinformatics

The Wikberg group at Uppsala University, where I did my first two years of post-docing in Sweden, has a (free) course on pharmaceutical bioinformatics. We worked hard on a course book, which got printed after a year. It ended up with 400 pages, 15 chapters, and 3 appendices. It also includes examples of how to carry out the discussed topics in Bioclipse:

The chapters are:
  1. Introduction
  2. Background
  3. Drug discovery and development
  4. Bioinformatics, drugs and omics
  5. Representing molecules in computers
  6. Representation of 3D structures
  7. Molecular descriptors
  8. Sequence analysis
  9. Macromolecular descriptors
  10. Design of experiments
  11. Data analysis
  12. Databases
  13. QSAR
  14. Proteochemometrics
  15. Semantic web
  16. Answers

And three appendices:
  A. Basic probability theory
  B. Basic matrix algebra
  C. Introduction to Bioclipse

The book is currently undergoing proof review and is available to course participants only; it should become available to the general public in a couple of months.

Tuesday, March 08, 2011

ToxBank: a data warehouse for (computational) toxicology

Last week I was in sunny Cascais, and in three days experienced −23°C and +18°C. The reason I was there was the kick-off meeting of the EU FP7 cluster SEURAT, which includes 'our' ToxBank project.

We will host many different data types. Don't ask me what this will practically mean, but some keywords we already know include RDF, OpenTox, and ToxML. And I hope to squeeze in my favorite, metabolomics.

And that data warehousing for metabolomics is important was only recently shown by the retraction (via RetractionWatch) of this Nature paper (doi:10.1038/nature03356). The reason was that it critically depended on conclusions from another retracted paper (doi:10.1021/jf021166h), retracted from J. Agric. Food Chem. in 2009.

In this paper, they identified ten chemicals from Arabidopsis: butanoic acid; trans-cinnamic acid; o-coumaric acid; p-coumaric acid; ferulic acid; p-hydroxybenzamide; methyl p-hydroxybenzoate; 3-indolepropanoic acid; syringic acid; and vanillic acid. I hope I have the links to Wikipedia correct, as this was based on names only; the paper does not seem to list InChIs or even SMILESes. The ten chemicals were identified with HPLC and NMR, but no experimental data seems to be given. What NMR data did they base the identification on? I have seen pretty interesting assignments of chemical identity in GC/MS and LC/MS, so I was quite disappointed not to see the gory details here.

But fortunately, I could look at the raw data. Yeah, sure! Dream on.

In fact, it seems the characterizations of the 10 chemicals were challenged, causing the authors to look at their data again. Unfortunately, they could not find the experimental data anymore. The authors write in the retraction:
    We have been unable to find experimental data that document the actual isolation of butanoic acid, trans-cinnamic acid, ...
Now, readers of my blog know I care about raw data (see McPrinciple #1). For example, it was a key feature of our MetWare project. It is not entirely clear to me whether they could no longer find the raw data, or whether they were no longer able to correlate their extracted characteristics with the known NMR spectra for those ten compounds. This only strengthens the importance of NMR databases in metabolite identification, something Christoph would only agree with.

I am not sure we will see the bottom of this, or whether the authors could have prevented this retraction. However, I do believe the paper was flawed in the first place: it did not give enough experimental detail to allow the referees to judge the metabolite identification. The referees failed, as they apparently did not find this aspect important enough to require this data in the paper. And the journal clearly failed, by not having a good editorial requirement in place around availability of data. This is not specific to this retracted paper, nor to this journal. It's pretty much the community standard, despite many calling for years for better standards, e.g. via minimal reporting standards.

Well, maybe journal editors will soon wake up and make the availability of experimental data in papers of this kind (and of any kind, IMHO) a community standard, and a strong one: so strong that referees can reject papers that do not provide this minimal information.

Why? It would have saved a lot of people from doing the wrong thing. The original paper was cited 54 times (according to WoS) and the Nature paper 52 times (up one since the RetractionWatch post). We're bound to see a few more retractions as a result of this, I guess.

So, where I failed to get MetWare going within the Netherlands Metabolomics Center, let's hope ToxBank does better. But given the list of ToxBank partners, I have no doubt about that.

Bais, H., Prithiviraj, B., Jha, A., Ausubel, F., & Vivanco, J. (2005). Mediation of pathogen resistance by exudation of antimicrobials from roots. Nature, 434 (7030), 217-221. DOI: 10.1038/nature03356

Walker, T., Bais, H., Halligan, K., Stermitz, F., & Vivanco, J. (2003). Metabolic Profiling of Root Exudates of Arabidopsis thaliana. Journal of Agricultural and Food Chemistry, 51 (9), 2548-2554. DOI: 10.1021/jf021166h

Java Performance...

Java virtual machines are weird things. They can optimize code on the fly, tuning how they run the code depending on the data it processes. That is why Java can be faster than C and C++, or even faster than Fortran: those are all pre-optimized based on the code, not based on the workload. But, at the same time, performance is tricky, and your best friend is a good profiler (like YourKit, which is free for Open Source use).

Reading about Java performance now and then is useful. It keeps you aware of the things that matter and the things that don't. Via Planet Eclipse, I read a couple of posts from the Java Persistence Performance blog. And today I learned that the JIT compiler can in fact remove the threading support in Hashtable when only one thread is used.
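You can poke at this yourself with a naive single-threaded timing sketch (my own, not from the linked blog; micro-benchmarks like this are notoriously unreliable, so take any numbers with a large grain of salt):

```java
import java.util.*;

public class LockElision {

    // Fill a map and read it back; with only one thread in play, the JIT
    // may elide Hashtable's synchronization entirely after warm-up.
    static long time(Map<Integer, Integer> map, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) map.put(i, i);
        long sum = 0;
        for (int i = 0; i < n; i++) sum += map.get(i);
        if (sum < 0) throw new IllegalStateException(); // keep the loops alive
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 100_000;
        // Warm up both so the JIT has a chance to optimize.
        for (int i = 0; i < 5; i++) {
            time(new Hashtable<>(), n);
            time(new HashMap<>(), n);
        }
        System.out.println("Hashtable: " + time(new Hashtable<>(), n) + " ns");
        System.out.println("HashMap:   " + time(new HashMap<>(), n) + " ns");
    }
}
```

Whether the gap actually closes depends on the JVM version and flags, which is exactly why a profiler beats guessing.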

Monday, March 07, 2011

Bioclipse-CDK-ChEMBL-SPARQL-Chemometrics mashup paper published

Our (peer-reviewed) Bioclipse-CDK-SPARQL-Chemometrics mashup paper was just published online (doi:10.1186/2041-1480-2-S1-S6) as part of the SWAT4LS 2009 supplement. In it, we use the CC-BY-SA-licensed ChEMBL database as example data source, based on a presentation I gave in Amsterdam back then.

This is a Figure from our paper, but the other papers are very interesting too, and are worth checking out!