Saturday, June 26, 2010

Program for the RDF symposium at the American Chemical Society fall meeting

It is my great pleasure to present the full symposium program for the RDF session at the American Chemical Society at the Boston meeting in August:

I am excited that on Monday afternoon Eric Prud'hommeaux will present the work of the LODD working group to the chemistry community. The symposium contains three half day sessions with topics on computing, ontologies, and applications, all chemistry oriented. The goal of the meeting is to get together people using RDF technologies in chemistry, and the list of talks from around the world shows that this goal has been reached. The program is diverse and exciting, and I am very much looking forward to meeting all participants to discuss challenges and cool solutions.

People interested in joining, can sign up to the meeting mailing list. Besides that the webpage is in XHTML+RDFa, the source is also available on GitHub (well, you really download the source code anyway), allowing people to happily fork, make changes, and perhaps make the page as triple-dense as is possible.

I am also keen to have mastered some jQuery skills, and the Abstract links on the webpage use jQuery to show and hide them.

Critical mass for Open Notebook Science wikis by prepopulation with RDF data

One of the problems with Open Data, - Source, and Standards (ODOSOS) is to get critical mass: for the project to move forward, you need both a good user community and an active developers community. The first is the crucial reward for the latter: people using your work is the incentive. Sometimes, a project does not succeed in doing that; Rich was questioning if his ChemPedia project built up enough mass.

This also applies to wikis. WikiPedia clearly has enough critical mass, but is struggling with scaling, which is another problem you can encounter. Building a dedicated, domain specific wiki allows you to do things the way you like. For example, the can be used to build up the knowledge base around a specific problem. For example, HIV drugs (see HIVdb as an example knowledge base). The wiki could contain information you need for your data analysis. The developers team will likely be small, but the data will be highly curated. You are basically both the developer and the user community. Hopefully, in good Open Notebook Science manner, things will grow, and others will join in on building the knowledge base.

Semantic MediaWiki bridges the human readable wiki content, with machine readable RDF content. This wiki does require some additional annotation, but at the same time makes important facts machine readable. The Bioclipse Wiki is using it, though we are not yet taking advantage of the RDF exposure.

Now, returning to the critical mass aspect we started with. Samuel, who did a project with me on reasoning with Prolog in Bioclipse, is now doing a Google Summer of Code project (is there any significance in the fact that three Bioclipse developers are student of have been mentor in the GSoC? :) within the Semantic MediaWiki project. In particular, one of the things he was just working on, is the import of RDF into a wiki.

Oh goodies! That is pretty cool, don't you think? Just pull together RDF from your field of interest, drop it on the wiki, and you have your initial wiki the experts can continue to curate, add facts, etc. He already told me about his plans to add a SPARQL end point for the Semantic MediaWiki software.

Monday, June 21, 2010

All CrystalEye data available as PDDL

I do not think I have seen this earlier, but I do not visit the
homepage regularly. But congratulations to the CrystalEye team for
releasing the data explicitly as PDDL! I cannot stress enough how useful it is if you add a statement like this to your public domain data. Well done!

Sunday, June 20, 2010

Looking at your statistical models...

I do not think I have ever blogged the paper that played an important role in my thesis (doi:10.1021/ci990038z); research of one of the papers in my thesis, started with the hypothesis proposed therein. The paper had a really good idea; but, unfortunately, it did not contain the data to support the hypothesis. That gets me to one important lesson I learned: a QSAR data set of less than 100 molecules is not enough to make untargeted statistical models.

The paper reads quite nicely, and the results are clear: by combining spectral types, the RMSEP goes down. Good! Lower prediction errors; that's what we all want. So, a M.Sc. student of mine set off, but after about half a year, he was still unable to make statistically good models. He used bootstrapping to 'prove' it was not his fault: there was not enough data for the method to learn the underlying patters. Hence my above lesson. My student went on with larger data sets, and laid out the foundation of what later became the paper on using NMR spectra in QSPR modeling (doi:10.1021/ci050282s). Now you understand why QSAR is missing.

So, if those results are so clear, then why does it not work? As said, the data set was too small for pattern recognition methods to see what was going on. The RMSEP numbers just came out nicely; however, if we had only made the below plot, if would have warned. But I failed to do that at the time. Lesson learned: do not just look at the data, but also look at the model. And look really means looking with your eyes at graphical representations of that model. The plot:

The numbers in this plot are hidden in tables in the paper. The RMSEP values earlier mentioned are calculated from those. From the plot, you can see that the test data consisted of 5 compounds; the training set contained 37 compounds; all are congenerics, and do not span a high diversity. Now, the plot shows five models: black is COMFA; orange is based on experimental IR spectra; red, green, and blue are models where two types of representations are combined. From the RMSEP values it can be seen that combining representation improves the RMSEP values. That's what you want, and sort of makes sense.

Now, I did not make this plot until I started writing up the paper, and tried to figure out why the QSAR data set did not work. My eyes opened wide when I saw the orange dots! Anti-correlation! WTF?!?! I mean, we are looking at a plot visualizing the predicted versus the experimental activity... Actually, the others are not really convincing either, are they? Looking at the predictions for compounds with experimental values around 1.0-1.5 (if you really want to know the unit, read the article), the pattern is pretty much anti-correlated too. Thinking about it, it seems the RMSEP is mainly reflecting the error of the left most compound, the one with a experimental activity of about 0.3.

Clearly, the orange model is hopeless, but the others are not really better. Now, the paper actually makes statements comparing the various combinations of representations, but, in retrospect and looking at this plot, I wonder if the green model is really different from the blue or red models.

Since then, I always make these kind of plots, just to see what my model is like. Since then, I distrust papers that only show RMSEP, Q2, or other quality statistics. Now, the tricky part is, you need those statistics if you want to automate model selection; the variance on those model quality statistics is actually so high (see also my other post today), that you must carefully validate that model selection too, visually of course.

I have been long thinking what to do with these observations. I did not dare publish them in my thesis; I did not dare write a letter to the editor. Perhaps I should. But even writing up this blog makes me feel uncomfortable. Besides the fact that I might be wrong, I also do not like to point out mistakes (IMHO); particularly, when those are published in a respectable journal. I was fooled by the statistics too (and was already well trained), so I cannot comment on the authors overlooking the issue. Or the reviewers! Or the community at large. Also, I do not know what the fate of this paper should be. The idea is quite interesting, even though the published results do not support it. Not shown here, but the bootstrapping results show that the apparent slight improvement is merely a numerical artifact, just happening by chance, based on luckily selecting the test compounds; the data is just insufficient in size to draw any conclusion.

Comparative Spectra Analysis (CoSA): Spectra as Three-Dimensional Molecular Descriptors for the Prediction of Biological Activities Journal of Chemical Information and Modeling, 1999, 39 (5), 861-867 DOI: 10.1021/ci990038z
Willighagen, E., Denissen, H., Wehrens, R., & Buydens, L. (2006). On the Use of 1H and 13C 1D NMR Spectra as QSPR Descriptors Journal of Chemical Information and Modeling, 46 (2), 487-494 DOI: 10.1021/ci050282s

QSPR modeling with signatures

I had to dig deep to find posts on QSAR modeling. There are quite a few on QSAR in Bioclipse, but that focuses on the descriptor calculation. In a quick scan, I could only spot two modeling posts:
Given the prominent place QSAR has in my thesis, this is somewhat surprising. Anyway, here is some more QSAR modeling talk.

Gilleain implemented the signature descriptors developed by Faulon et al (see doi:10.1021/ci020345w; I mentioned the paper in 2006), and the CDK patch is currently being reviewed. With some transformations, the atomic signatures for a molecule can be transformed into a fixed-length numerical representation: [70:1, 54:1, 23:1, 22:1, 9:9, 45:2]. This string means that atomic signature 70 occurs once in this molecule and signature 9 occurs nine times. At this moment, I am not yet concerned about the actual signature, but just checking how well these signature can be used in QSPR modeling.

Rajarshi's fingerprint code provides a good template to parse this into a X matrix in R:

For my test case, I have used the boiling point data I used in my thesis paper On the Use of 1H and 13C 1D NMR Spectra as QSPR Descriptors (see doi:10.1021/ci050282s). Some of this data is actually available from ChemSpider, but I do not think I ever uploaded the boiling point data. This constitutes a data set with 277 molecules, and my paper provides some reference model quality statistics; that way, I have something to compare against. Moreover, I can use my previous scripts to do the PLS modeling (there are many tutorials online, but you can always buy an expensive book like the one shown on the right, if you really have to), (10-fold) cross-validation (CV), and 5 repeats of random sampling.

I strongly suggest people interested in statistical modeling to read this interesting post from Noel: whatever test set sampling method you use, you must do some repeats to learn about the sensitivity of your modeling approach to changes in the data set. Depending on the actual sampling approach, you might see different sizes of variance, but until you measure it, you will not know. For my application, these are the numbers:

# source("pls.R")
Read 277 items
   rank=66     LV=42     R2=0.987     Q2=0.921  RMSEP=31.37
   rank=66     LV=42     R2=0.983     Q2=0.924  RMSEP=12.405
   rank=65     LV=42     R2=0.985     Q2=0.949  RMSEP=38.503
   rank=63     LV=42     R2=0.983     Q2=0.948  RMSEP=36.981
   rank=65     LV=42     R2=0.986     Q2=0.923  RMSEP=21.49
   rank=64     LV=42     R2=0.983     Q2=0.91  RMSEP=17.759
   rank=64     LV=42     R2=0.983     Q2=0.921  RMSEP=17.062
   rank=66     LV=42     R2=0.986     Q2=0.94  RMSEP=40.311
   rank=66     LV=42     R2=0.982     Q2=0.927  RMSEP=13
   rank=68     LV=42     R2=0.986     Q2=0.929  RMSEP=16.23

I know 42 is the answer to the universe, but 42 latent variables (LVs)?!? Well, it's just a start. A more accurate number of LVs seems to be around 15, but my script had to make the transition from the old pls.pcr package to the newer pls package. And I have yet to discover how I can get the new package to return me the lowest number of LVs for which the CV statistic is no longer significantly different from the best (see my paper how that works). Actually, I have set the maximum LVs to consider to 1/5th of the number of objects (which is about the accepted ratio in the QSAR community); otherwise, it would have happily increased.

However, the five repeats nicely show the variance in the quality statistics, R2, Q2, and root mean square error of prediction (RMSEP). From the numbers, a model with Q2 = 0.94 is not better than one with Q2 = 0.93 (and I have seen the variance quite some larger). Bottom line: just measure that variability, and put it in the publication, will you??

Anyway, what we all have been waiting for: the prediction results visualized (in black the CV predictions; in red the test set predictions):

Well, there is still much work to do, and you can expect the result to get better. Part of statistical modeling is to find the source of variance, and I have yet to explore a few of them. For example, what are the effects of:
  • creating signature from the hydrogen-depleted graph
  • effect of tautomerism (see this special issue)
  • effect of the height of the signature
And there are so many other things I like to do. But this will do for now.

Thursday, June 17, 2010

My OpenTox Workshop contribution: The Lost Slides

During the nice presentations at the recent OpenTox Workshop, I noted that My OpenTox Workshop contribution: Linking explicit and implicit knowledge was lacking two slides. The slides should have had screenshot of the excellent Bioclipse applications Ola Spjuth has written in the area of computational toxicology. But here they are.

Metabolic Fate
The metabolic fate of molecules can be predicted with the MetaPrint2D feature (see also the main MetaPrint2D page). The feature currently uses one specific method, but the immediate feedback you get while drawing molecules, can basically use any model. The currently used model is developed in a collaboration of Ola with Sam Adams (whom we all know for his JNI-InChI library) at Cambridge University and Lars Carlsson at Astra-Zeneca.

This screenshots shows the visual feedback in the (new) JChemPaint editor in Bioclipse, and on the right we see one interesting Bioclipse feature in action: the cheat sheets. These cheat sheets are inline help documentation which guides the reader through the functionality. But, unlike mere help, cheat sheets can be very interactive and perform some tasks itself, making it easier for the reader to see what was supposed to happen.

L. Carlsson, O. Spjuth, S.E. Adams, R.C. Glen, S. Boyer, Use of Historic Metabolic Biotransformation Data as a Means of Anticipating Metabolic Sites Using MetaPrint2D and Bioclipse, accepted in BMC Bioinformatics.

Structural Alerts, etc
Structural alerts are one method to signal the scientist that the molecule under study needs some more attention (see for example doi:10.1016/j.mrrev.2008.05.003). It helps him decide to continue to look at that particular structure, or to move on. Ola also developed a decision support plugin.

One of the cool features is that, in good Bioclipse habits, deliver a pluggable architecture. This practically means, that anyone can add their own decision rules; those can be added as local software, or as services on a central, institute specific server. The results in this screenshot show that a Bursi AMES data-based model estimates that this molecule is mutagenetic.

O. Spjuth, L. Carlsson, M. Eklund, E.A. Helgee, S. Boyer, Integrated decision support for assessing chemical liabilities, In preparation

Tuesday, June 15, 2010

I’m just a sucker with no self esteem

Don't let anyone fool you: The Offspring is really just talking about Science:

When she’s saying that she wants only me
Then I wonder why she sleeps with my friends
When she’s saying that I’m like a disease
Then I wonder how much more I can spend
Well I guess I should speak up for myself
But I really think it’s better this way
The more you suffer
The more it shows you really care
Right? yeah yeah yeah

Really! Just read the whole text:

Late at night she knocks on my door
Drunk again and looking to score

Now I know I should say no
But that’s kind of hard when she’s ready to go
I may be dumb
But I’m not a dweeb
I’m just a sucker with no self esteem

Thursday, June 10, 2010

CDK-JChemPaint #6: rendering atom numbers

I have made a few new CDK-JChemPaint patches in the past two days, the latest being patch 15. With the help from Gilleain, all rendering parameters are now using the new API, as explained earlier in this series.

Additionally, the API to work with rendering parameters is now much simpler. The previous posts did not really explain how to tune parameters, so here goes. One important thing to realize, is that a rendering parameter can only be changed if the generator that defines it has been registered. To see what parameters belongs to what generator, for which you can use the script discussed in post #3.

Atom Numbers
In some situations you like to draw atom numbers. This can be done by replacing the BasicAtomGenerator by an AtomNumberGenerator in the script given in post #1:
List generators = new ArrayList();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new AtomNumberGenerator());

This would result in an image like this:

Now, we also might want to give those numbers a color, to make them stand out a bit. Orange, perhaps :) This is where rendering parameters come in. To the code from post #1, after the instantiation of the renderer, we add:
// tune parameters
model = renderer.getRenderer2DModel();
The output then looks like:

Atom Numbers and Symbols
But you can also render both element symbols and numbers. Then, clearly, you just add both generators:
List generators = new ArrayList();
generators.add(new BasicSceneGenerator());
generators.add(new BasicAtomGenerator());
generators.add(new BasicBondGenerator());
generators.add(new AtomNumberGenerator());
But, in order to have the label and the symbol not overlap, we define an offset (Thanx to Sebastian, of MetFrag fame, for the feature request!):
  new javax.vecmath.Vector2d(10,10)
Then it gets to look like:
This last full example will be available from GitHub shortly.

Update Steffen asked in the comments if it is possible to just color the atoms by element type. CDK-JChemPaint patch 15 does not allow that, but adding that feature is easy enough, and the patch will be part of the next release. Use this configuration:
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new AtomNumberGenerator());
And these parameter settings:
model = renderer.getRenderer2DModel();
This give you for triazole: