Saturday, October 30, 2010

CiteULike CiTO Use Case #1: Wordles

Last month I reported a few things I missed in CiteULike. One of them was support for CiTO (see doi:10.1186/2041-1480-1-S1-S6), a great Citation Typing Ontology.

I promised the CiTO author, David, my use cases, but have been horribly busy in the past few weeks with my new position, wrapping up my past position, and thinking about my position after Cambridge. But finally, here it is. Based on source code I wrote and released earlier, the first use case I present is the Wordle one, which I showed with manual work in February.

Now that all the data is semantically marked up in CiteULike, I can easily extract all paper titles (or whatever is available in CiteULike) for all papers that cite the first CDK paper (doi:10.1021/ci025584y). Using the JSON interface, I have this Groovy script to extract all titles:
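import groovyx.net.http.HTTPBuilder
import groovyx.net.http.Method
import static groovyx.net.http.ContentType.JSON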

culUrl = "";

citotags = [
// there are more, but these are all
// I use right now

papers = [

http = new HTTPBuilder(culUrl)

papers.each { paper ->
  println "# Processing $paper..."
  citotags.each { tag ->
    citation = "$tag--$paper".toLowerCase()
    http.request(Method.valueOf("GET"), JSON) {
      uri.path = "/json/user/egonw/tag/$citation"

      response.success = { resp, json ->
        json.each { article ->
          // count the tags of this article that carry the CiTO prefix
          tripleCount = 0;
          article.tags.each { artTag ->
            if (artTag.startsWith(tag)) tripleCount++
          }
          if (tripleCount > 0) {
            // strip the BibTeX-style curly braces from the title
            title = article.title
            title = title.replaceAll("\\{","")
            title = title.replaceAll("\\}","")
            println "$title"
          }
        }
      }
    }
  }
}

The output is two blocks which I can easily copy/paste into Wordle. I think I heard one can actually download the Java code, so I am tempted to integrate it later, but for now copy/paste will do fine, as the data handling is now mostly automated: with a few extra lines I can make such visualizations for any paper I annotated in CiteULike with CiTO.
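The tag matching and title clean-up done by the script can be tried without network access. The following is only a minimal sketch in plain Java, not the script itself: the class name, the Article helper, the tag "cites" and the paper key "paper-a" are made-up placeholders, and the real JSON from CiteULike is replaced by two hand-written articles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CitoTitleSketch {

    // Hypothetical stand-in for one article entry of the JSON response.
    static final class Article {
        final String title;
        final List<String> tags;

        Article(String title, List<String> tags) {
            this.title = title;
            this.tags = tags;
        }
    }

    // Builds the lower-cased "<tag>--<paper>" tag that the script queries for.
    static String citationTag(String tag, String paper) {
        return (tag + "--" + paper).toLowerCase();
    }

    // Removes the curly braces, like the two replaceAll() calls in the script.
    static String cleanTitle(String title) {
        return title.replace("{", "").replace("}", "");
    }

    // Keeps the title of every article that has at least one tag starting with the given prefix.
    static List<String> matchingTitles(String tag, List<Article> articles) {
        List<String> titles = new ArrayList<>();
        for (Article article : articles) {
            int tripleCount = 0;
            for (String artTag : article.tags) {
                if (artTag.startsWith(tag)) tripleCount++;
            }
            if (tripleCount > 0) titles.add(cleanTitle(article.title));
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up data; the tag and paper names are assumptions for illustration only.
        List<Article> articles = Arrays.asList(
            new Article("The {CDK}: a made-up title", Arrays.asList("cites--paper-a", "cdk")),
            new Article("An unrelated made-up title", Arrays.asList("groovy")));

        System.out.println(citationTag("Cites", "Paper-A"));
        for (String title : matchingTitles("cites", articles)) {
            System.out.println(title);
        }
    }
}
```

Only the first of the two made-up articles carries a tag with the prefix, so only its title, without the braces, is printed.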

The CDK I paper

The CDK II paper

Interesting differences... more statistics will soon follow. See Further statistics on the papers citing the CDK for the kind of analyses I have in mind.

Calculating molecular descriptors with OpenTox

While working during office hours on Oscar, I am also trying to finish up some work left from Uppsala. One such thing is the Bioclipse-OpenTox project (see Using Bioclipse to upload data to an OpenTox server and Oxford, August 2010: eCheminfo Predictive ADME & Toxicology 2010 Workshop). Today I finished calculating molecular descriptor values with OpenTox servers:
// requires an unspecified Bioclipse
// development version

service =
serviceSPARQL =

stringMat = opentox.listDescriptors(serviceSPARQL);

// pick any descriptor
descriptor = stringMat.get(1,1);

  service, descriptor,

The first descriptor happens to be a model for predicting the pKa (see Algorithm or Model: OpenTox API quiz).

Uninformative Blue Obelisk eXchange notification

This email informing me about a newly earned badge does not seem very
informative. No link, no information on the actual badge earned :(

Posted via email from Egon's posterous

Algorithm or Model: OpenTox API quiz

Nina (who still does not seem to blog) wrote up this interesting question, triggered by the OpenTox API ontology:
    Given: A publication, describing specific method of property prediction (not a generic machine learning algorithm). An implementation of this publication.
    Example: pKa. This is a decision tree with SMARTS in the nodes. There is a training set, which could be used in validation.
    - Should it be exposed by OpenTox services as ot:Algorithm or ot:Model ?
    - What is the right way to use / extend Blue Obelisk descriptors dictionary to describe this implementation?
    - Would you classify this method as a descriptor calculation or as a predictive model?

I would say an ot:Model is an ot:Algorithm, just a complex one.

The question shows one of the virtues of ontologies: they require us to carefully think about what we say. It is almost as if they put the scholar back into science.

On a different note, can we please start making an Open Data pKa database?!?

Thursday, October 28, 2010

Oscar4 Java API: chemical name dictionaries

Besides getting Oscar used by ChEBI (hopefully via Taverna), my main task in my three-month Oscar project is to refactor things to make it more modular, and to remove some features that are no longer needed (e.g. an automatically created workspace environment). Clearly, I need to define a lot of new unit tests to ensure my assumptions on how the code works are valid.

So, what are the API requirements set out? These include (but are not limited to):
  • have reasonable defaults
  • being able to add custom dictionaries
  • easily change the chemical entity recogniser
  • plugin text normalization (see Peter's post on UNICODE)

This week I worked on the dictionary refactoring, and talked with Lezan about the ChemicalTagger and trying to get this based on the newer Oscar code (I think we'll be able to finish that today). So, I cleaned up some code I did in the first week, and introduced an Oscar class providing a Java API to the Oscar functionality.

So, to get started with Oscar in your application, you only need to do:
Oscar oscar = new Oscar(
Map<NamedEntity,String> structures =
    "Ingredients: acetic acid, water."
The ClassLoader is needed because the Oscar class will not generally know how to load custom classes.

You can add additional dictionaries by implementing the IChemNameDict interface and one or more of IInChIProvider, ISMILESProvider, and ICMLProvider. For example, adding the OPSIN dictionary would extend the above code to:

Oscar oscar = new Oscar(
  new OpsinDictionary()
Map<NamedEntity,String> structures =
    "Ingredients: acetic acid, water."

And, I think the oscar.getChemNameDict() method will be renamed to something like oscar.getDictionaryRegistry() really soon.

Wednesday, October 27, 2010

CDK 1.2 to 1.4 API changes #2: implicit hydrogens

A second API change lies deep in the IAtom interface. To reflect the meaning of the method more accurately, IAtomType.getHydrogenCount() has been renamed to IAtomType.getImplicitHydrogenCount(), and likewise the setter method.

CDK 1.2 code
CDK 1.4 code
Yeah, that's a simple one. Just to make clear, in both versions the count reflected the number of implicit hydrogens. The name getHydrogenCount(), however, suggested that it returned the number of all hydrogens attached to that atom, that is, the sum of implicit and explicit hydrogens.
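The difference between the two counts can be made concrete with a toy model. This is a minimal sketch in plain Java and does not use the CDK: ToyAtom and its methods are hypothetical names, chosen only to show that the implicit count and the total count are different numbers as soon as some hydrogens are stored explicitly.

```java
import java.util.Arrays;
import java.util.List;

class HydrogenCountSketch {

    // Hypothetical stand-in for an atom; this is not the CDK IAtom interface.
    static final class ToyAtom {
        final String symbol;
        final int implicitHydrogenCount;    // hydrogens that are not stored as atoms
        final List<String> neighborSymbols; // element symbols of explicitly stored neighbours

        ToyAtom(String symbol, int implicitHydrogenCount, List<String> neighborSymbols) {
            this.symbol = symbol;
            this.implicitHydrogenCount = implicitHydrogenCount;
            this.neighborSymbols = neighborSymbols;
        }

        // Counts the hydrogens that are present as explicit neighbours.
        int explicitHydrogenCount() {
            int count = 0;
            for (String neighbor : neighborSymbols) {
                if ("H".equals(neighbor)) count++;
            }
            return count;
        }

        // The number the old method name seemed to promise: implicit plus explicit.
        int totalHydrogenCount() {
            return implicitHydrogenCount + explicitHydrogenCount();
        }
    }

    public static void main(String[] args) {
        // A methanol-like carbon: two implicit hydrogens, one explicit hydrogen, one oxygen.
        ToyAtom carbon = new ToyAtom("C", 2, Arrays.asList("H", "O"));
        System.out.println("implicit: " + carbon.implicitHydrogenCount);
        System.out.println("explicit: " + carbon.explicitHydrogenCount());
        System.out.println("total:    " + carbon.totalHydrogenCount());
    }
}
```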

CDK 1.2 to 1.4 API changes #1: creating objects with an IChemObjectBuilder

Later this year (planned) a new stable branch of the CDK library will be released. Time to look at some API changes, to ease migration. In this first post of the series, I will show how the IChemObjectBuilder functionality has changed.

CDK 1.2 code
IChemObjectBuilder builder =
IMolecule molecule = builder.newMolecule();
CDK 1.4 code
IChemObjectBuilder builder =
IMolecule molecule = builder.newInstance(
  builder.newInstance(IAtom.class, "C")

Now, please note that the builder.newInstance() method may actually return null. This is not the case for the DefaultChemObjectBuilder or the NoNotificationChemObjectBuilder, but future releases may have dedicated builders that do behave that way. However, these builders are not supposed to be used for building molecules anyway.

The general pattern of newInstance() calls is that the first argument is the interface for which you want an instance. All further parameters are passed as parameters for the object's constructor. The builder maps the input to appropriate class constructors. To know what parameters you can pass when instantiating an IAtom with the DefaultChemObjectBuilder, you would look at the constructors of Atom. Therefore, we can also call:
IAtom atom = builder.newInstance(
  IAtom.class, "C", new Point2d(0,0)
);
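How a builder can map an interface plus parameters onto a constructor is easiest to see in a small example. The following is a minimal sketch in plain Java using reflection, and not the CDK implementation: ToyBuilder, its register() method, and the Atom and SimpleAtom types are hypothetical names introduced for illustration. It also shows how a builder could return null for an interface it does not support.

```java
import java.lang.reflect.Constructor;
import java.util.HashMap;
import java.util.Map;

class NewInstanceSketch {

    // Hypothetical interface and implementation; stand-ins, not CDK classes.
    interface Atom {
        String symbol();
    }

    static final class SimpleAtom implements Atom {
        private final String symbol;

        public SimpleAtom(String symbol) {
            this.symbol = symbol;
        }

        public String symbol() {
            return symbol;
        }
    }

    // Toy builder: looks up the implementation class and picks a constructor by parameter types.
    static final class ToyBuilder {
        private final Map<Class<?>, Class<?>> implementations = new HashMap<>();

        <T> void register(Class<T> iface, Class<? extends T> implementation) {
            implementations.put(iface, implementation);
        }

        <T> T newInstance(Class<T> iface, Object... params) {
            Class<?> implementation = implementations.get(iface);
            if (implementation == null) return null; // interface not supported by this builder
            for (Constructor<?> constructor : implementation.getConstructors()) {
                Class<?>[] types = constructor.getParameterTypes();
                if (types.length != params.length) continue;
                boolean matches = true;
                for (int i = 0; i < types.length; i++) {
                    // Simplification: null and primitive parameters never match here.
                    if (!types[i].isInstance(params[i])) matches = false;
                }
                if (!matches) continue;
                try {
                    return iface.cast(constructor.newInstance(params));
                } catch (ReflectiveOperationException exception) {
                    throw new IllegalStateException(exception);
                }
            }
            throw new IllegalArgumentException("No matching constructor for " + iface.getName());
        }
    }

    public static void main(String[] args) {
        ToyBuilder builder = new ToyBuilder();
        builder.register(Atom.class, SimpleAtom.class);

        Atom atom = builder.newInstance(Atom.class, "N");
        System.out.println(atom.symbol());

        Runnable unsupported = builder.newInstance(Runnable.class);
        System.out.println(unsupported == null);
    }
}
```

The caller only ever names the interface, so the implementation class can be swapped by registering a different one with the builder.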

Tuesday, October 26, 2010

Multiple unit test inheritance with JExample

Two months ago I wrote about JExample (see Specifying unit test dependencies with JExample). At the time, my examples did not include multiple unit test inheritance, but I was informed later by @jexample that this is possible. I just got time to try it in the Oscar project:
@Test public Oscar testConstructor()
throws URISyntaxException
  Oscar oscar = new Oscar();
  return oscar;

public String testNormalize(Oscar oscar)
throws URISyntaxException
  String input = oscar.normalize(
    "This is a simple input string with benzene."
  return input;

public List testTokenize(
  Oscar oscar, String input) throws Exception
  List tokens = oscar.tokenize(input);
  Assert.assertNotSame(0, tokens.size());
  return tokens;

public List testRecognizeNamedEntities(
  Oscar oscar, List tokens)
throws Exception
  List entities =
  Assert.assertEquals(1, entities.size());
  return entities;

The pattern here is that each test method returns one variable, so that any method depending on two other unit tests will have two parameters. The order of the parameters is defined by the order in which the tests are listed in the @Given clause.
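The idea of feeding return values into dependent methods, in the order in which the dependencies are listed, can be emulated in a few lines. This is a minimal sketch in plain Java with reflection and does not use JExample: the DependsOn annotation, the run() helper and the Examples class are hypothetical, and the real framework does considerably more than this.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GivenSketch {

    // Hypothetical annotation for this sketch; it is not the JExample @Given.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface DependsOn {
        String[] value();
    }

    static class Examples {
        public String input() {
            return "a string with benzene";
        }

        @DependsOn({"input"})
        public List<String> tokens(String input) {
            return Arrays.asList(input.split(" "));
        }

        // Two dependencies, so two parameters, in the order listed in the annotation.
        @DependsOn({"input", "tokens"})
        public String summary(String input, List<String> tokens) {
            return tokens.size() + " tokens in: " + input;
        }
    }

    // Runs the named method after running the methods it depends on, reusing earlier results.
    static Object run(Object suite, String name, Map<String, Object> results) {
        if (results.containsKey(name)) return results.get(name);
        for (Method method : suite.getClass().getMethods()) {
            if (!method.getName().equals(name)) continue;
            DependsOn dependsOn = method.getAnnotation(DependsOn.class);
            String[] producers = dependsOn == null ? new String[0] : dependsOn.value();
            Object[] arguments = new Object[producers.length];
            for (int i = 0; i < producers.length; i++) {
                arguments[i] = run(suite, producers[i], results);
            }
            try {
                Object result = method.invoke(suite, arguments);
                results.put(name, result);
                return result;
            } catch (ReflectiveOperationException exception) {
                throw new IllegalStateException(exception);
            }
        }
        throw new IllegalArgumentException("No such method: " + name);
    }

    public static void main(String[] args) {
        Object summary = run(new Examples(), "summary", new HashMap<String, Object>());
        System.out.println(summary);
    }
}
```

Here summary() receives the results of input() and tokens() as its first and second parameter, because that is the order in which they are named in the annotation.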

Really free chemistry books #2: @bookboon

About a year ago I wrote about free chemistry books. No, not illegally copied books, but really free books (though not necessarily Open). Actually, the books I discussed last year are out of copyright, as they are old. But just today I ran into an advertisement for free books by bookboon (twitter:bookboon), and these are new books. I have not browsed them yet, but you can download textbook PDFs from these sections: biology, chemistry, chemical engineering, math, nanotechnology, and several more. I do not know about the quality of these books yet, and the author names do not immediately ring a bell. Does anyone have experience with these books?

Saturday, October 23, 2010

The Answer to: Are these organic molecules the same?

Ten days ago I asked my readers if two molecules were the same or not. I guessed they were not, when I was asked Are these organic molecules the same? The people who replied to my post were quite convinced they were, and Peter gave the context of the pub quiz: assumptions may not be correct.

Indeed, I assumed there were hydrogens missing (implicit), and that line corners indicate places where carbons are. But the key to this problem was that I also assumed that the E/Z stereochemistry for the two double bonds was properly defined. Or, more accurately, I assumed that because I was comparing the two molecules, the E/Z stereochemistry for the double bond between the rings was identical in both drawings. We all did.

Under that assumption, these two molecules are indeed not the same. However, if the E/Z stereochemistry is actually not the same for that double bond, ... well, you get the point. Perhaps this was not the best of examples, as it is quite conventional to use 2D coordinates to determine E/Z stereochemistry... we even have a special drawing style to indicate the E/Z stereochemistry is unknown. Then again, how often does the organic chemist really use that?

A more convincing example was also drawn in the pub, and I should have given that one. Peter posted those later. These involve spiro compounds. Here too, I assumed that the stereochemistry around the spiro carbon was identical. My bad. There was one person in the pub who spotted the problem: David Jessop.

The underlying issue, of course, is those stupid 2D drawings. Jmol has been around for more than 10 years now (and non-free tools too), and we still use 2D drawings... why, oh why? 3D coordinates and explicit hydrogens, that is what our molecular data should be represented with. Henry does this right, over and over again, in his brilliant blog. Well, most of the time anyway. Look for the 'Click for 3D' statements behind the figures, and just give it a try, e.g. in this post on I(CN)7.

BTW, a clear example of McPrinciple #2.

Friday, October 22, 2010

Cb: New Blogs #14

Just a few new blogs since #13 in July:
If you know good chemistry blogs, please contact the author and ask them to email me for inclusion.

Thursday, October 21, 2010

Oscar text mining in Taverna

One of the goals of my project in Cambridge is to make Oscar available as a Taverna plugin (source code, Hudson build). I have progressed somewhat, but am still struggling with getting the update site working. The plugin actually installs into Taverna 2.2.0, but the activities do not show up. While this is work in progress, and the other project goal is refactoring, a current demo workflow looks like:

Example input would be: This is a list of ethanol, methanol, and 2,4,6-trinitrotoluene.

The plain text input can be linked to the pdf2text SADI service, and the CML is suitable for the CDK-Taverna plugin, which is currently being updated by Andreas, Achim, and Christoph for Taverna 2.2. As soon as the update site is properly working, I will upload a demo workflow to

I guess the next activity (node in the workflow) will be around the dictionaries, as the OPSIN activity converts only IUPAC names into connection tables. I was told OPSIN parses 97% of the IUPAC names it finds, and when it does, it does so almost 100% correctly. Want to challenge the code? Use this web service.

Saturday, October 16, 2010

Royce Murray and Caveat Emptor

Derek's blog pointed me to an editorial by Royce Murray, Science Blogs and Caveat Emptor (doi:10.1021/ac102628p). He is warning us, science scholars, about blogs. He is accusing bloggers of not being scholarly, not checking facts, etc.

He did himself and the journal a big disservice with this editorial: in his blog he does precisely what he is accusing the blogger of: failing to check facts. Even worse, particularly for the 'Analytical Chemistry' journal, he proved inadequate at analyzing the problem, putting his scholarly skills in question: he failed to see what 'blogging' is and what it is not, and he failed to ascribe his concerns to the proper source; effectively, he failed to see the difference between correlation and cause-effect for 'blogging' (unworthy of any scholar, particularly if you start complaining). I invite Royce to blog his full analysis of the problem, with proper underlying data, facts, etc., so that I (and others) can explain to him the true factors involved in the problem he is noticing.

The editorial is a sad piece, and an editorial unworthy of the journal.

Actually, the fact that he mentions the Impact Factor is amusing. It must be noted that his editorial will have a huge impact, not because the writing is any good, but because it is utterly wrong. And that reflects just one thing that is wrong with impact factors.

I strongly suggest that Royce check his facts before he starts writing. The ethics expressed in the editorial seem to apply only to other scholars.

If you wonder about my strong language, it was triggered by these words from the editorial: In the above light, I believe that the current phenomenon of “bloggers” should be of serious concern to scientists. I consider myself a blogger, not unreasonably given the fact that I blog, and feel personally attacked. Hence the title of this post: Royce Murray and Caveat Emptor.
Murray R (2010). Science Blogs and Caveat Emptor. Analytical chemistry PMID: 20939598

Friday, October 15, 2010

Working on Oscar for three months

As Peter announced in his blog, and I tweeted earlier, I have started as a postdoctoral research associate in Peter's group at the University of Cambridge, to work on Oscar, a chemical text mining tool, for the next three months. My tasks will focus on programmatic plumbing instead of method development, and I am aiming at integration with CDK-Taverna (see doi:10.1186/1471-2105-11-159), which is currently being ported to Taverna 2.2 by Andreas. Sam and Lezan have been working on the refactoring as well, and will help me out with the gory details of the current code.

The source code of Oscar4 is available from this BitBucket project, and you can monitor the code state on this Hudson page. The project I will be working on is in collaboration with the ChEBI project, and today we met up with various people in the group, and set out some really interesting use cases.

Wednesday, October 13, 2010

Are these organic molecules the same?

Cambridge pubs are not just good for the (Danish) beer, but also for the pub quizzes. Peter asked if the below molecules are the same. I did not think so, but... what do you think? He also asked if they are chiral. We have until tomorrow 20:00 BST.

Saturday, October 09, 2010

Mapping BioStar users onto the world map

Neil is my new bioinformatics hero! (Sorry Pierre :)

BioStar is a Q&A website for bioinformatics, just like the Blue Obelisk eXchange. Neil and Pierre have an ongoing struggle to gain the most karma, requiring Pierre to put in a formal complaint against people posting questions when he is asleep (the whole 3 hours). So, I coined the idea of mapping all BioStar users on a Google Map. Neil picked it up, and combined his coding skills with the various Open APIs, Open Standards, and Open Source solutions, to come up only hours later with this map. Here are the BioStar users from my region:

Now, who will be my new cheminformatics hero, and make a map for the Blue Obelisk eXchange? ;)

Friday, October 08, 2010

CDK Book in progress

Very much overdue, but still in progress, is my book on CDK programming. I am in love with the writing environment, a mix of make, Groovy and LaTeX, where the code snippets are written in Groovy and embedded into LaTeX (see CDK - The Documentation). The Groovy script is actually run by the build system, allowing me to embed the output too.

In the LaTeX source code I, therefore, have something like:
    The list of supported hybridization types can be listed with:
    listing these types:
referring to a Groovy script that looks like:
    #import org.openscience.cdk.interfaces.*;
    IAtomType.Hybridization.each {
      println it
    }
Actually, the above is preprocessed to give the LaTeX view as well as the actual Groovy script run.

Since last year, I have pimped the output a bit, and the above now looks like:

The 100th CDK 2003 paper

We got a winner! Crabtree just published the paper An Open-Source Java Platform for Automated Reaction Mapping (doi:10.1021/ci100061d), which is, according to Web of Science, the 100th paper to cite the CDK 2003 paper (doi:10.1021/ci025584y)!

The paper uses the rendering functionality of the CDK. The authors write:
    The viewer application uses code from the Chemistry Development Kit(31) (CDK) to display graphical representations of the compounds involved in the reactions. The CDK source code was altered to enable the color-coded display of the bonds that were broken or formed during the reaction as shown in Figure 7. In addition, we created a “transition state” molecule that shows the transitory combination of reactant molecules that occurs at a potential energy maximum. The CDK source code was also modified to support the display of the transition state.

The source code is, as the title promises, open source, and available from the armsrc project on SourceForge.

Sunday, October 03, 2010

RDFa support in CiteULike

Last month I posted A list of things I miss in CiteULike, and the developers have not been sitting still. Just one of the things they have started to adopt is RDFa in the CiteULike web interface. Of course, you will not see this in the browser, unless you use something like the RDFaDev plugin in Firefox:

Cheers to the CiteULike developers team!