Monday, November 29, 2010

Adding a new dictionary to Oscar

Say you have your own dictionary of chemical compounds, for example your company's list of yet-unpublished internal research codes. You want to index your local listserv to make it easier for your employees to search for particular chemistry you are working on, perhaps related to something done at other company sites. This is what Oscar is for.

But, it will need to understand things like UK-92,480. This is made possible with the Oscar4 refactorings we are currently working on. You only need to register a dedicated dictionary. Oscar4 has a default dictionary which corresponds to the dictionary used by Oscar3, and a dictionary based on ChEBI (an old version) (see this folder in the source code repository).

Adding a new dictionary is very straightforward: you just implement the IChemNameDict interface. This is, for example, what the OPSIN dictionary looks like:
// imports for the Oscar4 and OPSIN classes are omitted
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class OpsinDictionary
implements IChemNameDict, IInChIProvider {

  private URI uri;

  public OpsinDictionary() throws URISyntaxException {
    this.uri = new URI(
      // the URI string is missing from this listing
    );
  }

  // the URI is somewhat like a namespace
  public URI getURI() {
    return uri;
  }

  // there are no stop words defined in this
  // dictionary
  public boolean hasStopWord(String queryWord) {
    return false;
  }

  // see hasStopWord()
  public Set<String> getStopWords() {
    return Collections.emptySet();
  }

  // it has the name in the dictionary if the name
  // can be converted into an InChI
  public boolean hasName(String queryName) {
    return getInChI(queryName).size() != 0;
  }

  // this dictionary can return InChIs for names
  // so, it implements the IInChIProvider interface
  public Set<String> getInChI(String queryName) {
    try {
      // the OPSIN calls below are reconstructed from a
      // truncated listing and assume the OPSIN API of
      // the time
      NameToStructure nameToStructure =
        NameToStructure.getInstance();
      OpsinResult result = nameToStructure
        .parseChemicalName(
          queryName, false
        );
      if (result.getStatus() ==
          OPSIN_RESULT_STATUS.SUCCESS) {
        Set<String> inchis = new HashSet<String>();
        String inchi = NameToInchi
          .convertResultToInChI(
            result, false
          );
        inchis.add(inchi);
        return inchis;
      }
    } catch (NameToStructureException e) {
      // OPSIN could not be set up: report no InChIs
    }
    return Collections.emptySet();
  }

  public String getInChIforShortestSMILES(
    String queryName) {
    Set<String> inchis = getInChI(queryName);
    if (inchis.size() == 0) return null;
    return inchis.iterator().next();
  }

  // since names are converted on the fly, we do
  // not enumerate them
  public Set<String> getNames(String inchi) {
    return Collections.emptySet();
  }
  public Set<String> getNames() {
    return Collections.emptySet();
  }
  public Set<String> getOrphanNames() {
    return Collections.emptySet();
  }
  public Set getChemRecords() {
    return Collections.emptySet();
  }
  public boolean hasOntologyIdentifier(
    String identifier) {
    // this dictionary does not use ontology
    // identifiers
    return false;
  }
}

Now, you can implement the interface in various ways. You can even have the implementation hook into a SQL database with JDBC, or use something else fancy. The dictionary will be used at various steps of the Oscar4 text analysis workflow.
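
As a minimal sketch of about the simplest implementation one could write, the following keeps internal research codes in an in-memory map. SimpleNameDict is a hypothetical, cut-down stand-in for IChemNameDict (the real interface has more methods, as the OPSIN example shows), and the code ACME-0001 and its InChI are placeholders, not real data.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical, cut-down stand-in for Oscar4's IChemNameDict.
interface SimpleNameDict {
    boolean hasName(String queryName);
    Set<String> getInChI(String queryName);
}

// In-memory dictionary mapping internal research codes to InChIs.
class ResearchCodeDictionary implements SimpleNameDict {

    private final Map<String, String> inchiByCode = new HashMap<>();

    // Registers a code; lookups ignore case and surrounding whitespace.
    public void register(String code, String inchi) {
        inchiByCode.put(normalize(code), inchi);
    }

    @Override
    public boolean hasName(String queryName) {
        return inchiByCode.containsKey(normalize(queryName));
    }

    @Override
    public Set<String> getInChI(String queryName) {
        String inchi = inchiByCode.get(normalize(queryName));
        return inchi == null
            ? Collections.<String>emptySet()
            : Collections.singleton(inchi);
    }

    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }
}

class ResearchCodeDictionaryDemo {
    public static void main(String[] args) {
        ResearchCodeDictionary dict = new ResearchCodeDictionary();
        // Placeholder entry: a made-up code mapped to the propane InChI.
        dict.register("ACME-0001", "InChI=1/C3H8/c1-3-2/h3H2,1-2H3");
        System.out.println(dict.hasName("acme-0001"));
        System.out.println(dict.getInChI("unknown-code").isEmpty());
    }
}
```

A JDBC-backed variant would keep the same two lookup methods and replace the map with a query; the normalization step is a design choice here, since research codes tend to be typed inconsistently in email.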

Mind you, the refactoring is not over yet, and the details may change here and there.

Your comments are most welcome!

Saturday, November 27, 2010

Uppsala Status Report

As you know, my post-doc in Uppsala ended. It was a good time, and it was great collaborating on Bioclipse with Ola, Jonathan, Arvid, and Carl. I would have loved tighter integration with the work of Maris and Martin, but that was limited to one joint paper (in press). I thank Professors Jarl Wikberg and Eva Brittebo for allowing me to continue my research at their department, and hope this is not the end of the collaboration yet.

As with the new year, the end of a contract is a good time to reflect on one's accomplishments. It's been a bit delayed, but as you know, I am already in my next project in Cambridge, and will start in January in a longer-term position in predictive toxicology (more on that soon). This makes this a really crowded period, on top of birthdays, Sinterklaas, x-mas, and the like.

My Research
As you might know, my research interest lies in understanding molecular properties and their applications in larger molecular systems. This can be how small molecules pack in crystals, finding patterns in properties (QSAR-like work), etc. Because the underlying methods are useful in many domains, you see applications in various fields too, including drug discovery, metabolomics, etc. These methods involve statistics and cheminformatics, primarily, which is clear from my publications on method development in chemometrics and cheminformatics. You will also have seen that visualization is a very important tool here, as our numerical validation can easily mislead even a trained scientist.

How Uppsala fits in
About 30 months ago, I got an offer to join the Bioclipse team to work on the cheminformatics features of the workbench. It was already using the CDK, so the project was a tight match with what I did in the past. Additionally, there were plans to integrate R; while that integration is partially implemented, it has unfortunately not been completed by the group yet. I believe this is a crucial aspect, and without it the large-scale impact of Bioclipse will be severely reduced.

Bioclipse is positioned as a workbench to use third-party libraries, web services, databases, etc, and has done so very successfully (doi:10.1186/1471-2105-8-59). It speaks many Open Standards, and already incorporates various important Open Source libraries for life sciences research, including the aforementioned CDK (doi:10.2174/138161206777585274, doi:10.1021/ci025584y), but also Jmol, JChemPaint, BioJava (doi:10.1093/bioinformatics/btn397), and others. Using these libraries it has rich visualization means for life sciences data, including molecules and protein sequences. The latter, of course, is directly related to the proteochemometrics research done in Wikberg's group. Recently, Bioclipse adopted scripting functionality, making it a perfect tool to share life sciences computation, just like Taverna (doi:10.1093/bioinformatics/bth361) and KNIME.

Where I hoped to do some research in proteochemometrics, events led me into different areas, which I explained in Why you have not heard me much about chemometrics recently....

But, Bioclipse provides me with the tools I need to take molecular chemometrics (doi:10.1080/10408340600969601) forward.

So, what has this resulted in, besides a number of unsuccessful grant applications? We're still counting, but two book chapters, a book on pharmaceutical bioinformatics, one proceedings paper, five research papers, seven oral presentations at international meetings, and an ACS conference on RDF in chemistry. Oh, and tons of Open Source code, of course. (I'm at the edge of collapsing; I did that as a student, lost a year, but learned a lot about myself ... these results I have worked very hard for; I am not a miracle worker. And I have to disappoint people occasionally, as things do not work out how I expected them to. My apologies for that.)

I will not describe all in detail now, but focus on a few things around what I made my research in Uppsala: semantic cheminformatics, which I believe to be a key concept of where cheminformatics must be going. The first paper resulted from a collaboration with Johannes, a medical researcher at the Ludwig-Maximilians-Universität in Germany (full reference at the end of the post). This work provides an alternative to SOAP, one with a better solution to asynchronous computing than the polling approaches now commonly used. An XMPP-based service just reports back when it is done, so that you do not have to ask all the time. Makes sense to me. We made the platform available to Bioclipse and Taverna, and demonstrated the technology with applications in life sciences, including (QSAR) descriptor calculation, and susceptibility for seven known HIV protease inhibitors.
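
The difference between polling and being notified can be sketched in plain Java. This is not the XMPP platform from the paper: CompletableFuture stands in for the remote service, and the class and method names are made up for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

class AsyncStyles {

    // Polling style: the client keeps asking "are you done yet?".
    // Returns the number of status requests that were needed.
    static int pollUntilDone(CompletableFuture<String> job, long intervalMillis) {
        int requests = 1;
        while (!job.isDone()) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while polling", e);
            }
            requests++;
        }
        return requests;
    }

    // Notification style: register a handler once; it is called when the
    // result arrives, with no status requests in between.
    static CompletableFuture<Void> notifyWhenDone(
            CompletableFuture<String> job, Consumer<String> handler) {
        return job.thenAccept(handler);
    }

    public static void main(String[] args) {
        // Simulated slow service call (e.g. a descriptor calculation).
        CompletableFuture<String> slowJob = CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(300);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "descriptors calculated";
        });

        CompletableFuture<Void> notified =
            notifyWhenDone(slowJob, result -> System.out.println("notified: " + result));
        int requests = pollUntilDone(slowJob, 50);
        System.out.println("polling needed " + requests + " status requests");
        notified.join(); // make sure the notification was delivered before exiting
    }
}
```

For a job that runs for hours rather than milliseconds, the polling client either wastes many requests or learns late that the result is in; the notified client does neither.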

This work stresses that if we really want to, we can significantly improve scientific computing. It's very much like what Peter concluded this week: "None of this is rocket science - it’s purely a question of will". This is what I have been trying to show in the past few years. The disuse of accurate scientific computing is a deliberate choice. Making your cheminformatics research irreproducible is a choice, and a bad one too. There can be acceptable reasons, but the choice would be bad nevertheless (I hope that distinction is clear: you can have valid reasons to do something intrinsically wrong. You will be forgiven, and be encouraged to change your behavior.) Many people in the Blue Obelisk community are laying out the foundations and showcases, hoping to make it easier for others to change behavior. I think we have been quite successful there.

Anyways... on to the second paper. As said, Bioclipse is the platform that can bring these new cheminformatics methods to the desktop. The new and improved Bioclipse 2 (see citation below) adds one important new feature: scripting. My work in this paper focused on making sure the cheminformatics library was properly integrated, continued development of JChemPaint (yet unpublished, and in collaboration with the EBI, but see for example this blog post), and helping Ola and others to properly use the CDK in their applications (MetaPrint2D (doi:10.1186/1471-2105-11-362), etc.). The impact of this work goes far beyond the papers on which I am author, though not every reviewer will understand that, unfortunately. This work is really the plumbing: it's the development of the measuring machines to do the job, the development of an STM device to actually get going.

The third paper, also listed at the end, is about defining a standard for detailed exchange of QSAR data. It defines what information is needed to reproduce a set of QSAR descriptors, including the input, and using a descriptor ontology which we published about before (doi:10.1021/ci050400b). This project can be the seed of a public repository of QSAR data, where it will be clear what is meant, and how the data can be used. If you are interested in setting up such a public repository, please contact me or Ola.

That leaves me with the work that I have initiated in the group: the use of RDF technologies (I do hope all VR reviewers are listening). RDF provides a lingua franca for data exchange in life sciences, and the meaning of words is provided by sharing dictionaries (ontologies). Bioclipse has been extended to speak RDF, and we developed various applications based on it. A proceedings paper previews the effort, while the paper is in print in the new Open Access Journal of Biomedical Semantics. Of course, you can also read much about this topic in this blog.

RDF is going to change bio- and cheminformatics in ways that XML has been unable to. Various papers are currently in preparation to provide detailed use cases and related research. I am very excited about this technology, which further improves interoperability and reproducibility in cheminformatics. Should you care about that? Yes, because by using these good practices, research will be easier to interpret and conclusions easier to judge, and as such, we can focus on the underlying chemistry in much more detail, instead of looking at noise, which much of the current cheminformatics literature is doing. (Ouch, that's a bold statement indeed. True? Well, without reproducibility it is hard to tell. Let's all work towards less magic, less black box, and more science in this field; we will all benefit from that. Who knows, we might even convince the bench chemist that we are doing something right ;)

So, where is the understanding of underlying patterns, you may wonder? That is a fair question, and I have no trouble admitting that after my PhD that part has been underrepresented. That will change soon enough, though. Now I can only hope it is in time to get me a Nature or Science paper, required to get tenure (see this discussion).

That's not all I did. I have not discussed the book chapters, the book, the other publications to which I contributed in various ways (doi:10.1186/1471-2105-11-159, doi:10.1093/bioinformatics/btq476). That will come in a more detailed report later.

Finally, I would like to thank Uppsala University for the KoF 07 grant which funded my work in Uppsala.

Wagener, J., Spjuth, O., Willighagen, E., & Wikberg, J. (2009). XMPP for cloud computing in bioinformatics supporting discovery and invocation of asynchronous web services. BMC Bioinformatics, 10 (1). DOI: 10.1186/1471-2105-10-279

Spjuth, O., Alvarsson, J., Berg, A., Eklund, M., Kuhn, S., Mäsak, C., Torrance, G., Wagener, J., Willighagen, E., Steinbeck, C., & Wikberg, J. (2009). Bioclipse 2: A scriptable integration platform for the life sciences. BMC Bioinformatics, 10 (1). DOI: 10.1186/1471-2105-10-397

Spjuth, O., Willighagen, E., Guha, R., Eklund, M., & Wikberg, J. (2010). Towards interoperable and reproducible QSAR analyses: Exchange of datasets. Journal of Cheminformatics, 2 (1). DOI: 10.1186/1758-2946-2-5

Monday, November 22, 2010

Installing Oscar

I am about halfway through my Oscar project now. I blogged about the Oscar Java API, the command line utility, and the Taverna plugin (all in development). Also, David Jessop joined in, boosting the refactoring. The meeting with the ChEBI team last week was great too. We worked out details for the use cases, involving Oscar and Lezan's ChemicalTagger.

Here follow some install instructions to get going. Please give things a try, even though we are under heavy development. You can monitor the stability of the code via these Hudson pages for oscar4, oscar4-cli, ChemicalTagger, and others. This continuous building of software should be set up for any scientific code.

To follow the instructions below, you will need a working environment with a tool to process zip files (or equivalent), a Java Development Kit (JDK), and Maven. Having wget makes things easier. Oh, and you need a working internet connection.

The pattern is otherwise the same for all above tools. I will demonstrate the process with oscar4-cli.

Ubuntu and Debian users can use:
$ sudo aptitude install unzip maven2 \
  openjdk-6-jdk wget
Downloading the source
The source code for the above tools is all hosted on BitBucket, using the Mercurial version control system. However, there is no need to worry about that, because BitBucket provides source drops. At this page you can download a .zip file for the command line utilities, which you can unzip with your favorite tool.

If you have wget installed you could do from the command line:
$ wget
$ unzip
which downloads and unzips the latest version in the repository. (This is where monitoring Hudson comes in; there you can check if there are failing unit tests.)

Compiling the code
With the source code locally installed, it is time for Maven to come into action. Again, I have no clue how to do this on Windows (other than with Cygwin, which every Windows user should have installed, if a virtual machine with a full Linux is no option), but Maven should run the 'assembly:assembly' target. Or, from the command line:
$ cd oscar4-cli/
$ mvn assembly:assembly
$ cp target/oscar4-cli-4.0-SNAPSHOT-jar-with-dependencies.jar \\
That should do: the created oscar4-cli-4.0-SNAPSHOT.jar can be used now to run the command line utilities. Actually, it should also do fine for using the Oscar4 Java API.

Provide feedback
We very much welcome feedback of any kind. Feature requests can be posted here, and bug reports here. You can also send email to the oscar3-chem-developers mailing list.

Update: mvn assembly:assembly should be used instead of mvn assembly:single.

More fails... (aka: no VR grant awarded)

I failed to get a VR grant in 2010. The arguments are interesting:

My competence
    The applicant had published 15 papers in mostly low impact journals.

True, the top journals in my field (chemometrics, cheminformatics) do not have very high impact factors, because the field is less eager to add 100+ citations to each journal paper, nor is the field known to be popular enough for Nature, Science, etc.

    Two of these are highly cited.

Indeed. I recently blogged about that. Mind you, 46 citations is not highly cited, even though it exceeds the impact factor of Nature and Science.

    This is quite impressive for such a young scientist (PhD in 2008).

But, of course, that does not matter. It's surely not about impression, right?

    His PhD work (in Netherlands) and his postdoctoral work (in Uppsala) is actually all on the same project ...

This is where the reviewers show some disrespect, I believe. Apparently, they have not taken it as one of the responsibilities to actually check what I have done. My PhD work was partly in the UK (Cambridge), and I have done postdoctoral work in the Netherlands and Germany too.

On the same project? Well, depends on how you look at it. Surely, cheminformatics, QSAR, statistics, etc, is all the same. Same for crystallography, NMR, etc. One big pile of science. Again, I feel the reviewers took their responsibility of reviewing very narrowly.

    ... and with the same collaborators.

Wow... that's impressive, right? And I was always thinking that international collaboration was positive. But apparently not if you have long term, successful collaborations. WTF??

    The role of the applicant in relationship to Prof X and the other developers is not clear.

OK, I should have made it clearer how the other scientists are involved.

    However the backgrounds is definitely adequate for the suggested project but the applicant lack in independency.

(Carefully transcribed.)

This is an interesting point, and nicely outlines how the current academic system works. As a post-doc you are forced to hop around from one funded project to another, hoping to get funding. Until you do, you are working on other PIs' projects with predefined topics.

Project quality
    The main focus of the project is software development of Bioeclips in collaboration with X and others.

No, if you read the proposal, the project is about statistical method development, and Bioclipse (not Bioeclips) is used as platform to make it look like Excel so that the average scientist understands it. That distinction is difficult, even for scholars.

    The application is mainly about managing errors in observations and processing, annotation and propagation of these.

Indeed! Well copied from the proposal's abstract.

    Expected outcome is identification of processing errors, and potentials for improvement in the data handling.

The reviewers got it almost right. I have not written up clearly enough that the main improvement is finding the source of the error, which we all know is the (biological) experiment and the average scholar's inadequacy at data handling (think Excel).

    Although the development might be of real importance the application does not show a significant scientific component, neither from a computational not from a life science perspective.

This quite puzzles me, as we had very strongly written all over this proposal: metabolomics, metabolomics, metabolomics! With applications including metabolite identification, with experimental partners.

Have they actually read the proposal?

Project quality

    The background and competence of the applicant should ensure success.

So, why not fund me? Read on...

    The applicant requires compliance of users and buy-in from scientific community.

I guess my work is not cited enough to show that my work is actually used. Anyone using the CDK here?

    Although some indication that this will happen is provided...

I guess this refers to the international collaborations I listed in the proposal.

    ... this is not ensured

Therefore, rejected. Scores: bra (2 out of 5) and låg (2/5).

Saturday, November 20, 2010

Why you have not heard me much about chemometrics recently...

A casual reader may not know the background of the title of my blog. A bit over five years ago, when I started this blog, I defined chemblaics:
    Chemblaics (pronounced chem-bla-ics) is the science that uses computers to address and possibly solve problems in the area of chemistry, biochemistry and related fields. The general denomiter seems to be molecules, but I might be wrong there. The big difference between chemblaics and areas as cheminformatics, chemoinformatics, chemometrics, proteochemometrics, etc, is that chemblaic only uses open source software, making experimental results reproducable and validatable. And this is a big difference with how research in these areas is now often done.

Later, I also identified molecular chemometrics (doi:10.1080/10408340600969601) when I reviewed important innovation in the field, which has, IMHO, a strong overlap with chemblaics. Any reader of my blog will understand that I see semantic technologies play a very important role here, as Open Standards for communicating Open Data between the bench chemist and the data analyst, using Open Source that allows others to reproduce, validate, and extend that work. Some identified the possibilities the internet brings over 10 years ago, while the use of semantic computing goes back even further.

And what have the big publishers done? Nothing much yet. Not the old, not the new. There are projects ongoing, and there is a tendency (BioMed Central starts delivering Open Data, the Beilstein Institute has been producing RDF for a few years now, the Royal Society of Chemistry has Project Prospect), but most publishers are too late, and not investing enough with respect to their yearly turnover. This is particularly clear if you realize that citizen and hobby scientists can innovate publishing more effectively than those projects (really!). Anyway, I do not want to talk about publishing now, but it is just so relevant: I am not a publisher, but publications are the primary source of knowledge (not implying at all that I think that is the best way; it is not).

Instead, I am a data analyst, a chemometrician, a statistician, a cheminformatics dude, or a (pharmaceutical) bioinformatician, depending on your field of expertise. Really, I am a chemblaics guy: I apply and develop informatics and statistics methods to understand chemistry (and biology) better.

During my PhD it became painfully clear that current science is horribly failing, in many ways:
  • Firstly, we are hiring the wrong people because we care more about a co-authored pipetting paper in Nature, than ground-breaking work in the J. Chem. Inf. Mod. (what journal?! Exactly my point!).
  • Secondly, we have our brightest scientists (the full time professors, assuming that some have been hired for the right reasons) spend most of their time on administrative work (like proposal writing, administrating big national/EU projects).
  • Thirdly, we spend millions (in whatever currency) on large projects which end in useless political discussions instead of getting science done.
  • Finally, all knowledge from laborious, hard work is placed in PDF hamburgers and lost to society (unless you spend multibucks to extract it again).

There are likely several more, but these four are the most important to me right now.

So, in the years since finishing my PhD research, which was on data mining and modeling molecular data, I have spent much of my time on improving methods in chem- and bioinformatics to handle data. Hardly anyone else was doing it (several Blue Obelisk community un-members as prominent exceptions), but someone has to.

Why have I been doing this? Well, without good, curated data sources it is impossible to decide why my (or others') predictive models are not working (as well as we want them to). Is this relevant? I would say so, yes! The field is riddled with irreproducible studies of which one has no clue how useful they are. Trust the authors who wrote the paper? No, thank you, I would rather verify: I am a scientist and not a cleric. Weirdly, one would have expected this to be the default in cheminformatics, where most stuff is electronic and reproducing results should be cheap. Well, another fail for science, I guess.

So, that explains why I have not recently done so much in chemometrics. Will I return? Surely! Right about now. There is already a paper in press where we link the semantic web to (proteo)chemometrics, and more is to follow soon.

One example, interestingly, is pKa prediction, which has seen quite a few publications recently, yet experimental pKa data is not available as Open Data. Why?? Let me know if you have any clue. Yet, pKa prediction seems to be important to drug discovery, as it gets an awful lot of attention (400+ papers in the past 10 years, of which 50+ in 2010!). But this is about to change. Samuel and I are finishing a project that greatly simplifies knowledge aggregation and curation, as input to statistical modeling. We now have the tools ready to do this fast and efficiently. Right now, I am entering curated data at a speed of about 3 chemical structures a minute. That means, given I need a break now and then, that I can create a data set of reasonable size in a few days. By crowd-sourcing this, a small community can liberate data from literature in a few days.
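
A back-of-the-envelope check of that estimate, as a small sketch: the 3 structures per minute is the figure from the paragraph above, while the five effective hours per day and the three days are assumptions picked only for illustration.

```java
class CurationEstimate {

    // Structures curated at a steady rate over a number of working days.
    static long structuresCurated(int perMinute, double hoursPerDay, int days) {
        return Math.round(perMinute * 60 * hoursPerDay * days);
    }

    public static void main(String[] args) {
        int perMinute = 3;        // rate quoted in the text
        double hoursPerDay = 5.0; // assumption: effective curation time per day
        int days = 3;             // assumption: "a few days"
        System.out.println(structuresCurated(perMinute, hoursPerDay, days)
            + " structures by one curator");
    }
}
```

Under these assumptions a single curator reaches 2,700 structures, and the number scales linearly with every extra volunteer.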

This will have a huge impact on the cheminformatics and QSAR communities. They will no longer have any excuse for not making their data available. There is no argument anymore that curation is expensive. This will also have a huge impact on cheminformatics and chemical data vendors. Where Open Source only had moderate impact so far (several software vendors have already joined the Open Source cheminformatics community), this will force them to rethink their business model. Where they could hide behind curation when it came to text mining initiatives (like Oscar, on which I am currently working), with cheap, expert curation knowledge building at hand, they will be forced to rethink their added value.

The impact on the CDK should be clear too. We no longer depend on published models for ALogP, XLogP, pKa, etc, predictions. Within a year, you can expect the CDK project to release the tools to train your own models, and make choices suitable for your user base. For example, you can make more precise models around the structures your lab works on, or more generic models with large screening projects. Importantly, the community will provide an Open Data knowledge base to start from. Using our Open Standards, you can plug in your own confidential data and make mixed, targeted models.

Is this possible with the cheminformatics of the past 30 years? No, and that's the reason why I have been away from chemometrics for a while.

Thursday, November 18, 2010

Oscar4 command line utilities

One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin, and command line utilities. The Oscar4 Java API has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.

Most people will mainly be interested in the full Oscar4 program, to extract chemical entities. Oscar3 was also capable of extracting data (like NMR spectra), but that has not been ported yet. The OscarCLI program takes input, extracts chemicals, and where possible resolves them into connection tables (viz. InChI).

To extract chemicals from a line of text (e.g. "This is propane."), you do:
$ java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  This is propane.
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3
For larger chunks of text it is easier to route it via stdin, for which we can use the -stdin option:
$ echo "This is propane." | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3

That way, we can easily process large plain text files (output omitted):
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \

If you prefer RDF output, for further integration, use the -output text/turtle option:
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -output text/turtle

This returns RDF using the CHEMINF ontology like:
@prefix dc:  .
@prefix rdfs:  .
@prefix ex:  .
@prefix cheminf:  .
@prefix sio: .

  rdfs:subClassOf cheminf:CHEMINF_000000 ;
  dc:label "propane" ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    sio:SIO_000300 "InChI=1/C3H8/c1-3-2/h3H2,1-2H3"
  ] .

We can, however, also use Jericho to extract text from HTML pages, made available with the -html option, and pulling in a Beilstein Journal of Organic Chemistry paper with wget:
$ wget -qO- | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -html

This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.
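
Counts like these can be derived from the plain-text output. The sketch below assumes that the output consists of lines of the form name: InChI, as in the propane example earlier, with one line per recognized entity; the class and method names are made up and not part of Oscar4.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OscarOutputSummary {

    // Turns "name: InChI=..." lines into (name, InChI) pairs and
    // skips every line that does not have that shape.
    static List<Map.Entry<String, String>> parse(List<String> lines) {
        List<Map.Entry<String, String>> hits = new ArrayList<>();
        for (String line : lines) {
            int split = line.indexOf(": InChI=");
            if (split < 0) continue;
            String name = line.substring(0, split).trim();
            String inchi = line.substring(split + 2).trim();
            hits.add(new AbstractMap.SimpleEntry<>(name, inchi));
        }
        return hits;
    }

    // Different names can resolve to the same structure, so count distinct InChIs.
    static int countUniqueStructures(List<Map.Entry<String, String>> hits) {
        Set<String> inchis = new HashSet<>();
        for (Map.Entry<String, String> hit : hits) {
            inchis.add(hit.getValue());
        }
        return inchis.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        List<Map.Entry<String, String>> hits = parse(lines);
        System.out.println(hits.size() + " chemical entities, "
            + countUniqueStructures(hits) + " unique structures");
    }
}
```

To use it, pipe the output of one of the commands above into java OscarOutputSummary.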

Wednesday, November 03, 2010

TTT it is

The fact that Piet Hein said it, gives it an extra (fourth) dimension. Things Take Time.

But, as Peter indicated, we are getting there in cheminformatics. We see the commercial entities experimenting and contributing to Open Source cheminformatics, and the Blue Obelisk reached critical mass a few years ago. Next year is the year of Open Source cheminformatics on the desktop ;)

We are not there yet, and Piet Hein makes a truly wise choice here. My association with Piet Hein is Piet Hein the Dutch sailor, not Piet Hein the Danish scientist and author who gave us the quote. Semantic chemistry is one of those areas where Open Source cheminformatics is doing really well.

Things do take time, but after more than 15 years of Open Source chemistry, it is time to harvest. Peter has a few crops in mind.

Meanwhile... finding a nice, permanent academic position educating students in modern cheminformatics... TTT :(