Saturday, August 29, 2009

Reminder: my talk in Frankfurt on Monday; Want to meet up?

Quick and short reminder about my Open Knowledge: Reproducibility in Cheminformatics with Open Data, Open Source and Open Standards talk on Monday. The session is great anyway, with other talks from Cameron, John and someone from Berlin on an Open Access HTS system (which reminds me to talk about Open Access and how the term is tainted).

My program is still free, other than that I want to see Google Wave in action (and while I have received my invitation, I have not received a login account yet). There is a potentially interesting talk about Second Generation Small Molecule Therapeutics at 15:00. But I have no other plans for the afternoon and/or evening.

Perhaps you would like to talk about CDK, Bioclipse and/or the Blue Obelisk movement, or about my talk on Open Data, Open Standards and Open Source (ODOSOS) in chemoinformatics.

If you happen to be around the Frankfurt Westend campus (the conference is in building 4, I think, the Hörsaalzentrum), please let me know if you would like to meet up. I hope to be online :), but no promises on that... it should work at a university location, no? Let's see... This is how to ping me, and don't worry about redundancy.

Email: egon.willighagen at gmail dot com
IRC: #cdk at
Twitter: egonwillighagen
Identica: chemblaics
Blog: just leave a reply to this message

Saturday, August 22, 2009

PLoS ONE and Chemical blogspace: About no Impact yet

The journal landscape in chemistry is pretty well fixed. JACS and Angewandte Chemie are clear leaders, with Nature and Science for work that will attract many scientists. Beyond those, many smaller journals exist that are dedicated to particular research areas.

PLoS ONE is a new journal that changes the way science is published: it publishes anything that is scientifically sound and does not make any judgement on impact and lets the community deal with that. Cameron Neylon recently had him taped to discuss article-level metrics used at PLoS ONE (see also this).

And PONE (as they affectionately call it) seems to be steadily growing to become, at the least, a BIG publisher. Clearly, not dedicating yourself to a small discipline helps. And the IT we have had around for the past 10 years makes this large-scale publishing possible. The impact of a paper becomes clear through those article-level metrics.

Finding interesting papers, however, may be a bit more difficult. There are dedicated RSS feeds listed at the front page:

And I recently subscribed to the Chemistry feed (RSS).

One of the sources taken into account for the article-level metrics is, and you may be aware that Chemical blogspace is using the same software. However, us ~60 active have not been paying attention to this PONE feed. Then again, only 84 papers have appeared in this subsection so far:

... but only one has been cited in Chemical blogspace, which is a bit disappointing:

So, what are your reasons for not reading this journal yet?

I have spotted one paper which I will soon read and review: How Large Is the Metabolome? A Critical Analysis of Data Exchange Practices in Chemistry (doi:10.1371/journal.pone.0005440).

Friday, August 21, 2009

Bioclipse and SPARQL end points #2: MyExperiment

RDF and SPARQL are two really useful Open Standards. Bioclipse-RDF is a plugin for Bioclipse that provides RDF functionality, including the use of remote SPARQL end points.

The MyExperiment team has set up an excellent RDF front end. For example, this is my MyExperiment account in RDF. The storage gets updated once every day (at this moment), but I'm sure that will become more frequent in the future. The SPARQL end point allows us to make any query against the database that their ontologies support. The above query turned up 132 workflows when I ran it today.

Now, so far I have been using Gist to share Bioclipse scripts and I wrote some Bioclipse GUI elements for downloading such gists. To annotate these gists, Delicious has been used, and a listing of Bioclipse scripts can be found under the tags bioclipse and gist.

MyExperiment also allows sharing workflows, but originally only for Taverna. A recent change, however, made it possible to share other types of workflows too. And MyExperiment itself also allows all the annotation we may want to do.

Now, using the Bioclipse-RDF functionality, I can query the MyExperiment database and use that information to do stuff. If this stuff is a Bioclipse script, then I can just download it, as the download link of a workflow is part of the RDF too, as we will see.

Querying a SPARQL end point
As we have seen in the first article of this series, the RDF manager has a method to query a remote SPARQL end point. The complexity is mostly in formulating the SPARQL (and this one happens to be available as a workflow on MyExperiment too):

This is worsened by the fact that JavaScript does not have multi-line strings, so the backslashes at the end of the lines are JavaScript syntax and not part of the SPARQL. To keep things simple, I will show only the SPARQL below, without the Bioclipse script wrapping used in the above code snippet.
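To illustrate the backslash point in isolation, here is a minimal sketch in plain JavaScript. The query text is a made-up placeholder, not the query from this post, and the array-join variant is just one possible alternative for avoiding the backslashes.

```javascript
// Placeholder query for illustration only.
// A backslash directly before the line break continues the string literal;
// neither the backslash nor the line break ends up in the string.
var continued = "SELECT ?s \
WHERE { ?s ?p ?o } \
LIMIT 5";

// Alternative without backslashes: join the pieces of an array.
var joined = [
  "SELECT ?s ",
  "WHERE { ?s ?p ?o } ",
  "LIMIT 5"
].join("");
```

Both variables hold the same single-line query, so either form can be handed to the RDF manager.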

List all Taverna 2 workflows
Listing all Taverna 2 workflows, as shown in that earlier snippet, is done with the SPARQL:

This query asks for a ?workflow and its ?title. The workflow ?type must be of class ContentType as defined in the mebase namespace, and we also ask for the ?typetitle of that content type, because we filter on it with a regular expression that requires it to contain "Taverna 2". If you cannot follow this, just google for SPARQL and work through one of the tutorials that are abundant on the web.
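A query of the shape described above could be put together along these lines. This is a sketch built from the prose description only: the mebase prefix URI and the mebase:has-content-type property are assumptions and should be checked against the MyExperiment ontology, and contentTypeQuery is a hypothetical helper name.

```javascript
// Sketch only: the mebase URI and the has-content-type property are
// assumptions about the MyExperiment ontology.
// Note: typePattern is pasted into the query without any escaping.
function contentTypeQuery(typePattern) {
  return [
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>",
    "PREFIX dcterms: <http://purl.org/dc/terms/>",
    "PREFIX mebase: <http://rdf.myexperiment.org/ontologies/base/>",
    "SELECT ?workflow ?title WHERE {",
    "  ?workflow mebase:has-content-type ?type ;",
    "            dcterms:title ?title .",
    "  ?type rdf:type mebase:ContentType ;",
    "        dcterms:title ?typetitle .",
    "  FILTER regex(?typetitle, '" + typePattern + "')",
    "}"
  ].join("\n");
}

var taverna2Query = contentTypeQuery("Taverna 2");
```

Joining an array with newlines sidesteps the backslash issue mentioned earlier and keeps the SPARQL readable inside the script.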

Finding tags used to annotate workflows
To list all tags which likely have to do with metabolomics, I can do:

And I can also list all workflows that are tagged like this. Because I could not get string matching to work, I used the tag's URI instead:

All MyExperiment Users in Sweden
I was also interested in all MyExperiment Users in Sweden, and again, a simple SPARQL tells me where they live:

Finding Duncan and Pierre
It is very easy to find users, such as Duncan:

Or Pierre, who has not listed where he lives:

My workflows
Given a user, it is also easy to get the workflows he owns. Again, I am using my URI instead of combining with a search for my account, because the MyExperiment SPARQL end point is not particularly fast:

Earlier in this series:
  1. Bioclipse and SPARQL end points #1: DBPedia

Monday, August 17, 2009

The Social Web does not wait for Bioclipse... here comes Google Wave

Google Wave is going to change the web. It's the end of Google Docs, and likely many other services. It's going to be Open Source and being a Wave Provider will not be restricted to Google. This will be enough to make this a success. If you haven't watched the full video demo yet, please have a look yourself:

I left some thoughts and notes on FriendFeed:

Bioclipse enters the social web

The Open Notebook Science Solubility project in particular is keen on sharing results using the Social Web. Last week I reported about the plugin I wrote to access the data on FriendFeed:

When someone asked last week on the Taverna mailing list about a Twitter node, I was certainly interested. Though this can hardly be called core research, I, fortunately, had to test the new Bioclipse SDK :)

So, I hacked up a Twitter plugin for Bioclipse in no time using JTwitter (license:LGPL), to allow sending tweets to my Twitter account (but not yet my account):

Or, as copy/pastable script:

And you can see it really hit Twitter here and in this screenshot of my Choqok client:

Sunday, August 16, 2009

Bioclipse and SPARQL end points

Last week, there was a very interesting thread on the DBPedia mailing list, on using Java for doing remote SPARQL queries. This was one of the features still missing in bioclipse.rdf. Richard Cyganiak replied pointing to the code in Jena which conveniently does this and which bioclipse.rdf is already using anyway. Next, Fred Durao even gave a full code example, relieving me from any further research, resulting in sparqlRemote() now being implemented in the rdf manager:
> rdf.sparqlRemote(
"select distinct ?Concept where{[] a ?Concept } LIMIT 10"
[[], [],
[], [],
[], [],
[], [],
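Under the hood, a remote SPARQL query is, per the SPARQL protocol, an HTTP request that carries the query text in a query parameter. A minimal sketch of building such a request URL follows; sparqlGetUrl is a hypothetical helper, and the end point address is an assumption used for illustration only.

```javascript
// Build the GET URL that the SPARQL protocol defines for a query request:
// the query text goes, URL-encoded, into the "query" parameter.
function sparqlGetUrl(endpoint, query) {
  return endpoint + "?query=" + encodeURIComponent(query);
}

// The end point address below is an assumption for illustration.
var conceptQuery = "select distinct ?Concept where {[] a ?Concept} LIMIT 10";
var requestUrl = sparqlGetUrl("http://dbpedia.org/sparql", conceptQuery);
```

The sketch only builds the URL; sending the request and parsing the result set is what the Jena code behind sparqlRemote() takes care of.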
I reported earlier two example SPARQL queries for chemistry, which can now be rewritten as Bioclipse scripts:


Thursday, August 13, 2009

Making Bioclipse Development easier: the New Manager Wizard

Today, Jonathan, Carl, Arvid and I made writing managers for Bioclipse a bit easier. Eclipse plug-in development is in itself already tricky to learn, and the use of Spring by the Bioclipse managers is not helping. And because two new people will start writing a new manager rather soon, we thought it was time to lower the activation barrier a bit.

The basic file structure of a Bioclipse manager looks like:
|   `-- spring
|       `-- context.xml
|-- plugin.xml
|-- .classpath
|-- .project
`-- src
    `-- net
        `-- bioclipse
            `-- foo
                `-- business
That is twelve files which need to be just right. I used to copy/paste from an earlier (simple) manager.

But we know and understand that setting up this framework is even more challenging if you have not done this at least 10 times before. So, today we implemented a New Wizard (source available from this Git repository: bioclipse.sdk).

It just asks you for a project name:

and a few other settings:

Installing the Bioclipse SDK
Installing this new plugin is fairly easy, and we have set up an Update Site at Just add this as Update site in Eclipse 3.4.x (which is still required for Bioclipse2). It depends on the JDT and PDE, which you will likely already have installed, as they are part of the default Eclipse RCP release.

Go to the Software Updates in the Help menu:

and pick Add Site.... Enter the aforementioned update site as shown here:

Then, select the Bioclipse plugin:

After you hit Install and Eclipse installs the few tens of kBs of the plugin, the plugin should show up in your installation, like it did in mine:

Implementation Details

Writing the plugin was a challenge to me, and I am happy we were doing this in a hackathon. The Bioclipse-QSAR project already had a New Project wizard, but not for a new Plug-in Project. Some things are just slightly different then. For example, it turned out that creating a .classpath cannot be done in the regular way (it never showed up), and I had to dig up some internal code of the PDE. Actually, our current implementation is still using a few internal classes because of this:
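// Three classpath entries: the JRE, the plug-in dependencies container, and src/
IClasspathEntry[] entries = new IClasspathEntry[3];
// No specific execution environment is requested
String executionEnvironment = null;
// ClasspathComputer is one of the internal, non-exported PDE classes
entries[0] = ClasspathComputer.createJREEntry(executionEnvironment);
entries[1] = ClasspathComputer.createContainerEntry();
// The source folder, relative to the new project
IPath path = project.getProject().getFullPath().append("src/");
entries[2] = JavaCore.newSourceEntry(path);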
Ideas are most welcome on how to clean up this code, and not make it use internal, non-exported classes. For the Java source files and even the MANIFEST.MF we are using templates, though I have seen this file being created programmatically too.

I'm sure we'll run into some needed plumbing here and there, but that's what update sites are for, no? Release soon, release often is an Open Source concept that works well in the Eclipse world.

"LAST CALL: XEP-0244 (IO Data)"

Today I received this email, a milestone for the XMPP (aka Jabber) work Johannes, Ola and I have been doing on a SOAP alternative that uses the intrinsically asynchronous XMPP as transport protocol, instead of HTTP as SOAP commonly does (see Next generation asynchronous webservices):
    This message constitutes notice of a Last Call for comments on XEP-0244 (IO Data).

    Abstract: This specification defines an XMPP protocol extension for handling the input to and output from a remote entity.


    This Last Call begins today and shall end at the close of business on 2009-09-01.

    Please consider the following questions during this Last Call and send your feedback to the standards @ discussion list:

    1. Is this specification needed to fill gaps in the XMPP protocol stack or to clarify an existing protocol?

    2. Does the specification solve the problem stated in the introduction and requirements?

    3. Do you plan to implement this specification in your code? If not, why not?

    4. Do you have any security concerns related to this specification?

    5. Is the specification accurate and clearly written?

    Your feedback is appreciated!
There remains quite a lot to do, and you are more than welcome to join the project. There is a Java library and we've integrated the specs into Bioclipse and Taverna (see Details behind the "Calling XMPP cloud services from Taverna2"), but there is no support for BioCatalogue yet, and there are no libraries for other programming languages yet.

Friday, August 07, 2009

Searching PubChem from within Bioclipse

For the application note which we are about to submit, I was working on improving the PubChem Bioclipse API a bit, resulting in new download methods:

The search allows using PubChem Filters, which provide many simple means to restrict the search results. For example, we can search molecules and restrict on the molecular weight:
lists ="malaria 300:500[MW]"))
Other filters you can use (provided by PubChem itself) include, with examples:
  • [el]:"Au[el]")
  • [inchi]:"\"InChI=1S/CH4/h1H4\"[inchi]")
  • [inchikey]:"VNWKTOKETHGBQD-UHFFFAOYSA-N[inchikey]")
  • [mimass]:"375.9785:375.9786[mimass]")
And many, many more... see the linked Filters page.
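All of the filter examples above share the same value[field] shape, so the query strings are easy to put together in a script. The helpers below are hypothetical and not part of the Bioclipse PubChem API; they only build the strings that are then passed to the search.

```javascript
// Hypothetical helpers (not part of the Bioclipse PubChem API) that only
// build query strings in the "value[field]" form shown above.
function fieldFilter(value, field) {
  return value + "[" + field + "]";
}

// A range is written as "min:max" in front of the field name.
function rangeFilter(min, max, field) {
  return fieldFilter(min + ":" + max, field);
}

var weightQuery = "malaria " + rangeFilter(300, 500, "MW");
var elementQuery = fieldFilter("Au", "el");
```

Keeping the field names in one place like this makes it harder to mistype a bracketed filter by hand.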

Now, you surely want to look at the hits, for which we use the molecular table editor:
list ="375.9785:375.9786[mimass]"))
cdk.saveSDFile("/Virtual/hits.sdf", list)"/Virtual/hits.sdf")
Resulting in:

Thursday, August 06, 2009

ChemSpider fail #1: SMILES

Cheminformatics is difficult, I know. But I thought I used a simple SMILES when I typed C1CNCC1, but ChemSpider got it wrong :) The correct structure should be pyrrolidine, not pyrrole. I always mix up those names, so I defaulted to ChemSpider to give me the correct name, which ChemSpider knows and where it also has the SMILES correct... there just seems to be something wrong with their search dialog:

There has been some talk about a ChemSpider Bugzilla, but I don't think this has materialized yet, and I'll have to default to info-at-chemspider-dot-com ...
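The difference between the two structures is visible in the SMILES notation itself: aliphatic atoms are written in uppercase, aromatic atoms in lowercase. The sketch below is a rough heuristic that only looks at that notation; hasAromaticAtom is a hypothetical helper, and a pyrrole written in Kekulé form (with explicit double bonds and uppercase atoms) would not be detected by it.

```javascript
// Heuristic sketch: does a SMILES string contain an aromatic atom symbol?
// Aromatic atoms are written in lowercase (c, n, o, ...), aliphatic ones
// in uppercase (C, N, O, ...).
function hasAromaticAtom(smiles) {
  // Bracket atoms such as [nH]: optional isotope digits, then a symbol
  // that starts with a lowercase letter.
  var aromaticInBrackets = /\[\d*(?:as|[bcnops])/.test(smiles);
  // Outside brackets only the organic subset occurs; after dropping the
  // bracket atoms, any of b, c, n, o, p, s marks an aromatic atom
  // (the second letters of Cl and Br are not in that set).
  var outsideBrackets = smiles.replace(/\[[^\]]*\]/g, "");
  return aromaticInBrackets || /[bcnops]/.test(outsideBrackets);
}

var pyrrolidine = "C1CNCC1";  // saturated five-membered ring, as typed above
var pyrrole = "c1cc[nH]c1";   // aromatic five-membered ring
```

Calling hasAromaticAtom on the two strings separates them, which is the distinction the search dialog appears to lose.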

Wednesday, August 05, 2009

Running Bioclipse Plugin Unit tests: solving the XPCOM error

Sometimes you can feel so stupid. For example, when the answer is right in front of you, but only after many hours do you realize the right question belonging to that answer. For example, take this answer:
    add the line: -Dorg.eclipse.swt.browser.XULRunnerPath=/usr/lib/xulrunner
This is the problem I was trying to solve: I'm running 64bit Ubuntu Jaunty with Eclipse 3.4.2 for Bioclipse development. The answer above is the correct answer. So, I added the line, both to the $HOME/eclipse.ini and to the eclipse command line to start the program. But I still could not run Bioclipse plugin unit tests; I kept getting that stupid error:
    org.eclipse.swt.SWTError: XPCOM error -2147467262
    at org.eclipse.swt.browser.Mozilla.error(Mozilla.java:1638)
    at org.eclipse.swt.browser.Mozilla.setText(Mozilla.java:1861)
In retrospect, I was sort of asking the wrong question. I should have asked myself not why I got that XPCOM error even though I was using the solution, but why running the unit tests was not affected by that solution. Realizing that, it became so obvious: the plugin unit testing was using a clean environment, not based on the Eclipse environment I was working in; therefore, adding that line to my Eclipse environment did not help. Instead, I only had to add that line to the Run Configuration of my plugin unit tests too:

Surely, there are aspects to this which helped me overlook this solution. For example, I had installed Eclipse freshly yesterday, and then it worked fine. Only after installing some EMF and GEF features did it stop working again. Bitten by the correlation/causation pattern :(

Dear Advisory Board: which QSAR descriptors would you like to see implemented in the CDK?

Dear Advisory Board,

Ola Spjuth has recently been working on an extensive QSAR environment in Bioclipse, and molecular descriptors are provided using remote services but also using the CDK. The CDK has a relatively large collection of QSAR descriptors, but certainly not the full list discussed in the Handbook of Molecular Descriptors.

I'm sure everyone would appreciate a few more descriptors, and I am wondering which ones you would assign priority to. So:

which QSAR descriptors would you like to see implemented in the CDK?

Looking forward to hearing from you, preferably as a comment on this blog, or via email to the cdk-user mailing list, or otherwise directly to me. Make sure to include a full reference to the paper that describes the algorithm.

Kind regards,


JChemPaint-Primary being picked up...

Backporting the JChemPaint-Primary patch for master to the cdk-1.2.x branch turned out to be fairly easy, but is a major step forward as we now have a patch to extend CDK 1.2.x with rendering support again, a major thing we lost when going from the CDK 1.0 to the 1.2 series.

For example, KNIME delayed moving to CDK 1.2 because of the lack of the renderer. Another project that really wanted to have the renderer was Ruby CDK (rcdk-ng, but not the same as the R rcdk package :), originally started by Rich, now maintained by Sebastian Klemm at the Institute for Pharmatechnology of the University of Applied Science, Switzerland.

Ruby CDK is a web environment for molecular structures, and based on the new rendering code, it looks like (copyright by Sebastian):