Saturday, February 25, 2012

Groovy Cheminformatics 5th edition

This is more like the update interval I had originally in mind: one month per version. In fact, this is the first edition for the same CDK release as the previous edition, and thus is edition 1.4.7-1. Still, this release adds 12 new pages, consisting mostly of a new Chapter 14, about molecular descriptors, and the CDK API for descriptor calculation:

Other new content includes a short code example for generating 3D coordinates.

The paperback is available from an on-demand publisher, as is this ebook version.

More publishing innovation #2: Searching LaTeX equations in literature

After I reported yesterday on MathJax support by Chemistry Central, Bryan informed me of a cool Springer project:

LaTeXSearch lets you search for mathematical equations in literature, and also is helpful in providing the LaTeX source of those equations. For example, in the above screenshot I searched for RMSE. But there are more interesting queries, like all equations with ħ or with Q2.

Friday, February 24, 2012

More publishing innovation: MathJax support by Chemistry Central

The Blue Obelisk Descriptor Ontology has been using MathML for a long time, and it is good to see that MathJax for visualization of mathematical equations is becoming mainstream.

Thanx to Chemistry Central!

Thursday, February 23, 2012

CiTO / CiteULike: publishing innovation

Readers of my blog know I have been using the Citation Typing Ontology, CiTO (doi:10.1186/2041-1480-1-S1-S6). It allows me to see how the CDK is cited and used. CiteULike is currently adding more CiTO functionality, which they started doing almost one and a half years ago.

One of the new things is that the CiTO data added via a certain account can be downloaded as triples:

The second is that they are improving how it is visualized. For example, they added an 'Expand' link, which I found when they tweeted about a hidden drag-n-drop feature (which I have not found yet). Clicking that link will show you the following:

Because CiteULike takes advantage of the inverses of the CiTO predicates, they show up with the cited paper too, which is less suitable for the top-down flow graphics:
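To make the idea of inverse predicates concrete, here is a minimal sketch of how a CiTO statement and its inverse could be written as N-Triples. It is not CiteULike's implementation. The namespace and the property pairs are taken from the CiTO ontology as I understand it, the table is only a small subset, and the citing paper's IRI is a hypothetical placeholder.

```javascript
// CiTO namespace, assumed here to be the one published with the ontology.
var CITO = "http://purl.org/spar/cito/";

// A few CiTO properties and their inverses (illustrative subset only).
var INVERSES = {
  cites: "isCitedBy",
  usesMethodIn: "providesMethodFor",
  usesDataFrom: "providesDataFor",
  extends: "isExtendedBy"
};

// Serialize one statement as a single N-Triples line.
function toNTriple(subject, property, object) {
  return "<" + subject + "> <" + CITO + property + "> <" + object + "> .";
}

// Return the same statement as seen from the cited paper.
function invert(statement) {
  if (!Object.prototype.hasOwnProperty.call(INVERSES, statement.property)) {
    throw new Error("No inverse known for " + statement.property);
  }
  return {
    subject: statement.object,
    property: INVERSES[statement.property],
    object: statement.subject
  };
}

// Hypothetical citing paper; the cited DOI is the CiTO paper mentioned above.
var statement = {
  subject: "http://example.org/paper/citing",
  property: "usesMethodIn",
  object: "http://dx.doi.org/10.1186/2041-1480-1-S1-S6"
};
var inverted = invert(statement);

console.log(toNTriple(statement.subject, statement.property, statement.object));
console.log(toNTriple(inverted.subject, inverted.property, inverted.object));
```

The second printed line is the statement that would show up on the page of the cited paper, which is why the same relation appears in both directions in the graphics.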

To make this advertorial a bit balanced, not all my wishes have been implemented yet, and the next up from my perspective should be Linked Data. There is some Linked Data embedded as RDFa, but the latter is not turning out to be the killer I had hoped, and regular RDF entry points should be used.

Each CiteULike entry (post) should get a unique IRI (or URI), and opening that link should give RDF about that post (wish #10). That is dereferenceability. The RDF can be, for example, in BIBO, but there are many alternatives, and I have not been keeping up with which is the best (please leave a comment if you have an opinion on that).
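Dereferenceable IRIs are commonly served with HTTP content negotiation: the same IRI returns HTML to a browser and RDF to a client that asks for it in the Accept header. The sketch below shows only that selection step, under stated assumptions. The list of offered media types and the function names are hypothetical, and nothing here reflects how CiteULike actually works.

```javascript
// Representations a hypothetical post IRI could offer.
var OFFERED = ["text/html", "text/turtle", "application/rdf+xml"];

// Parse an Accept header into {type, q} items, highest preference first.
function parseAccept(header) {
  return header.split(",").map(function (part) {
    var pieces = part.trim().split(";");
    var q = 1; // default quality when no q parameter is given
    pieces.slice(1).forEach(function (param) {
      var kv = param.trim().split("=");
      if (kv[0] === "q") {
        q = parseFloat(kv[1]);
      }
    });
    return { type: pieces[0].trim(), q: q };
  }).sort(function (a, b) {
    return b.q - a.q;
  });
}

// Choose the media type to serve; fall back to HTML for browsers.
function chooseRepresentation(acceptHeader) {
  var wanted = parseAccept(acceptHeader || "*/*");
  for (var i = 0; i < wanted.length; i++) {
    if (wanted[i].q > 0 && OFFERED.indexOf(wanted[i].type) !== -1) {
      return wanted[i].type;
    }
  }
  return "text/html";
}

console.log(chooseRepresentation("text/turtle"));
console.log(chooseRepresentation("text/html;q=0.5, application/rdf+xml"));
```

A client that wants the RDF would then send, for example, `Accept: text/turtle` when opening the post's IRI.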

But I like where this is going! Thanx, CiteIReallyLikeThis!

Saturday, February 18, 2012

Chemical blogspace back online

I am happy that the Chemical blogspace is back in business. There were issues with the database earlier this week, but the website is back up. I am happy because of the pointers to interesting discussions and papers. For example, this funny entry in PubChem:

Really interesting too, is the difference in InChI between the Chemical blogspace and the PubChem entry.

Oh, and welcome back ChemBark! Check that URL; ChemBark is one of the oldest blogs on Chemical blogspace. Id = 8!

Friday, February 17, 2012

Chemical Interoperability

Below are the slides of the Chemical Interoperability presentation I gave this morning in Utrecht in a parallel meeting of the NBIC. The name was copied from the suggestion by Christine, who invited me. Most of the cited papers can be found on my Google Scholar profile, but also on Mendeley and CiteULike.

Wednesday, February 15, 2012

BridgeDB and Semantic Web for the Life Sciences

I joined the Bioinformatics group at Maastricht University (BiGCaT) at the start of this year. I have been using the semantic web in the past years in drug discovery (doi:10.1186/2041-1480-2-S1-S6) and toxicology (doi:10.1186/1756-0500-4-487), and organized a conference at the ACS meeting in Boston in 2010 (the matching article collection in the J. Cheminformatics). Here in Maastricht I’ll be working on BridgeDB and putting that to use in the Open PHACTS projects (and more).

BridgeDB was recently published (doi:10.1186/1471-2105-11-5) and is a combination of Open Source software, a web service, and Creative Commons-licensed mapping data that links many bio- and cheminformatics-related databases. To learn more about the BridgeDB source code, I created a Bioclipse manager (see doi:10.1186/1471-2105-10-397) to make BridgeDB functionality (both the library and the web service) accessible in the JavaScript environment. This is one of the scripts you can now try (of course, this is not how it will be used in Open PHACTS):

// BridgeDB is a database that knows about equivalence
// of data from different databases.

// The website has a number of examples of how to use
// the BridgeDB Java API to perform the various
// tasks it can do.

// This script repeats those examples, using BSL.

// The first example shows an Ensembl Human identifier
// being defined as a BridgeDB Xref resource. In BSL,
// this example can be reproduced as:

ref = bridgedb.xref("ENSG00000171105", "EnHs");
js.say("<a href=\"" + ref.getUrl() + "\">clicky</a>");

// The second example maps an Entrez Gene (code 'L') to
// Ensembl Human (code 'EnHs'):

dests = ...(
  "idmapper-bridgerest:" + "...",
  "3643", "L", "EnHs"
);
js.say(
  bridgedb.xref("3643", "L").getURN() + " maps to:"
);
for (i = 0; i < dests.length(); i++) {
  js.say(" " + dests.get(i).getURN());
}

// We can repeat this script also with a small
// modification in the first line to get mappings
// to any database, by simply dropping the target
// database code:

dests = ...(
  "idmapper-bridgerest:" + "...",
  "3643", "L"
);
js.say(
  bridgedb.xref("3643", "L").getURN() + " maps to:"
);
for (i = 0; i < dests.length(); i++) {
  js.say(" " + dests.get(i).getURN());
}

// The next example from the BridgeDB wiki page
// demonstrates that this approach works for other
// (bio)chemical entities, like metabolites. The
// example maps an entry from ChEBI (code 'Ce') to
// PubChem (code 'Cp'):

dests = ...(
  "idmapper-bridgerest:" + "...",
  "16811", "Ce", "Cp"
);
for (i = 0; i < dests.length(); i++) {
  js.say(" " + dests.get(i).getURN());
}

// Searching is wrapped in the Bioclipse extension too.

query = "3643";
hits = ...(
  "idmapper-bridgerest:" + "...",
  query, 100
);
js.say(query + " search results:");
for (i = 0; i < hits.length(); i++) {
  js.say(" " + hits.get(i));
}

// And so is identifier typing.

query = "NP_036430";
js.say("Which patterns match " + query + "?");
sources = bridgedb.guessIdentifierType(query);
for (i = 0; i < sources.length(); i++) {
  js.say(sources.get(i).getFullName() + " matches!");
}
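
The identifier typing step asks which patterns match an identifier, so the idea can be sketched with a few regular expressions. This is not BridgeDB's code or its pattern registry. The patterns below are simplified assumptions, and `PATTERNS` and `guessTypes` are hypothetical names used only for illustration.

```javascript
// Simplified, illustrative patterns. A real registry is more precise
// and covers far more data sources.
var PATTERNS = [
  { name: "Ensembl Human", regex: /^ENSG\d{11}$/ },
  { name: "Entrez Gene", regex: /^\d+$/ },
  { name: "RefSeq", regex: /^[NX][MPR]_\d+$/ }
];

// Return the names of all data sources whose pattern matches.
function guessTypes(identifier) {
  return PATTERNS.filter(function (source) {
    return source.regex.test(identifier);
  }).map(function (source) {
    return source.name;
  });
}

["NP_036430", "3643", "ENSG00000171105"].forEach(function (id) {
  console.log(id + " matches: " + guessTypes(id).join(", "));
});
```

An identifier can match several patterns at once, which is why the result is a list and why the script above loops over all returned sources.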

The Bioclipse version that has this functionality is available as an alpha release from this website for various platforms. The source code can be found on my GitHub account.

Monday, February 06, 2012

We need your submission! #openrescomp

Change is scary. Submitting your application paper to a new journal certainly is. When that journal requires you to have a strategy for code testing and maintenance, even more so.

Hence, the Open Research Computation dilemma.

As you may or may not know, I'm on the editorial board of the yet-to-kick-off journal Open Research Computation (ORC, official website). People are scared to submit, and even the editors are reluctant to submit work. I myself have used the excuse of having no time for not having submitted anything yet.

Indeed, application papers are the extra sugar, but projects and project deadlines favor a slightly different kind of paper. And, some uncertainty lies in the fact that ORC may not reach the same impact the NAR database special issue has.

However, I call upon everyone in the Open Science community to submit a paper to ORC, describing your documented software and how it is tested. We need a sufficiently filled pipeline to make this happen.


  • your CRAN/BioConductor package
  • your software you used for a data analysis of an already published paper (clue: it gives you a chance to cite that paper in a meaningful way)
  • your Cytoscape, Bioclipse, Taverna plugin that was too small for a BMC Bioinformatics / JChemInf paper
Well, surprise me! If you are uncertain about the minimal publishable unit for ORC, please contact Cameron.

Sunday, February 05, 2012

Gerrit: code review for Git

Since we are getting more and more in trouble with SourceForge :( I started looking into the more standard git code review environment, called Gerrit, so that we can use that for the CDK. After a steep learning curve, lurking, googling, seeing what Avogadro was doing (resulting in my first ever, trivial Avogadro patch), and a major headache, this is what my patch looks like in Avogadro's Gerrit install:

As you can see, Marcus reviewed my patch, and approved it. I am not sure if they have configured Gerrit to automatically push to GitHub, but that is an option.

I could not find documentation on how to set up Gerrit for a GitHub project, but ended up installing it anyway. That basically consists of setting up MySQL (or equivalent) with a user account and database, creating a Linux account, and then following the install instructions in the .war file. The next step is to register an account; it nicely picks up a Google Account, but OpenID seems supported too. The first account automatically becomes the administration account, which is a good choice indeed.

Finding the right documentation for creating a new project from an existing project was tricky, and I ended up with this instruction. However, typing the second step in that:

git push ssh://egonw@localhost:29418/cdk *:*

causes this output:

Counting objects: 181366, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (32350/32350), done.
fatal: Unpack error, check server log 4.30 MiB | 8.54 MiB/s   
error: pack-objects died of signal 13
error: failed to push some refs to 'ssh://egonw@localhost:29418/cdk'

With this stacktrace in Gerrit's error log:

Caused by: Invalid tree 00ba05c8a75c3fdd3022fd87d92694e87556acb8:mode starts with '0'
        at org.eclipse.jgit.transport.PackParser.verifySafeObject(
        at org.eclipse.jgit.transport.PackParser.whole(
        at org.eclipse.jgit.transport.PackParser.indexOneObject(
        at org.eclipse.jgit.transport.PackParser.parse(
        at org.eclipse.jgit.transport.ReceivePack.receivePack(
        at org.eclipse.jgit.transport.ReceivePack.service(
        ... 15 more

It seems I am running into this known bug :( I added my experience with Gerrit 2.2.2 to the report, but the original issue is over a year old. There is little comment, let alone a workaround, so this project is now on hold... :(
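
The "mode starts with '0'" message refers to the mode field of an entry in a git tree object. In the raw tree format each entry is the mode in ASCII octal, a space, the file name, a NUL byte, and a 20-byte SHA-1. The canonical mode for a directory is `40000`, and a zero-padded `040000` is what the strict checker rejects. The sketch below is not JGit's code. It assumes Node.js (for `Buffer`) and a hand-made tree body, and it only illustrates which entries would trigger the complaint.

```javascript
// Walk the raw body of a git tree object and report entries whose mode
// field begins with '0' (for example "040000" instead of "40000").
function findZeroPaddedModes(treeBody) {
  var offending = [];
  var pos = 0;
  while (pos < treeBody.length) {
    var space = treeBody.indexOf(0x20, pos); // end of the mode field
    if (space === -1) {
      throw new Error("Truncated tree: no space after mode");
    }
    var nul = treeBody.indexOf(0x00, space + 1); // end of the file name
    if (nul === -1) {
      throw new Error("Truncated tree: no NUL after name");
    }
    var mode = treeBody.toString("ascii", pos, space);
    var name = treeBody.toString("utf8", space + 1, nul);
    if (mode.charAt(0) === "0") {
      offending.push({ mode: mode, name: name });
    }
    pos = nul + 1 + 20; // skip the NUL and the 20-byte SHA-1
  }
  return offending;
}

// Helper that builds one tree entry with an all-zero placeholder SHA-1.
function entry(mode, name) {
  return Buffer.concat([
    Buffer.from(mode + " " + name + "\0", "utf8"),
    Buffer.alloc(20)
  ]);
}

// A tiny hand-made tree body: one zero-padded and one canonical entry.
var body = Buffer.concat([entry("040000", "src"), entry("100644", "README")]);
console.log(findZeroPaddedModes(body));
```

Here only the `src` entry is reported. Old history containing such trees cannot be fixed without rewriting the affected commits, which is why a server that rejects them blocks the whole push.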

Update: I figured out how to set up a git repository manually on the Gerrit server, and managed to push a patch for review: