Wednesday, May 30, 2007

ChemRank: ranking scientific literature

Mitch just launched ChemRank, a website where we can comment on and vote thumbs up or down for scientific articles. Good initiative I think. Some thoughts:
  • please include the DOI for each article overview on the front page (see why)
  • make the content opendata, e.g. using the CC license
  • provide a means to refer to other literature to back up comments and ranking
  • provide an API to make mashups (like that of Chemical blogspace for use in Greasemonkey scripts)
  • make the website source code opensource (JSON, RDF come to mind)
  • use microformats where possible (for Operator and FF3)
  • at least provide means for tagging articles
  • provide browsing by journal
  • import articles from Connotea/NatureNetwork/etc

Please consider these as feature requests, and not as criticism. Two of them are already listed in the developer's wishlist. I will likely come up with more later :)

Weka Decision Trees to Java Conversion

Some time ago I wrote a small Perl script to convert a decision tree created with Weka in the ARFF format to Java source code, for use in the ionization potential prediction in CDK. The advantage is that Weka is no longer used at runtime, and that there is no model that needs to be loaded and interpreted. Instead, it is simple Java code that does the work, much faster.

This is the code:

# Copyright 2007 (C) Egon Willighagen
# License: GPL

use diagnostics;
use strict;

my $filename = $ARGV[0];
die "Usage: $0 <weka-tree-file>\n" unless defined $filename;

print "double result = 0.0;\n";
open(my $input, "<", $filename) or die "Cannot open $filename: $!\n";
my $level = 0;
my $prevLevel = -1;
while (my $line = <$input>) {
  $line =~ s/\r?\n//g;

  # count the leading "|" markers: each one is one nesting level
  $level = 0;
  while ($line =~ /^\|\s*(.*)/) {
    $level++;
    $line = $1;
  }

  my $else = "";
  if ($prevLevel == $level) {
    $else = "else ";
  } elsif ($prevLevel < $level) {
    # we increase one level at a time
    for (my $i=0; $i<($level-1); $i++) { print " "; };
    print "{\n";
    $prevLevel = $level;
  } else {
    # this is a bit more tricky: we possibly need more than
    # one end bracket
    my $diff = $prevLevel - $level;
    for (my $closes=0; $closes<$diff; $closes++) {
      for (my $i=0; $i<($prevLevel-$closes-1); $i++) { print " "; };
      print "}\n";
    }
    $prevLevel = $level;
  }

  if ($line =~ /:/) {
    # a leaf: "condition: value (statistics)"
    my ($if, $then) = split(":",$line);
    for (my $i=0; $i<$level; $i++) { print " "; };
    # FIXME: java-fy $then
    if ($then =~ /([\d|_]*)\s*\(([^\)]*)\)/) {
      my $result = $1;
      my $stats = $2;
      $result =~ s/_/\./g;
      # close the block before the comment, or "//" swallows the bracket
      print $else . "if ($if) { result = $result; } // $stats\n";
    } else {
      print $else . "if ($if) { result = $then; }\n";
    }
  } else {
    # an inner node: the block for it is opened by the next line
    for (my $i=0; $i<$level; $i++) { print " "; };
    print $else . "if ($line)\n";
  }
}
close($input);

# OK, now add the rest of the closing brackets, including the
# one for the outermost block that was opened at level 0
for (my $closes=$prevLevel; $closes>=0; $closes--) {
  for (my $i=0; $i<($closes-1); $i++) { print " "; };
  print "}\n";
}
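
To give an idea of the result, here is a rough sketch of the kind of Java such a conversion produces for a tiny tree. The tree, the attribute names a and b, the leaf values, and the GeneratedTreeSketch class with its predict method are all made up for illustration; the script itself only prints the statements, so the surrounding method is added by hand here.

```java
// Hypothetical Weka tree (attribute names and values are invented):
//   a <= 1
//   |   b <= 2: 1_5 (5.0)
//   |   b > 2: 2_5 (3.0)
//   a > 1: 3_5 (2.0)
final class GeneratedTreeSketch {

    // The statements between the method header and the return mirror
    // the printed output; the wrapper method is an assumption.
    static double predict(double a, double b) {
        double result = 0.0;
        {
            if (a <= 1)
            {
                if (b <= 2) { result = 1.5; } // 5.0
                else if (b > 2) { result = 2.5; } // 3.0
            }
            if (a > 1) { result = 3.5; } // 2.0
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(predict(0.5, 1.0)); // 1.5
        System.out.println(predict(0.5, 4.0)); // 2.5
        System.out.println(predict(2.0, 0.0)); // 3.5
    }
}
```

No Weka classes are needed at runtime: the model is just nested if statements that the Java compiler can check and optimize.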

Tuesday, May 29, 2007

The JCIM is linking to Planet Blue Obelisk??

I use Google Analytics to analyze the visitors of my blogs and of Planet Blue Obelisk too. Now, for the past couple of weeks, the webpage of the Journal of Chemical Information and Modeling has been showing up as a referring site:

What is going on here ?!?! This is really not a fake, but I cannot find an actual link when I visit the journal webpage either...

Friday, May 25, 2007

Numbers are copyrighted?

I just read on Planet Blue Obelisk Peter's disturbing news (via Suber) that Wiley thinks it can copyright a set of numbers (also known as data). That is a sad milestone in scientific publishing. It reminds me of the recent hype about a long number flooding the internet (notably related to watching DVDs you legally bought). Some details can be found in this Linux Weekly News article on How Debian packages a number.

Interestingly, this is not a problem of commercial publishers or closed access publishing alone. Yesterday, while Christoph and I were working on getting the NMR spectrum text mining going in Bioclipse again for the workshop, we noticed that the open access Beilstein Journal of Organic Chemistry does not make Open Data a reality either: the experimental sections are generally (all?) excluded from the main text in HTML and obscured in .doc files in the supplementary information. BTW, this makes me wonder if organic chemists still consider the experimental properties of molecules novel science.

Friday, May 11, 2007

Added my hCard to my blog

Getting back to microformats (see yesterday), I added my hCard to the bottom of my blog:

I will likely populate it a bit more soon (after holiday in Sweden).

Now, if you had the Firefox plugin Operator installed, you would have my contact information show up in your FF toolbar, like this:

Note the 'Export Contact' button in the toolbar. This will automatically create a vCard which I can directly open in my address book (I use the KDE addressbook). Very nice integration!

Now, I already asked the author how the plugin could be extended to support chemical microformats. Just think of the feature "Export Molecule (137)" (e.g. to Bioclipse), when reading an HTML version of a paper in one of the Project Prospect enabled journals :)

Microformats in chemistry...

Peter blogged some days ago about microformats and how they could be used in chemistry. Being late and a bit absent minded, I added a short comment that Chemical blogspace supports microformats for chemistry, and that chemistry is harvested from that, and actually semantically distributed again using CMLRSS.

In reply to my comment, he wrote a follow up highlighting one of the blog items linked above (thanx for that!). Accidentally, he also published my Gmail account and IP address, which were really just for the blog owner to see who made the comment, and not for the world to harvest. This is a moment I am not so happy that Peter's blog is so popular ;) Peter, maybe be a bit more careful with copy/pasting next time.

Peter and Henry (still not in blogspace?) have been doing things along these lines for years now, often in different contexts. But getting these things going is a bit trickier. Actually, the uptake of the chemical microformats has been limited, and at least one alternative mechanism is being used: put the InChI in the @alt attribute on the <img> element. Other alternatives are possible too, such as recognizing molecules (or whatever else) based on a link to Wikipedia; linking to entries in Wikipedia is popular in Chemical blogspace.
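
As a rough illustration of the @alt mechanism, a harvester could pick the InChIs out of a page along these lines. This is only a sketch under assumptions: the InChIAltSketch class, its extract method and the sample HTML are invented for this example, attribute values are assumed to be double-quoted, and a regular expression is a shortcut where a real tool would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InChIAltSketch {

    // Matches an <img> tag whose alt attribute starts with "InChI=".
    // Assumption: attribute values are double-quoted.
    private static final Pattern IMG_ALT_INCHI = Pattern.compile(
        "<img\\b[^>]*\\balt\\s*=\\s*\"(InChI=[^\"]+)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns every InChI found in an img alt attribute, in document order.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = IMG_ALT_INCHI.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up blog fragment: one depiction of methane, one unrelated image.
        String html = "<p><img src=\"methane.png\" alt=\"InChI=1/CH4/h1H4\"/>"
                    + "<img src=\"logo.png\" alt=\"logo\"/></p>";
        System.out.println(extract(html)); // [InChI=1/CH4/h1H4]
    }
}
```

The appeal of this approach for bloggers is that it needs no extra markup beyond an image they were going to include anyway.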

One problem in getting microformats accepted, especially among chemists, is having tools available. Tools meaning dedicated plugins for blogging software that make it easy to add microformats to a blog item. You'd be surprised how uncommon raw HTML editing has become in the last 10 years. ::: Structured Blogging ::: is a provider of such tools. On the user side, there is this nice Firefox plugin that can extract information available in microformats, though Firefox3 is supposed to support some microformats natively.

Just today, Peter also blogged about a presentation by Berners-Lee with the nice circular phenomena in all these web technologies. The diagrams nicely visualize the complex social aspects of these new technologies. (I'm sure they apply to chemoinformatics too... who makes a chemoinfo variant?) RDF is the way to go; it's the machine interpretable (well, more accurate) microformat. All sorts of information is becoming available as RDF. For example, check out bibtex2rdf, Wikipedia as RDF, uniprotRDF, and BioGUID. Moreover, GRDDL might make this even more common.
I have been maintaining a bookmark list of RDF things happening. Check it out: the list is social and uses microformats.

Sunday, May 06, 2007

Preparing a Chemoinformatics workshop

After handing in a new draft of my PhD manuscript to my co-promotors last Friday, and a week before we leave for Sweden, it is time to start finishing up the material for my one hour workshop on chemoinformatics in general and QSAR/QSPR in particular for the Bioclipse Workshop.

Pierre blogged about this movie. It looks relevant:

Thursday, May 03, 2007

Cb comments for InChI's

About a year ago Pedro wrote a Greasemonkey script to add blog comments to the tables of contents of scientific journals. Noel extended it with support for Chemical blogspace (see also this earlier item). Now, the latter website is maintained by me, and I extended the aggregator software with molecule support, for example to show hot molecules on the frontpage (at some point my patches will be backported into mainstream. Euan, why not invite me to London HQ in, say, June?).

So, when we can show comments from the blogosphere for journal articles, why can't we do that for molecules too? Sure we can. It just needs some hacking. Right, and I did that today. The script works for PubChem:

It works for any <a href> element with a URL to PubChem like [InChI]. BTW, while the URL is not very readable, this might actually be a good way to hide InChI's, though I am sure Google will not index this InChI either.

And it also works for semantically marked up InChI's (using either microformats or RDFa):

You'll notice here that it is friendly with my Sechemtic script to make links to Google and PubChem.

The tools to make this happen involve a new Greasemonkey script (based on Noel's code), and a few patches to the software. The user script can be downloaded here. An entry on the Blue Obelisk userscript page will follow; check that page for more goodies.