Sunday, June 25, 2006

KDE4 keyword support mockups

In reply to interesting comments on my previous blog post on Strigi and xAttr support in KDE4, I would like to suggest the following mockups, which I would find very useful. They deal with the ability to store keywords, for example, but not necessarily, using xAttr. I have no idea how to implement these mockups, so any help or pointers are appreciated.

The first mockup is an example of how this keyword markup could be used in KDE, other than for searching itself. When showing the properties of a directory, KDE would show an overview of the hottest keywords for that directory, as is done on social bookmarking websites like Technorati too:

This example shows that the keyword 'Strigi' was used a lot inside the index_files directory (these are not just the keywords given for that directory itself, but a summary of the directory's content!). Now, these keywords could be stored as xAttr, but in a database too. The first requires a filesystem that supports xAttr, while the second requires a database daemon to be running. However, for performance reasons the latter would be required anyway. Strigi indexes xAttr now (post 0.3.0 release), and basically allows both.
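To make the "hottest keywords" idea a bit more concrete, here is a minimal sketch of the summary step, assuming the keywords of each file have already been collected somehow (from xAttr or a database; that part is left out). The names `KeywordList` and `hottestKeywords` are made up for this illustration and are not part of Strigi or KDE.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Keywords attached to one file. Where they come from (xAttr or a
// database) is deliberately left out of this sketch.
typedef std::vector<std::string> KeywordList;
typedef std::pair<std::string, int> KeywordCount;

// Hypothetical helper: count how often each keyword occurs across all
// files of a directory and return the `limit` most frequent ones.
std::vector<KeywordCount> hottestKeywords(const std::vector<KeywordList>& files,
                                          std::size_t limit)
{
    std::map<std::string, int> counts;
    for (const KeywordList& keywords : files)
        for (const std::string& keyword : keywords)
            ++counts[keyword];

    std::vector<KeywordCount> ranked(counts.begin(), counts.end());
    // Most frequent first; ties are broken alphabetically so that the
    // result is deterministic.
    std::sort(ranked.begin(), ranked.end(),
              [](const KeywordCount& a, const KeywordCount& b) {
                  if (a.second != b.second)
                      return a.second > b.second;
                  return a.first < b.first;
              });
    if (ranked.size() > limit)
        ranked.resize(limit);
    return ranked;
}
```

A properties dialog could then scale the font size of each keyword by its count. Counting on every dialog open would be slow for large directories, which is one more argument for keeping the counts in an index.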

Independent of the chosen/preferred way to store keywords, these keywords can be edited from the Properties dialog:

Now comes the tricky part: though I would like to add this to KDE, I do not have the C++/KDE experience to actually do this. I'm already happy that I was able to extend Strigi with support for KDE's kfile architecture. Yes, the Strigi version in SVN will index all metadata that can be extracted by the kfile plugins available in the KDE installation.

Thursday, June 22, 2006

Text mining for chemistry using OSCAR3

Peter Corbett from Peter Murray-Rust's group at the Unilever Cambridge Centre for Molecular Informatics visited Christoph Steinbeck's junior Research Group on Molecular Informatics at the CUBIC today, and spoke about the status of Oscar3, a chemistry text mining program released under the Artistic License. Oscar3, the successor of versions 1 and 2, can detect and extract molecular structures and experimental details from plain text articles, using a variety of text mining techniques.

The afternoon was spent on hacking Oscar3 into Bioclipse, with good success. It involved updating Oscar3 for the latest CDK and setting up a plugin infrastructure for Bioclipse. This plugin will allow mining (scientific) articles for chemical compounds and their properties from within Bioclipse. The outcome of today's hacking session was somewhat less ambitious and focused on the general infrastructure, and on making the OPSIN functionality in Oscar3 available as a wizard. OPSIN is an IUPAC name-to-structure tool and, amongst many other names, is able to recognize caffeine (InChI=1/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3):

Tuesday, June 20, 2006

Strigi gets kfile plugin support

With some help, I got the kfile stream analyzer for Strigi working. This means that Strigi will now index the meta data fields defined by the kfile-chemical plugins.

The reason it was not working earlier was that it segfaulted whenever a KDE class was created. That's something I really hate about C/C++: the lack of stack traces, though valgrind was helpful. It turned out that adding the line below fixed everything. A KInstance is needed when using KDE technology outside a KDE program:

KInstance instance( "strigita_kfile" );

Combined with the xattr support added by Jos earlier today, I hope this leads to an interesting new Strigi release soon! Now we only need to get editing of keywords into KDE4.

Dutch Summer of Code sponsors a Bioclipse project

The Dutch version of the Google Summer of Code today announced the five participating students. I was happy to see that Rob Schellhorn was selected with his project proposal for a Ghemical plugin for Bioclipse. As in the Google original, both the student and the mentoring organization are funded, with 3600 and 400 euro respectively.

Saturday, June 17, 2006

KDE desktop search: Kat, Strigi and Tenor

Desktop searching has become a hot topic (some earlier blogs), now that years of data have accumulated on one's hard disk: PDFs, documents, LaTeX manuscripts, old Java source code, digitized music, and a lot of chemical files. Well, on my hard disk that is. Unlike piles of paper, a computer can search this data, but given the size an index is required. What's KDE4 going to offer?

For the KDE desktop, Kat has offered this for more than a year, and later Kerry came along as a frontend to Beagle, though it does not have the nice integration with KDE kfile plugins. Since then, Kat development has come to a stop (unfortunately), and attempts to reach the main author (Roberto) have been unsuccessful. The last thing that happened was a rewrite of the database backend.

Additionally, Scott Wheeler proposed Tenor at FOSDEM 2005: "KDE 4: Beyond Hierarchical Data, The Desktop as a Searchable Web of Context". A semantic desktop; potentially cool, but I have heard little about it lately, except for some rumours that Scott has some actual code at home.

Now, Strigi (download) has come along, with a fast indexing engine, just the thing where Kat development seemed to have stopped. The design is different from that of Kat, but it does not seem unlikely that Kat code can be ported. There is no support for PDF or documents yet, but that's really the easy part, and kfile is on its way.

Getting back to Tenor, one might wonder how Strigi could implement Tenor concepts. A simple approach is at least to allow users to tag files, just like we have become used to with blogs and websites (e.g. Connotea). This could be easily implemented using extended attributes (xattr), already used by Beagle:

# file: home/egonw/1CRN.jpg
user.Tenor.Comment="Used in my ontologies presentation."

Obviously, this example shows not just these tags, but a user comment too. The idea, here, is that Strigi mines these attributes in addition to the file itself, so that search on tags can be done too. BTW, my argument to use this, instead of putting these things in the Strigi database itself, is persistence: data and metadata are kept together. KDE's file properties dialog would be extended with an extra tab that allows editing these fields.
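For those wondering what mining such attributes looks like in code, below is a minimal, Linux-only sketch using the getxattr(2) system call. It is not how Strigi does it; `readAttribute` and `splitTags` are names I made up, and storing tags as one comma-separated value (in, say, a hypothetical `user.Tenor.Tags` attribute) is just an assumption for this example.

```cpp
#include <sys/types.h>
#include <sys/xattr.h>  // Linux-specific getxattr(2)

#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Read one extended attribute of a file. Returns false when the file or
// the attribute does not exist, when the filesystem has no xattr
// support, or when the value changed size between the two calls.
bool readAttribute(const std::string& path, const std::string& name,
                   std::string& value)
{
    // A call with an empty buffer only asks for the size of the value.
    ssize_t size = getxattr(path.c_str(), name.c_str(), nullptr, 0);
    if (size < 0)
        return false;

    std::vector<char> buffer(static_cast<std::size_t>(size));
    size = getxattr(path.c_str(), name.c_str(), buffer.data(), buffer.size());
    if (size < 0)
        return false;

    value.assign(buffer.begin(), buffer.begin() + size);
    return true;
}

// Assumption of this sketch: tags are stored as one comma-separated
// string. Split it into trimmed, non-empty tags.
std::vector<std::string> splitTags(const std::string& value)
{
    std::vector<std::string> tags;
    std::istringstream stream(value);
    std::string tag;
    while (std::getline(stream, tag, ',')) {
        const std::size_t first = tag.find_first_not_of(" \t");
        if (first == std::string::npos)
            continue;  // skip empty entries such as in "a,,b"
        const std::size_t last = tag.find_last_not_of(" \t");
        tags.push_back(tag.substr(first, last - first + 1));
    }
    return tags;
}
```

An indexer would call `readAttribute(path, "user.Tenor.Comment", value)` for each file it visits and add the result to the index next to the file's own content.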

Strigi itself can be embedded in KDE applications to search for specific information (e.g. search molecular data within Kalzium using the InChI), and even in the FileOpen dialog. We need patches for KDE4 that allow this, soon.

Monday, June 05, 2006

Recent Developments of the Chemistry Development Kit

Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics discusses (reasonably) recent additions to the CDK. It appeared in issue 17 of this year's volume of Current Pharmaceutical Design, after spending too long in the queue following acceptance, but I am happy that it is out now.

The article discusses CDK's QSAR capabilities (the class designs and an overview of provided descriptors), the 3D model builder (see also C. Hoppe, CDK News, 1(2):4-5) and the interface to the statistical software R (see also CDK News, vol.2, issue 1). The article is part of a small special issue on Computational Applications in Medicinal Chemistry.

CDK's QSAR package comes with one main requirement: the outcome of QSAR descriptor calculations must be reproducible. "Science must be reproducible"; I'm sure someone once said this :) Therefore, each QSAR descriptor has a specification pointing to a unique algorithm found in an ontology (see diagram below). This QSAR descriptor ontology is maintained by the project, which is project independent, and even welcomes proprietary programs to discuss interoperability.

And calculated descriptor values are explicitly linked to this specification again, though it is up to the user to decide what to do with it:

Note that code has evolved since this publication, so class, interface and method names may have changed a bit.