## Saturday, June 17, 2006

### KDE desktop search: Kat, Strigi and Tenor

Desktop searching has become a hot topic (some earlier blogs), now that years of data have accumulated on one's hard disk: PDFs, OpenOffice.org documents, LaTeX manuscripts, old Java source code, digitized music, and a lot of chemical files. Well, on my hard disk, that is. Unlike piles of paper, a computer can search this data, but because of its size an index is required. What's KDE4 going to offer?

For the KDE desktop, Kat has offered this for more than a year, and later Kerry came along as a frontend to Beagle, though it does not have the nice integration with KDE kfile plugins. Since then, Kat development has come to a stop (unfortunately), and attempts to reach the main author (Roberto) have been unsuccessful. The last thing that happened was a rewrite of the database backend.

Additionally, Scott Wheeler proposed Tenor at FOSDEM 2005: "KDE 4: Beyond Hierarchical Data, The Desktop as a Searchable Web of Context". A semantic desktop; potentially cool, but I have heard little about it lately, except for some rumours that Scott has some actual code at home.

Now Strigi (download) has come along, with a fast indexing engine, just the area where Kat development seemed to have stopped. The design is different from that of Kat, but it does not seem unlikely that Kat code can be ported. There is no support for PDF or OpenOffice.org documents yet, but that's really the easy part, and kfile support is on its way.

Getting back to Tenor, one might wonder how Strigi could implement Tenor concepts. A simple approach is at least to allow users to tag files, just like we have become used to with blogs (e.g. Technorati.com) and websites (e.g. Connotea). This could be easily implemented using extended attributes (xattr), already used by Beagle:

    # file: home/egonw/1CRN.jpg
    user.Tenor.Keywords="crambin"
    user.Tenor.Comment="Used in my ontologies presentation."

Obviously, this example shows not just the tags, but a user comment too. The idea here is that Strigi mines these attributes in addition to the file itself, so that searching on tags becomes possible too. BTW, my argument for using this, instead of putting these things in the Strigi database itself, is persistence: data and metadata are kept together. KDE's file properties dialog would be extended with an extra tab that allows editing these fields.
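To make the tagging idea a bit more concrete, here is a minimal sketch in Python of writing and reading such attributes. It is not Strigi or Tenor code: the function names are made up for illustration, storing several keywords as a comma-separated list is my assumption, and the `os.setxattr` family of calls is only available on Linux with a filesystem that allows user extended attributes.

```python
import os

# Attribute names taken from the example above. The "user." prefix is the
# Linux namespace for extended attributes that ordinary users may set.
KEYWORDS_ATTR = "user.Tenor.Keywords"
COMMENT_ATTR = "user.Tenor.Comment"


def split_keywords(raw):
    """Decode a keyword attribute value (bytes) into a list of tags.

    Comma separation is an assumption made for this sketch.
    """
    return [k.strip() for k in raw.decode("utf-8").split(",") if k.strip()]


def tag_file(path, keywords, comment=""):
    """Store keywords and an optional comment on the file itself."""
    os.setxattr(path, KEYWORDS_ATTR, ",".join(keywords).encode("utf-8"))
    if comment:
        os.setxattr(path, COMMENT_ATTR, comment.encode("utf-8"))


def read_tags(path):
    """Return the attributes an indexer could store next to the file content."""
    present = set(os.listxattr(path))
    result = {"keywords": [], "comment": ""}
    if KEYWORDS_ATTR in present:
        result["keywords"] = split_keywords(os.getxattr(path, KEYWORDS_ATTR))
    if COMMENT_ATTR in present:
        result["comment"] = os.getxattr(path, COMMENT_ATTR).decode("utf-8")
    return result
```

An indexer would call something like `read_tags()` for every file it visits and add the result to its index. Because the metadata lives on the file, it survives a rebuild of the index, though it can get lost when the file is copied to a filesystem or through a tool that does not preserve extended attributes.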

Strigi itself can be embedded in KDE applications to search specific information (e.g. searching molecular data within Kalzium using the InChI), and even in the FileOpen dialog. We need patches for KDE4 that allow this, soon.

1. Another option is using libferris for indexing and searching. See the top post here.

2. Why waste time reinventing the wheel when there is an actively maintained freedesktop alternative that outperforms Beagle in almost every area?

It's already integrated in Nautilus search and GNOME's deskbar, and is by far faster and lighter than anything else out there.

Oh, it's written in C and uses D-Bus too, so nice Qt bindings should be a doddle.

The product is called Tracker; see its website for further details:

http://freedesktop.org/wiki/Software/Tracker

3. Thanks for these links. Both look interesting, and I don't know why the wheel gets reinvented every time.

4. Tracker says it's desktop-neutral, but it depends on glib and the dbus-glib bindings both at build time and at run time... Libextractor also adds a gtk dependency.

Also, the stuff is GPL licensed, as is the embedded MySQL database it depends on. This ultimately prevents Tracker from being used throughout KDE by entering kdelibs at any future time, as it doesn't meet the hard requirement of being at least LGPL licensed. MySQL, unlike e.g. PostgreSQL, was already ruled out as a possible database backend for Tenor because of its too restrictive license.

5. Tracker is desktop neutral and depends on glib (same as other freedesktop stuff like HAL and GStreamer, which will no doubt be in KDE 4 in some capacity even if they are abstracted in Phonon and Solid).

The gtk support in libextractor is optional and not needed (it will compile without it), but please note that all metadata extraction is done out of process, so it can easily be replaced with some of KDE's metadata extraction stuff or other helper apps (likewise for the text filters, which are shell scripts that call helper apps).

Only the tracker daemon is GPL; everything else is LGPL, including libtracker (and because access to the tracker daemon goes over IPC via D-Bus, that is not constrained by the GPL either).

Postgres cannot be embedded, so it's useless as a DB for this kind of thing (you can't force everyone to be a DB administrator!). As above, since all DB access is done over IPC (D-Bus), the license of MySQL is a complete non-issue.

6. Why would you waste time on OpenOffice.org files (.sxw et al)? I suggest that you implement OpenDocument support instead. :-)

Seriously, OpenDocument is a cross-platform, cross-application standard, and as a representative for KOffice, I'd be very grateful if you didn't refer to it as the OpenOffice.org format.

7. Hi Ingwa, apologies for not mentioning the OpenDocument format. Two comments. First, OpenDocument documents are XML and already indexed by Strigi, so no worries :)

Secondly, I do want to waste time on old formats, because files in those old formats are exactly the ones I have forgotten about and want a search engine for.