Friday, May 26, 2006

Molecular indexing on the KDE and OS/X desktops

Geoff Hutchinson blogged about his OS/X ChemSpotLight, an indexing tool for chemistry documents. It is similar to, but more advanced than, the kfile_chemical and Kat tools I have been working on (with others) for the KDE desktop (see earlier blog items).

ChemSpotLight currently does more than the KDE tools: it adds Spotlight comments. I assume these are like the Linux extended attributes used by, for example, Beagle. A file indexed by Beagle will have extended attributes like:

# file: home/egonw/m43.jpg
user.Beagle.Filter="003 Beagle.Filters.FilterJpeg"
user.Beagle.Fingerprint="02 xHn5Yi58x0eoI8ityBYkUw"
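
A listing in this shape (a "# file:" line followed by name="value" lines, as the getfattr tool prints it) is easy to read back into a program. The following is only a minimal sketch in Python, not part of Beagle or any of the tools discussed here; the function name parse_xattr_dump is made up for illustration, and it assumes the dump looks exactly like the one above.

```python
def parse_xattr_dump(text):
    """Parse a getfattr-style dump into {filename: {attribute: value}}."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# file:"):
            # A new file section starts; following lines belong to it.
            current = line[len("# file:"):].strip()
            result[current] = {}
        elif "=" in line and current is not None:
            name, _, value = line.partition("=")
            # Values are quoted in the dump; drop the surrounding quotes.
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                value = value[1:-1]
            result[current][name] = value
    return result


dump = '''# file: home/egonw/m43.jpg
user.Beagle.Filter="003 Beagle.Filters.FilterJpeg"
user.Beagle.Fingerprint="02 xHn5Yi58x0eoI8ityBYkUw"
'''
attrs = parse_xattr_dump(dump)
print(attrs["home/egonw/m43.jpg"]["user.Beagle.Filter"])
```

Note that every attribute name carries a namespace prefix (user.Beagle.), which is what keeps one tool's metadata apart from another's.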

Extended attributes like these are well suited to carrying metadata, such as the comments in ChemSpotLight. Geoff's program adds metadata like the number of atoms and bonds, but it also calculates the SMILES and InChI on the fly. The InChI in particular is very good for indexing purposes, as it is a unique identifier for molecular structures, and even works for proteins.

Now, kfile_chemical is a kfile plugin. These kfile plugins only extract metadata from files, and have little to do with calculated metadata. Kat, on the other hand, is an indexing application and might be expected to add additional (derived or calculated) metadata as extended attributes, just like Beagle does. InChI and SMILES would be good candidates for that.
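
To make that idea concrete, here is a rough sketch of how an indexer could attach derived chemistry metadata to a file as extended attributes. It is not how Kat or Beagle actually work. The namespace user.chem and both function names are hypothetical, and the SMILES and InChI strings (for benzene) are passed in by hand; a real indexer would compute them with a cheminformatics toolkit. os.setxattr is Linux-only and needs a filesystem with extended attribute support, so the example only builds the attributes and does not write them.

```python
import os


def derived_attributes(smiles, inchi, namespace="user.chem"):
    """Build extended-attribute names and byte values for derived metadata.

    The "user.chem" namespace is a made-up example, not an existing convention.
    """
    return {
        f"{namespace}.smiles": smiles.encode("utf-8"),
        f"{namespace}.inchi": inchi.encode("utf-8"),
    }


def write_attributes(path, attributes):
    """Attach the attributes to a file (Linux only; needs xattr support)."""
    for name, value in attributes.items():
        os.setxattr(path, name, value)


# Benzene, with values supplied by hand rather than calculated.
attrs = derived_attributes("c1ccccc1", "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H")
for name, value in attrs.items():
    print(name, value.decode("utf-8"))
```

Calling write_attributes("benzene.mol", attrs) on a suitable filesystem would then store both identifiers next to the file, where any indexer that reads extended attributes could pick them up.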


  1. So what you're saying is that on KDE, there are two pieces? Previously you had said in e-mail that anything added by kfile_chemical would be automatically indexed.

    If that's so, it's a fine line between the two processes you mention.

    Apple's Spotlight does it all in one pass -- some metadata can be marked to be displayed to the user in the Finder browser. All data is available for searching.

    So for example, the fancy Unicode subscript numerals for the chemical formula are displayed -- but it would be difficult to search on those exactly, so there's a hidden formula attribute that matches C6H6 for benzene.

    The worry I have with two passes comes with formats like MDL Molfile or SMILES which have implicit hydrogens. How do you provide much real metadata unless you parse the real chemistry?


  2. The design of Kat is as follows: it has index plugins for full text, and for kfile plugins. These kfile plugins themselves are the KDE way of showing metadata extracted from the file. The full text plugins allow indexing of the whole file, and in principle allow adding derived fields.

    Kat does not yet have a special plugin architecture for creating derived metadata that it would index too. So, while the full text plugin could add the InChI, the indexer (Kat) would treat it as full text, and split it on the delimiters used for normal text. It is currently not possible to add a string pair like {"InChI", "InChI=1/bla/bla"}; there is no namespacing, for example. Linux extended attributes do allow this, and so does Spotlight.

    In conclusion, in the current design, parsing of the real chemistry would be done not by the kfile plugins (though this is actually done a bit, I think. Jerome?), but by the full text plugins. However, the design of the latter does not allow adding semantic meaning to the returned full text that is to be indexed :(
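
The difference between indexing an InChI as full text and storing it as a named field can be illustrated with a small Python sketch. The tokenizer below is only an assumption about how a generic full-text indexer might split text; it is not Kat's actual code, and both function names are made up for this example.

```python
import re


def full_text_tokens(text):
    """Split text on non-alphanumeric characters, as a plain-text indexer might."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", text) if token]


def field_entry(name, value):
    """Keep the value whole and attach it to a named field."""
    return (name, value)


inchi = "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H"  # benzene

tokens = full_text_tokens(inchi)
entry = field_entry("InChI", inchi)

print(tokens)  # fragments such as "C6H6" and "c1"; the identifier is lost
print(entry)   # the whole identifier, searchable under its field name
```

Once the string has been split on "/", "=" and "-", a search for the complete InChI can no longer match a single token, whereas the named field keeps it intact.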