Tuesday, August 28, 2007

NXClient on Ubuntu Gutsy

If you, like me, have already upgraded to Ubuntu Gutsy and use nxclient for remote login (highly recommended, though proprietary code), you might run into the problem that the login no longer works, returning the message "Cannot find KDE environment.". Ubuntu's Launchpad (generally an excellent service) was rather uncooperative and disregarded a bug report about the problem, so I found the solution myself with grep -ri kde /usr/NX:
/usr/NX/etc/node.cfg:CommandStartKDE="/usr/bin/dbus-launch --exit-with-session startkde"

The solution was to:
sudo aptitude install dbus-x11
which contains the dbus-launch executable, formerly found in the dbus package itself. I assume it works for "Cannot find GNOME environment" too.

Monday, August 27, 2007

XCMS on Ubuntu Feisty

I just installed XCMS 1.9.2 on my Ubuntu system. XCMS is a GPL-ed R package for metabolomics data analysis. Just for the record, you need to install the Feisty packages for NetCDF:
sudo aptitude install netcdfg-dev libnetcdf3
R CMD INSTALL --library=/usr/local/lib/R/site-library xcms_1.9.2.tar.gz

Friday, August 24, 2007

JChemPaint too: PNG embedded connectivity tables

Rich blogged about Firefly embedding MDL molfiles in PNG images, which I found really cool. Rich and Noel later showed how that metadata can be retrieved again, possibly with Python.

But I did not like that Firefly could do this, and JChemPaint could not. So, I started hacking. First I discovered I had to get rid of the use of JAI; then I had to adapt the JChemPaintPanel takeSnaphot() API to return a RendererImage; and finally, I had to figure out how to write the extra metadata. Now, Firefly is not open source (yet), so it took me some time to figure out how that was done, and this is how:
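// needs: javax.imageio.*, javax.imageio.metadata.*, javax.imageio.stream.*,
//        java.io.FileOutputStream, org.w3c.dom.Node
// pick the first PNG writer that can handle this image type
ImageTypeSpecifier specifier = new ImageTypeSpecifier(awtImage);
ImageWriter writer = ImageIO.getImageWriters( specifier, "png" ).next();
IIOMetadata meta = writer.getDefaultImageMetadata( specifier, null );

// add a tEXt entry with keyword "molfile" to the native PNG metadata tree
Node node = meta.getAsTree( "javax_imageio_png_1.0" );
IIOMetadataNode tExtNode = new IIOMetadataNode("tEXt");
IIOMetadataNode tExtEntryNode = new IIOMetadataNode("tEXtEntry");
tExtEntryNode.setAttribute( "keyword", "molfile" );
tExtEntryNode.setAttribute( "value", mdlMolfile);
tExtNode.appendChild(tExtEntryNode);
node.appendChild(tExtNode);
meta.mergeTree("javax_imageio_png_1.0", node);

// write the image together with the extended metadata
ImageOutputStream ios = ImageIO.createImageOutputStream(
    new FileOutputStream(filename)
);
writer.setOutput(ios);
writer.write( meta, new IIOImage(awtImage, null, meta), null );
ios.close();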

Now I can create my own test files for Strigi's ability to extract chemical metadata from PNG images. Here is the JChemPaint-generated PNG image for benzophenone:

Another issue, unrelated to this patch, is that writing PNG images changes the location of the structure in the JChemPaint editor, and that the placing of the element symbol in image writing is seriously broken. But that will soon be solved with Niels' new renderer.

The metadata looks like:

(Newlines are lost in the XML display.)

JChemPaint does not yet write InChIs, and it also does not open PNG images for input yet (as Firefly does).
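Reading the embedded molfile back is not something JChemPaint does yet, but a minimal sketch of how it could be done with the same javax.imageio API is given here. The class name PngTextReader and the method readText are made up for this illustration and are not part of JChemPaint or the CDK; the sketch assumes the molfile sits in a tEXt chunk with the keyword "molfile", as written by the code above.

```java
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.metadata.IIOMetadata;
import javax.imageio.stream.ImageInputStream;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of JChemPaint: looks up a tEXt entry in a PNG file.
final class PngTextReader {

    /** Returns the value of the first tEXt entry with the given keyword, or null if there is none. */
    static String readText(File png, String keyword) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(png)) {
            ImageReader reader = ImageIO.getImageReadersByFormatName("png").next();
            try {
                reader.setInput(in);
                IIOMetadata meta = reader.getImageMetadata(0);
                // same native PNG metadata format as used for writing
                Node root = meta.getAsTree("javax_imageio_png_1.0");
                return findText(root, keyword);
            } finally {
                reader.dispose();
            }
        }
    }

    // Depth-first search for a tEXtEntry element carrying the requested keyword.
    private static String findText(Node node, String keyword) {
        if (node instanceof Element && "tEXtEntry".equals(node.getNodeName())) {
            Element entry = (Element) node;
            if (keyword.equals(entry.getAttribute("keyword"))) {
                return entry.getAttribute("value");
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findText(children.item(i), keyword);
            if (found != null) return found;
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // usage: java PngTextReader image.png
        System.out.println(readText(new File(args[0]), "molfile"));
    }
}
```

The recursive search is a deliberate simplification: it does not care where in the metadata tree the tEXt node ends up, at the cost of visiting every node.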

Automatic Classification of thousands of Crystal Structures

Clustering and classification of crystal structures is hot. Parkin hit the front cover of CrystEngComm with a story on Comparing entire crystal structures: structural genetic fingerprinting (DOI:10.1039/b704177b). Now, the story itself, while rather interesting and well written, has three major flaws:
  1. the data set is way too small
  2. the proposed proof-of-concept is not novel at all
  3. they do not cite me
Well, the latter sounds a bit boohoo, and it is :) (BTW, I do like this paper.)

They propose the work as proof-of-concept, but use a very artificial data set of only 12 crystal structures (benzene and eleven polycyclic aromatic hydrocarbons, like naphthalene, anthracene, phenanthrene, triphenylene, pyrene, perylene, and coronene). While such a small set does make a nice example where you can still list all similarities (0.5*N*(N-1)), it is really too artificial.

Now, you may wonder if I am in the position to criticize this shortcoming, but I think I am. As part of my PhD work, I analyzed this problem myself, and two years ago I published the paper Method for the computational comparison of crystal structures (DOI:10.1107/S0108768104028344). Apparently, Parkin was not aware of this publication and did not cite it. I should have gone to a crystallography conference with a poster, and advertised my work more. In this paper, I analyzed a data set with 48 crystal structures, manually validated by visual inspection, which meant comparing 1128 (!) crystal structure pairs. Took me two full weeks behind a Silicon Graphics. Yes, I really understand why they took only 12 structures :)
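For the record, the pair counts for both data sets follow directly from the 0.5*N*(N-1) formula mentioned above:

```latex
\[
  \binom{N}{2} = \frac{N(N-1)}{2}, \qquad
  N = 12:\ \frac{12 \cdot 11}{2} = 66, \qquad
  N = 48:\ \frac{48 \cdot 47}{2} = 1128
\]
```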

However, there is more prior art. While my approach was based on a new radial distribution function-based whole crystal structure descriptor, my supervisor (Ron) used the more common powder diffraction pattern and showed, in Representing Structural Databases in a Self-Organising Map (DOI:10.1107/S0108768105020331), that it is a good enough descriptor for clustering thousands of crystal structures using a self-organizing map (SOM).

Last week, my second paper in crystallography appeared: Supervised Self-Organizing Maps in Crystal Property and Structure Prediction (DOI:10.1021/cg060872y). In this paper, we show how supervised SOMs (see DOI:10.1016/j.chemolab.2006.02.003) can be used for supervised classification and even for property prediction. Note that these supervised SOMs are truly supervised, unlike many earlier modifications of the unsupervised SOMs: the training is supervised.

Finally, another advantage of this last work: the code is open source. The code for the unsupervised SOMs is available as an R package: kohonen; and for powder diffraction patterns: wccsom. Details can be found in this R News issue. The first package is not actually limited to crystal structures, and can be used for any clustering problem. However, the articles mentioned here make use of simulated diffraction patterns, and I am not sure there are open source tools to generate those.

BTW, I would still be interested in teaming up with CrystalEye in one way or another, and coupling these data analysis methods to live streams of new crystal structures. Nick, let me know if you are interested in exchanging ideas.

Getting back to Parkin's paper, I do like the work. Hirshfeld surfaces are an interesting tool to visualize packing characteristics, and using them to describe a crystal structure sounds like an interesting idea indeed. I just hope that the method scales properly.

Wednesday, August 22, 2007

Dapagliflozin: the molecular structure

An anonymous reader reported that the American Medical Association published the structure of dapagliflozin. Here are the details.

The full name is (2S,3R,4R,5S,6R)-2-[4-chloro-3-(4-ethoxybenzyl)phenyl]-6-(hydroxymethyl)tetrahydro-2H-pyran-3,4,5-triol, and the PDF reports the CAS number 461432-26-8 and InChI=1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3/t17?,18?,19?,20?,21-/m0/s1.

I have added this information to Wikipedia, see the Dapagliflozin entry.

Operator 0.8 released: a new Sechemtic user script

Mike released Operator 0.8, which picks up RDF (RDFa and eRDF) from HTML pages, and adds actions to it. I blogged earlier about the beta and wrote a script for it for chemical RDFa. At this moment, Chemical blogspace and RDF for Molecular Space (see this blog) are using chemical RDFa to semantically mark up molecular information.

The new Operator release (download) has one notable API change: it now uses "RDF" as key for semantic information; the add-on now supports eRDF too. So, when installing or updating to version 0.8, you also need to update the Sechemtic user script to version 1.1 or better.

Installing Operator scripts is a bit more work than Greasemonkey userscripts. Save the script to your home directory, or any other place you can easily find on the hard disk. After installing the Operator add-on, click the Options button:

For the RDFa script to work, you need to make sure that the Display style is set to Data formats:

Then you can go to the User Scripts tab, and use the New button to add the script you downloaded and saved to your hard disk earlier:

Then, after restarting Firefox (which feels like MS-Windows :( ), you can go to Chemical blogspace and look up molecules, and see output like that described in RDFa Operator in action on Cb.

Monday, August 13, 2007

Touchgraphing my blog

Via SciFoo Planet (from Partial immortalization) I learned about TouchGraph Google (Peter brought it into Chemical blogspace). It's cool, though not open source. Here's the touch graph for my blog:

As you can see, plenty of blogspot bloggers around me, among which, in purple, Useful Chemistry. Funny thing is, each time I repeat the Google search, the output is different. Oh, and make sure to drag one of the halos around; that will keep you procrastinating for the whole afternoon :)

Centralized or decentralized?

Peter wondered if data should be stored centralized or decentralized, when Deepak blogged about Freebase and Metaweb. Now, I haven't really looked into these two projects, but the question of centralized versus decentralized is interesting. It's MySQL versus the world wide web; it's the PubChem compound ID versus the InChI; it's versus info:inchi/InChI=1/CH4/h1H4 (see RDF-ing molecular space).

Both have advantages and disadvantages (everything does). Google has huge experience with massive data, and is the centralized version of the distributed world wide web. Personally, I tend towards the decentralized version of things. Scales better. The chemical RDF community showed some concerns about scalability of triple stores (see e.g. Taylor et al. Bringing Chemical Data onto the Semantic Web, 2006, DOI 10.1021/ci050378m). Now, their tests went up to some 30M triples, which is barely enough to store the InChI, PubChem compound ID, and one chemical name.

So, how would this work for molecules then? I am leaning towards a system where one can query resources about one molecule, and work one's way through molecular space. Using KEGG, reaction databases, similarity stores, one could move from molecule to molecule, and add bits of RDF along the way, filling a local RDF store around the actual query I have in mind. For example, if I want to verify that the mass spectrum I found really belongs to the molecular structure I have in mind, I would look up in the resources I know about all triples that relate to the putative structure, and do my queries from there. That's what I would do... (and will do, but more on that later...)
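A rough sketch of that idea, purely as an illustration and not the code I have in mind for later: each resource is modeled as a function that returns the triples it knows for a molecule, and a breadth-first walk follows selected link predicates to neighboring molecules while filling a local store. Everything here is an assumption made for the example: the class LocalStoreWalk, the string-based Triple, the "ex:" predicates, and the toy data in main; a real version would sit on top of an actual RDF library and real resources.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch: fill a local triple store by walking from molecule to molecule.
final class LocalStoreWalk {

    /** One subject-predicate-object statement; plain strings keep the sketch simple. */
    static final class Triple {
        final String subject, predicate, object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override public boolean equals(Object other) {
            if (!(other instanceof Triple)) return false;
            Triple t = (Triple) other;
            return subject.equals(t.subject) && predicate.equals(t.predicate)
                && object.equals(t.object);
        }

        @Override public int hashCode() { return Objects.hash(subject, predicate, object); }

        @Override public String toString() { return subject + " " + predicate + " " + object; }
    }

    /**
     * Asks every resource about the start molecule, stores the answers, and follows
     * triples whose predicate is a "link" to other molecules, at most maxSteps away.
     */
    static Set<Triple> collect(String start,
                               List<Function<String, List<Triple>>> resources,
                               Set<String> linkPredicates,
                               int maxSteps) {
        Set<Triple> store = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(start);
        frontier.add(start);
        for (int step = 0; step <= maxSteps && !frontier.isEmpty(); step++) {
            Deque<String> next = new ArrayDeque<>();
            for (String molecule : frontier) {
                for (Function<String, List<Triple>> resource : resources) {
                    for (Triple t : resource.apply(molecule)) {
                        store.add(t);
                        // only unseen molecules reached via a link predicate are queued
                        if (linkPredicates.contains(t.predicate) && seen.add(t.object)) {
                            next.add(t.object);
                        }
                    }
                }
            }
            frontier = next;
        }
        return store;
    }

    public static void main(String[] args) {
        // toy data standing in for a real resource; the predicates are made up
        String methane = "info:inchi/InChI=1/CH4/h1H4";
        String ethane = "info:inchi/InChI=1/C2H6/c1-2/h1-2H3";
        Map<String, List<Triple>> toy = new HashMap<>();
        toy.put(methane, Arrays.asList(
            new Triple(methane, "ex:name", "methane"),
            new Triple(methane, "ex:similarTo", ethane)));
        toy.put(ethane, Collections.singletonList(new Triple(ethane, "ex:name", "ethane")));
        Function<String, List<Triple>> resource =
            m -> toy.getOrDefault(m, Collections.<Triple>emptyList());

        Set<Triple> local = collect(methane, Collections.singletonList(resource),
                                    Collections.singleton("ex:similarTo"), 1);
        for (Triple t : local) System.out.println(t);
    }
}
```

The step limit is what keeps the local store small: it holds only the neighborhood of the molecule the query is about, instead of the whole of molecular space.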

Saturday, August 11, 2007

Molecular Connectivity Tables in Images

Rich blogged Never Draw the Same Molecule Twice: Viewing Image Metadata, in which he shows his molecular editor outputting images of molecular structures where the connectivity table of the structure is embedded in the image. His molecular editor can read the image again, and will automatically pick up the embedded connection table. Noel showed that this can be done not only in Java, but in Python too.

This is important progress, though I would still like to see InChIs in the documents, and/or the data files as supplementary information. Actually, I would like even more to see that all experimental sections not just list the structure name, but give the InChI. An important spin-off is that when giving spectral information, the atom numbering given by InChI can be used to associate NMR shifts and IR wavenumbers with atoms and atom groups, removing the ambiguity in those associations that we are used to finding in the literature.

Chemistry Central is looking into improving the submission process for molecular data, and I hereby request that they comment on the following approaches, take them into account in the ongoing internal discussions, and incorporate them in the editorial requirements for CC publications:
  • including the connection table as metadata in images
  • including the InChI in experimental sections for newly synthesized molecules
  • use InChI atom numbering to associate NMR shifts with atoms in these experimental sections

I will shortly blog an example experimental section incorporating the InChI.

Molecules in Wikipedia without InChIs

I reported last week about the Molecules in Wikipedia and the plethora of templates used. Chemical blogspace has also been using Wikipedia URLs as molecular identifier and extracting InChIs from the wiki pages (see Using Wikipedia to recognize Molecules in Blogspace). Several people have shown interest in adding InChIs for molecules in Wikipedia, so here's a new version of the list of molecules without InChIs, in which each entry is marked "but no InChI/CID".

Strictly speaking, the list should be longer, as the code that produced this list actually is also happy when a PubChem compound identifier (CID) is given. The previous list is also still online.

Thursday, August 02, 2007

Molecules in Wikipedia

I do not care about physical and chemical properties in Wikipedia, as I can easily extract them from other sources. The main value of Wikipedia for molecules is, I think, that it describes the history of a molecule. Additionally, the Wikipedia URL is a nice unique molecular identifier (given certain conditions), and many bloggers are using it as such. But it is only a useful identifier if one (and only one) InChI is stated on the wiki page.

Now that I am RDF-ing molecular space, I was again interested in dbpedia, an RDF version of Wikipedia. See these two blog items and Peter's very nice dbpedia, RDF and SPARQL - for chemistry item. Christian is picking this up, and extending dbpedia with support for the various chemical boxes.

Wikipedia Templates
I have spotted a couple of templates: Drugbox, Chembox, Chembox new, of which the last one seems to be the most recent, and has extensions for explosives and drugs. The WikiProject Chemicals does not mention it though. Does anyone know the status? Is chembox new the way forward and going to replace the older chembox? I hope so, because only the newer one has InChI in the list of official fields. Or is chembox new simply an extension of chembox itself?

Somewhere between 1000 and 1500 entries use chembox new and another 1000 to 1500 use chembox, but I assume there is considerable overlap. Additionally, Christian noted that there still seem to be molecules in Wikipedia which do not use a template at all, and counted some 1900 molecules using various lists. If you want to keep a closer eye on chemistry in dbpedia, you should subscribe to the dbpedia-discussion mailing list.

Wednesday, August 01, 2007

Excel messes up your data analysis :)

Well, no wonder: Excel is meant to be used to process money flows. Anyway, greyarea pointed me to this nice blog item from March 2006. It discusses a 2004 article in BMC Bioinformatics, Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics, by Barry Zeeberg et al. (DOI:10.1186/1471-2105-5-80). Hence, the importance of semantics and proper markup languages. The quotes are illustrative:
    When we were beta-testing [two new bioinformatics programs] on microarray data, a frustrating problem occurred repeatedly: Some gene names kept bouncing back as "unknown." A little detective work revealed the reason: ... A default date conversion feature in Excel ... was altering gene names that it considered to look like dates. For example, the tumor suppressor DEC1 [Deleted in Esophageal Cancer 1] was being converted to '1-DEC.' Figure 1 lists 30 gene names that suffer an analogous fate.
    There is another default conversion problem for RIKEN clone identifiers of the form nnnnnnnEnn, where n denotes a digit. These identifiers are comprised of the serial number of the plate that contains the library, information on plate status, and the address of the clone. A search ... identified more than 2,000 such identifiers out of a total set of 60,770. For example, the RIKEN identifier "2310009E13" was converted irreversibly to the floating-point number "2.31E+13." A non-expert user might well fail to notice that approximately 3% of the identifiers on a microarray with tens of thousands of genes had been converted to an incorrect form, yet the potential for 2,000 identifiers to be transmogrified without notice is a considerable concern. Most important, these conversions to an internal date representation or floating-point number format are irreversible; the original gene name cannot be recovered.

Is this the article that made all bioinformaticians turn to R?
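To make the two failure modes from the quotes a bit more concrete, here is a small sketch that flags identifiers a spreadsheet might reinterpret. The class name ExcelRisk and both regular expressions are my own simplifications for illustration; they are not Excel's actual conversion rules, and they will not catch every case the article lists.

```java
import java.util.regex.Pattern;

// Hypothetical checker: flags identifiers that look like numbers or dates.
final class ExcelRisk {

    // digits, 'E', digits: reads as scientific notation, like the RIKEN identifiers
    private static final Pattern SCIENTIFIC = Pattern.compile("\\d+[Ee]\\d+");

    // month abbreviation plus one or two digits, like DEC1 (simplified assumption)
    private static final Pattern DATE_LIKE = Pattern.compile(
        "(?i)(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEPT?|OCT|NOV|DEC)-?\\d{1,2}");

    static boolean looksLikeNumber(String id) {
        return SCIENTIFIC.matcher(id).matches();
    }

    static boolean looksLikeDate(String id) {
        return DATE_LIKE.matcher(id).matches();
    }

    static boolean atRisk(String id) {
        return looksLikeNumber(id) || looksLikeDate(id);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"DEC1", "2310009E13", "TP53"}) {
            System.out.println(id + " at risk: " + atRisk(id));
        }
        // parsing the identifier as a number succeeds, and the original text is gone
        double parsed = Double.parseDouble("2310009E13");
        System.out.println("round trip equals original: "
            + String.valueOf(parsed).equals("2310009E13"));
    }
}
```

Running such a check on the identifier column before it goes anywhere near a spreadsheet would at least tell you which names to keep an eye on.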