Sunday, September 30, 2007

CompLife2007, Utrecht/NL; Taverna, EBI/Hinxton/UK

Two working days left before I'm off to two conferences. First, next Thursday/Friday, the two-day CompLife2007 in Utrecht/NL, with sessions on genomics, systems biology, medical information and data analysis, and, on the second day, tutorials on KNIME and CDK/Bioclipse. I will try to orient myself as much as possible around MS-based metabolomics, and metabolite identity in particular. Last year the conference was very interesting.

The Monday/Tuesday after that, I will be at the Taverna meeting to present the CDK-Taverna integration I worked on in 2005 (see e.g. Taverna on Classpath and CDK-Taverna fully recognized), work that Thomas later continued, leading to the plugin website. If time permits, I will prepare an example workflow from metabolomics. Unlike previous times I went to Cambridgeshire, I won't fly into Stansted, but take the EuroStar instead. I am very much looking forward to that. Unfortunately, I will not have time to visit Cambridge itself this time :(

Friday, September 28, 2007

SMILES to become an Open Standard

Craig James wants to make SMILES an open standard, and this has been received with much enthusiasm. SMILES (Simplified molecular input line entry specification) is a de facto standard in chemoinformatics, but the specification is not overly clear, which Craig wants to address. The draft is CC-licensed and will be discussed on the new Blue Obelisk blueobelisk-smiles mailing list.

Illustrative is my confusion about sp2 hybridized atoms, which use lower case element symbols in SMILES. Very often this is seen as indicating aromaticity. I have written up the arguments supporting both views in the CDK wiki. I held the position that lower case elements indicate sp2 hybridization, and the CDK SMILES parser was converted accordingly some years ago. A recent thread, however, stirred up the discussion once more (which led to the aforementioned wiki page).

You can imagine my excitement when I looked up the meaning in the new draft. It states: "The formal meaning of a lowercase "aromatic" element in a SMILES string is that the atom is in the sp2 electronic state. When generating a normalized SMILES, all sp2 atoms are written using a lowercase first character of the atomic symbol. When parsing a SMILES, a parser must note the sp2 designation of each atom on input, then when the parsing is complete, the SMILES software must verify that electrons can be assigned without violating the valence rules, consistent with the sp2 markings, the specified or implied hydrogens, external bonds, and charges on the atoms."
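To make the first half of that parsing rule concrete, here is a minimal sketch in Python of "noting the sp2 designation of each atom on input". It is only a toy tokenizer, not the CDK parser and not anything prescribed by the draft: the function name `sp2_flags` and the regular expressions are my own assumptions, it only knows the organic subset plus bracket atoms, and it does not attempt the second half of the rule (verifying that electrons can be assigned without violating valence rules).

```python
import re

# Atom tokens: bracket atoms like [nH] or [Na+], two-letter organic-subset
# symbols, then single-letter aliphatic and lowercase ("sp2") symbols.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

# Inside brackets: optional isotope digits, then the element symbol.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?|[a-z][a-z]?)")


def sp2_flags(smiles):
    """Return (element, marked_sp2) pairs, one per atom token in the SMILES."""
    flags = []
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            symbol = BRACKET_SYMBOL.match(token).group(1)
        else:
            symbol = token
        # A lowercase first character is the sp2 marking discussed above.
        flags.append((symbol.capitalize(), symbol[0].islower()))
    return flags


if __name__ == "__main__":
    print(sp2_flags("c1cc[nH]c1"))   # pyrrole, written with lowercase atoms
    print(sp2_flags("C1=CC=CC=C1"))  # Kekule benzene: nothing is marked
```

Note that the Kekule form of benzene carries no sp2 markings at all, even though its atoms are sp2; that is exactly the kind of ambiguity a clear specification has to settle.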

Monday, September 24, 2007

Google's view in history

Pierre pointed me to Google's view:timeline feature, which shows the search results on a time line, by recognizing phrases like "on 25 September 2000...". This is its view on the Chemistry Development Kit:

Friday, September 21, 2007

re: ACS RSS feeds are messed up

A couple of people have now confirmed the problem with the ACS journal RSS feeds. Being back behind my desktop machine, I can post the obligatory screenshot:

The feed for Chemical Biology shows 79 feed items and the first one was an Environ. Sci. Technol. paper. The first of the 108 papers listed in the feed for Molecular Pharmaceutics is a paper from J. Phys. Chem. C.

Tuning Google Results?

I was just about to install Subclipse (for the millionth time), and googled for the update site details:

Does anyone know how you can get Google to pick up the Download/UpdateSite/etc pages, shown as direct links below the primary hit, or how it detects them? Are HTML <link> elements used for that? Or does it use certain meta data, microformats, ...?

Thursday, September 20, 2007

SWT View with the new JChemPaint

The second Programmeerzomer, and the second summer of code for me, will end tomorrow with a presentation by Niels on his new JChemPaint code. The summer is over before you know it. One of the goals was making the JChemPaint editor Swing independent and easier to integrate with SWT widgets.

So, I hacked up the last bits of Bioclipse code. However, the CDK version in the net.bioclipse.cdk plugin is still CDK 1.0, and Niels' code requires CDK trunk/. So I copied the cdk.jar and cdk-jchempaint.jar from trunk/ into the net.bioclipse.cdk.progz plugin, only to find out that this gives binary problems: the plugin would still depend on net.bioclipse.cdk.ui, which depends on the CDK 1.0 plugin...

To get something going, I removed the dependency on CDKResource. So, the screenshot above is a bit artificial: it shows a static picture and the View does not react to ISelectionEvents. But that is a Bioclipse/CDK issue, and not caused by Niels' code. Additionally, it does not handle mouse events or so. For that, I need to make it an SWT Editor first.

Wednesday, September 19, 2007

Tagging Molecules: a mashup of Connotea and RDF

Using the InChI and the new website, it is now possible to tag molecules. And if you use Connotea for that, your tags will even show up on the website. For example, at the time of writing, methane was tagged with alkanes and gas:

The trick I use is that the website gives every molecule a unique HTTP URL. This simple web2.0 approach offers an enormous amount of possibilities. The simplest application is that you can tag your molecules with a set label, such as my-tomato-set; after all, Connotea is account based. In this way, you can do open notebook QSAR studies (though the activities would still be missing).

The aforementioned example, however, gives two classifications: Methane is an alkane and Methane is a gas (at room temperature). Not very well determined semantics, but it is web2.0, not the semantic web.

Interestingly, given some loosely defined semantics, you can use Connotea to link a molecule to a certain publication. For example, I can define Estrone is cited in the article with PubMed ID 15659855 using the tag pmid:15659855. I'm sure using the DOI would work too, using the tag doi:10.1107/S0108768104028344. I have not used these informal semantics in the website yet, but if there is such interest, I can have such functionality hacked in minutes.
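A minimal sketch of what reading such informal semantics could look like, in Python. This is not the code behind the website: the function names, the set of recognized prefixes and the "is cited in" / "is tagged with" wording are assumptions for illustration only.

```python
# Tag prefixes treated as literature identifiers (assumed set, easily extended).
KNOWN_PREFIXES = {"pmid", "doi"}


def parse_tag(tag):
    """Split 'pmid:15659855' into ('pmid', '15659855'); plain tags give (None, tag)."""
    prefix, sep, value = tag.partition(":")
    if sep and value and prefix.lower() in KNOWN_PREFIXES:
        return prefix.lower(), value
    return None, tag


def statements(molecule, tags):
    """Turn the tags on one molecule into loose (subject, predicate, object) triples."""
    triples = []
    for tag in tags:
        prefix, value = parse_tag(tag)
        if prefix is None:
            triples.append((molecule, "is tagged with", value))
        else:
            triples.append((molecule, "is cited in", prefix + ":" + value))
    return triples


if __name__ == "__main__":
    for triple in statements("Estrone", ["pmid:15659855", "my-tomato-set"]):
        print(triple)
```

Plain labels such as my-tomato-set fall through unchanged, so set labels and literature links can live side by side on the same bookmark.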

BTW, did anyone see Gene Ontology terms being used in social bookmarking services? For example, seeing a link to the PDB database with a tag go:0008152? Would be a bit cryptic, and, really, in this case rather minimalistic on information.

What's next?
Now comes the tedious task of tagging the QSAR data sets I used in my PhD research in this way. It's really something I have wanted to do for a while now. Next on my TODO list is the Greasemonkey script that adds the tags from Connotea to PubChem.

Sunday, September 16, 2007

ACS RSS feeds are messed up

Every beginning is difficult. The ACS must know that, but they still blame Google. In this blog Everyday Scientist mentions that the ACS RSS journal TOC feeds are sometimes messed up. I noted that too, but lived with it. The ACS generally is a very professional organization, but when I read they told ES that his Google RSS client was the problem, I just had to confirm his problems, hoping that some ACS representative can relay the message to their IT department.

Akregator is the RSS client that I use, and I noticed the exact same problem ES noted: every now and then, two or three times a month, the RSS feed for one journal gives the content of another journal. For example, I get the TOC of Chemical Reviews in the RSS feed of the JCIM. Things like that.

Now, because Akregator has absolutely nothing to do with Google, it cannot be Google that is messing up the RSS feeds. Because ES is having the same issues I have, it cannot be Akregator either. Ergo, it is the RSS feed system of the ACS itself that is messed up.

Why plain QSAR is not enough for me...

Amanda had a very nice post on Small molecules that modulate quorum sensing. It's the perfect read for a Sunday morning, when you have a view looking down on Strasbourg from a hill in the Black Forest. Biology fascinates me, particularly when small molecules are involved. And the molecular signaling used by these bacteria is just delightful. Make sure to read up on the small squids in 96-well plates too! (And we are worried about varkensflats, high-rise pig farms! That puts things in perspective :) These very small squids live in symbiosis with bacteria that light up under certain conditions, and this squid species learned how to control that lighting. Nerdy facts like this add the coolness factor that outliers in QSAR lack.

Small-molecule macroarrays
Another bit in Amanda's blog caught my eye too: the small-molecule macroarray. I had not seen that term before, and looked up the paper by Brown et al. Rapid Identification of Antibacterial Agents Effective against Staphylococcus aureus Using Small-Molecule Macroarrays (DOI:10.1016/j.chembiol.2007.03.006). Like the more famous (gene expression) microarray, these SMMs are arrays of wells where small molecules are connected to a planar cellulose support system, after which the antibacterial activity can be measured. Now, I do have to read up on this technology. For example, are the small-molecule inhibitors released into the assay medium at some point? That is, they will need to find their way to whatever protein they inhibit, as the protein will not go to the support system. Can anyone explain to me how the inhibition takes place?

Thursday, September 13, 2007

Outscoring old science

Rich posted a nice quote the other day on the introduction of the forward pass in football some 100 years ago, and linked that to science. I commented with the remark that the outscoring is the problem:
    The big question is: how do we measure our outscore? The other football teams would not have switched either if the success of the St. Louis team, the outscore, had been obscured.

    In openaccess publications, there is a slight outscore: higher impact for openaccess publications. But I do not feel this effect is as pronounced as in the football example.

    You got good statistics to impress people new to the forward pass in science?

Just after that, I read this blog by Antony on survival-of-the-fittest chemical search engine. Even though the measurement of the score is easy, these statistics can easily be obfuscated. Independent rankings, like Google Rank and Alexa Rank, may help.

However, what we really need is a direct competition. Us against them, old against new. I don't mind being in either group, as long as it is the fittest. But we urgently need to define what fittest is. Agreeing with Timo's statement (e.g. "It therefore troubled me that the initial counterattacks on PRISM were themselves often lacking in nuance and discrimination."), we need exact measures to do the discrimination. Each team prepares for the game, plays the competition, independent scoring, and there is your 142-11 outscore. PRISM versus PloS, Modgraph versus ACD/Labs, CDK against OpenBabel, KNIME versus Taverna, JOELib versus Dragon, microformats versus RDFa, openscience versus patents, PLS versus SVR, gemini versus single-tail surfactants, ... Bring on those competitions! Let the score be clear (open), fair, and discriminating!

Maybe this is something we should set up with the Blue Obelisk: a yearly competition, with various categories (think: databases, prediction, modeling, ...), with scientifically relevant judging.

May the best team win!

Friday, September 07, 2007

New InChI software beta: license issues resolved and InChIKey

The IUPAC/NIST team made a beta release of the next InChI software release:
    The principal new features of this release are:

    1. A fixed-length (25-character) condensed digital representation of the Identifier to be known as InChIKey. In particular, this will:

      • facilitate web searching, previously complicated by unpredictable breaking of InChI character
        strings by search engines
      • allow development of a web-based InChI lookup service
      • permit an InChI representation to be stored in fixed length fields
      • make chemical structure database indexing easier
      • allow verification of InChI strings after network transmission.

    2. Restructured InChI-generating software that separates key steps in its creation from an input chemical structure file. Among other uses, this allows checking of intermediate results to enable easier testing and development of InChI-based applications.

    3. Bug fixes designed to withstand malicious attempts to attack a Web server by providing a specially designed InChI string input to InChI binaries.

    We would welcome reports of your experiences with this new release and, of course, any problems.

I had heard about the InChIKey extension earlier, and it solves the issue some people have with the InChI: it is too long. Well, molecules can have many atoms indeed. It is important to realize the InChIKey is not a replacement: it simply is not unique. The collision probability is calculated to be rather small, though. But clashes may occur, and from the above statistics they seem quite likely for the number of molecules estimated to be drug-like, which is ~10^60. Moreover, these are theoretical probabilities which may not apply to the subset of molecules we actually tend to look at.
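To get a feeling for how the size of the collection changes the picture, here is a back-of-the-envelope sketch in Python using the standard birthday approximation. It is not the calculation from the InChI documentation. The key space is an assumption: I simply take an upper bound of 26^23, as if each of the 23 letters of the key that are neither the hyphen nor the check character could vary freely and uniformly, which the real encoding may well not allow.

```python
import math


def collision_probability(n_items, key_space):
    """Birthday approximation: P(at least one clash) ~ 1 - exp(-n(n-1) / (2N))."""
    expected_pairs = n_items * (n_items - 1) / (2 * key_space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-expected_pairs)


# Assumed upper bound on the number of distinct keys (see the text above).
ASSUMED_KEY_SPACE = 26 ** 23

if __name__ == "__main__":
    for n in (10 ** 6, 10 ** 9, 10 ** 60):
        print(n, collision_probability(n, ASSUMED_KEY_SPACE))
```

Under these assumptions a database of a billion structures is very unlikely to contain a clash, while at the scale of all drug-like molecules clashes become a certainty. The true key space is smaller than this upper bound, so the real probabilities can only be higher.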

Anyway, the InChIKey is not a unique identifier, and never use it as such; that's what you need to remember.

An interesting feature is the addition of a check character, which enables some verification of typos. Nothing is said about collision clashes there, which exist too. And the fixed length has its virtues too. That said, it certainly helps as a sort of prefiltering. Google does a quite decent lookup of InChIs nowadays, and there is a growing amount of semantic markup of InChIs, such as microformats, RDF/RDFa, HTML @alt attributes, and embedding in PNG images, to address the issues of the InChI length.

Two final comments, and I hope Alan, Steve, Igor, Steve and Dmitrii will pick this up:

  1. the InChIKey lost the version layer, which will cause trouble when the InChI moves to a next version (as in InChI=2/...). I would really like to see InChIKey=1/RYYVLZVUVIJVGH-UHFFFAOYAW as key instead.
  2. an online service to validate the key using the check character would be most welcome

LGPL license
Not reported in the above announcement is the fact that this release also addresses an issue brought forward by the opensource community. License ambiguity has been addressed, and it is reported that the release now clearly states the LGPL license in the distribution as well as in the source code headers. This will make packaging for, for example, Linux distributions possible.

One of the reasons why no Java port has been developed was the lack of modularization in the InChI software. This apparently has now been added, and I am very interested in reading about the modules now available. In particular, the canonicalization is interesting. The resulting atom ordering finds its use in chemoinformatics algorithms, and a standard for that is most welcome.

Maybe now is the time to develop a Java version of the software.

Double-charging your readers: quite unacceptable indeed

Peter has been doing an excellent job in advocating ODOSOS, and one of his posts even hit Slashdot.

Meanwhile, blogspace has been flooded with dislike of the PRISM initiative (e.g. see also the other Peter's blog). The website is so sad, it is almost funny again; but on second thought, it is so sad, you wonder whether the world will end because of WWIII or because of a total halt of scientific progress. It's so sad, it is hard to tell which is the fake one: the real webpage or this parody.

Wiley seems to be the king of commercial exploitation. While the suing over 6 data points seemed to be an incident, they now try to get their reading public to pay twice for published material: once for reading the paper (well, if you exclude incidental, oh-I-m-sorry-our-IT-department-messed-up attempts to have readers pay for open access papers; or was that another publisher?), and once for accessing the data (spectra) in that paper.

I am likely a bit too harsh on Wiley here. They do and have done an excellent job on dissemination of scientific knowledge. I just think that it would suit them well to allow taking advantage of current ICT/chemoinformatics technologies to improve the advance of science; I would say that should be a goal of a scientific publisher. Instead, they do not give explicit permission to reuse data from their publications, unless it involves the commercial exploitation of that database. Sure, curation is expensive, but chemoinformatics has advanced, and *very much* can be done with an uncurated database. There are enough people interested in setting up free databases, without that costing Wiley a penny. Why not allow that? Wiley is surely aware of this interest, so it is their turn now to act.

Sunday, September 02, 2007

A JChemPaint Hack-a-thon

Niels and I held a JChemPaint hack-a-thon today (the IRC log). We had a quite ambitious agenda:
  1. make the renderer modular
  2. make the controller modular
  3. make a controller interface with Swing + SWT implementations

All this to make the JChemPaint editor module of the CDK integrate more easily with non-Swing widget environments. We achieved about 50% of these goals: the controller is now modular, and the Controller2DHub (which will soon deprecate the old Controller2D) no longer receives Swing mouse events, but local events by implementing the new IMouseEventRelay interface.

Controller modules implement the new IController2DModule interface. This modularization allows a clean up of CDK source code, making it more readable and easier to maintain. This was attempted in the past by setting up an AbstractController2D and a SimpleController2D. The new approach, however, allows separate modules for each rendering mode, which are independent anyway. The old code still needs to be ported to the new architecture, and this is expected to happen in the next two weeks.

Another clean up in the architecture is that the controller modules no longer directly act on the IChemModel, but use a new (badly named) IChemModelRelay interface, making the architecture more closely adhere to the MVC concept. The IChemModelRelay API currently contains only two methods, but this is expected to expand considerably, because all current JChemPaint edit actions will have to be passed via this interface.

If you want to give the new architecture a test run, look for the TestEditor application. At the time of writing, it uses a demo module, the DumpClosestObjectToSTDOUTModule, which dumps the nearest IAtom to STDOUT.
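For readers who want the shape of the design without checking out trunk/, here is a toy Python stand-in for the three roles described above. It is not the CDK code: the real interfaces are Java (IMouseEventRelay, IController2DModule, IChemModelRelay), and every method name below (mouse_click, on_click, closest_atom) as well as the list-based model is a hypothetical of mine, chosen only to show how the hub relays toolkit-neutral events to a pluggable module that talks to the model through a relay.

```python
import math
from abc import ABC, abstractmethod


class MouseEventRelay(ABC):
    """Toolkit-neutral mouse events; a Swing or SWT adapter would call this."""

    @abstractmethod
    def mouse_click(self, x, y): ...


class ChemModelRelay(ABC):
    """The only view of the model that controller modules get to see."""

    @abstractmethod
    def closest_atom(self, x, y): ...


class ControllerModule(ABC):
    """One pluggable piece of controller behaviour."""

    @abstractmethod
    def on_click(self, model, x, y): ...


class ListModel(ChemModelRelay):
    """Hypothetical model: atoms as (label, x, y) tuples."""

    def __init__(self, atoms):
        self.atoms = atoms

    def closest_atom(self, x, y):
        nearest = min(self.atoms, key=lambda a: math.hypot(a[1] - x, a[2] - y))
        return nearest[0]


class DumpClosestModule(ControllerModule):
    """Mimics the demo module: print (and return) the atom nearest the click."""

    def on_click(self, model, x, y):
        label = model.closest_atom(x, y)
        print(label)
        return label


class ControllerHub(MouseEventRelay):
    """Receives relayed events and hands them to the active module."""

    def __init__(self, model, module):
        self.model = model
        self.module = module

    def mouse_click(self, x, y):
        return self.module.on_click(self.model, x, y)


if __name__ == "__main__":
    hub = ControllerHub(ListModel([("C1", 0, 0), ("O2", 5, 5)]), DumpClosestModule())
    hub.mouse_click(4, 4)
```

The point of the indirection is that nothing in the hub or the module imports a widget toolkit: a Swing listener and an SWT listener can both translate their native events into the same mouse_click call.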