Friday, February 23, 2007

Nature Network v2: cannot create a new group

Nascent reported that Nature Network v2 has gone live. Never too anxious to try something new, I created an account and signed in. I even joined two groups: Bioinformatics and Semantic Web for the Life Sciences.

But when I tried to create a new group, the system failed. It promised to send me an email for confirmation. I tried it twice via my Sourceforge email account. No email. I then changed the email address of my Nature account to my Gmail address. Still no email...

I am not located in Boston or London, is that the problem? Is being 'global' not good enough? Is the requirement to have two 'o's in the name? Cologne then, maybe?

(Missing) Features

For the rest, the system seems interesting. I am not too fond of having to create accounts all over the place (what was the password again???), but it looks promising. The thing I missed most when filling out my profile was a feature to import the list of my publications from Connotea.

Another thing I missed was the ability to mention my blog(s) in my profile. May I put this in as a request too? BTW, is there a group or forum on Nature Network where I can file these things?

Tuesday, February 20, 2007

Invisible InChI's

Some InChI's are short, such as that for methane: InChI=1/CH4/h1H4. Others are long (think crambin), and you don't want to show them inline. Or you just want to show them anyway, but still want the chemistry to be understood. Here come the invisible InChI's.

Alt text for images

One solution is to put the InChI as content of the @alt attribute of the HTML <img> element. This has the downside that it has no explicit semantic meaning. For example, the Molecule Of The Day blog is using this approach. It's an excellent start, but not the solution.

As Keyword

Another option is to put it in as keyword, in the HTML <head> element: <meta name="keywords" content="InChI=1/CH4/h1H4"/>. But Google does not index this, so the use is restricted.

Invisible text

The most promising alternative, however, is to put it in using the <span> element, in combination with microformats or RDFa, like this: . It does not show up, does it? But it is really there, as you would see if you had the special Greasemonkey script installed.

This is the HTML code for this example:
<span class="chem:inchi" style="font-size: 0%; visibility: hidden;">InChI=1/CH4/h1H4</span>

The @style attribute marks the text's visibility as hidden, and the font-size is set to 0%. It is important not to set it to zero itself, because many web browsers do not interpret zero font size correctly, and take the default font size instead.
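To illustrate that the hidden InChI is still available to software, here is a minimal sketch of how a script could pull it out of a page. It is not the Greasemonkey script mentioned above; it uses Python's standard html.parser module, and the names InChIExtractor and extract_inchis are made up for this example. It only assumes the class="chem:inchi" convention shown in the HTML code above.

```python
from html.parser import HTMLParser


class InChIExtractor(HTMLParser):
    """Collect the text of every <span class="chem:inchi"> element.

    Hypothetical helper for illustration; only the class name
    "chem:inchi" comes from the example markup in the post.
    """

    def __init__(self):
        super().__init__()
        self.inchis = []    # InChI strings found so far
        self._depth = 0     # > 0 while we are inside a matching span
        self._buffer = []   # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Count nested spans too, so the right </span> closes the match.
        if self._depth or "chem:inchi" in classes:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.inchis.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_inchis(html):
    """Return all InChI strings marked up with class="chem:inchi"."""
    parser = InChIExtractor()
    parser.feed(html)
    parser.close()
    return parser.inchis
```

Feeding it the span from the example returns a list with the single string InChI=1/CH4/h1H4, whatever the @style attribute says about visibility.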

This should solve the standing problem that we would like to include InChI's in our blogs, if only they were not so long and unreadable. Just hide them.

Update: Daniel informed me that Google won't index text marked 'visibility: hidden' and may even mark your webpage as spam :( Not the solution either. Read the comments for more thoughts.

Monday, February 19, 2007

Pimp my JavaDoc

Jörg's PhD book Data Mining und Graph Mining auf molekularen Graphen - Chemoinformatik und molekulare Kodierungen für ADME/Tox-QSAR-Analysen has a dump of the JavaDoc of the GroupContributionPredictor in JOELib (Figure 3.2, page 43). There are two nice things about the JavaDoc shown there: 1. it has links to Wikipedia; 2. it has a Further Reading section.

Now, the CDK has linked to a bibliography for some time already. However, it would just give a BibTeX key, and link to a webpage created from a BibTeXML file in which we store all references (cdk/doc/refs/cheminf.bibx). Putting the full citation inline makes the JavaDoc more informative, but I wanted to preserve the @cdk.cite mechanism we were using.

This weekend I hacked up a nice CDKCiteDoclet that would read the BibTeXML file with XOM, and convert items to HTML to put into the pimped JavaDoc:
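As a rough illustration of the idea only (the actual CDKCiteDoclet is written in Java and uses XOM, and is not reproduced here), the lookup-and-convert step could look like the Python sketch below. It swaps in the standard xml.etree module for XOM. The function name citation_html, the choice of fields, and the output layout are assumptions made for this example, and so is the BibTeXML namespace URI.

```python
import xml.etree.ElementTree as ET
from html import escape

# Assumption: the namespace URI used by BibTeXML files.
NS = {"b": "http://bibtexml.sf.net/"}


def citation_html(bibtexml, key):
    """Return an HTML snippet for the entry whose id equals key, or None.

    Hypothetical helper: the field selection and formatting are
    illustrative, not what the CDK doclet actually emits.
    """
    root = ET.fromstring(bibtexml)
    for entry in root.findall("b:entry", NS):
        if entry.get("id") != key:
            continue
        # The entry's first child names the type (article, book, ...).
        record = next(iter(entry), None)
        if record is None:
            return None
        # Map local field names (without namespace) to their text.
        fields = {child.tag.split("}")[-1]: (child.text or "").strip()
                  for child in record}
        parts = [
            escape(fields.get("author", "")),
            "<i>%s</i>" % escape(fields["title"]) if fields.get("title") else "",
            escape(fields.get("journal", "")),
            escape(fields.get("year", "")),
        ]
        return ", ".join(part for part in parts if part)
    return None
```

A doclet would call something like this for each @cdk.cite key it encounters and paste the returned snippet into the generated page.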

Saturday, February 17, 2007

Is that Jmol in that D-Wave demo?

Slashdot reported on D-Wave's recent demo of their 16-qubit quantum computing system. Videos of the demo can be watched on Google Video. The second video demonstrates the use of the machine in similarity searching:

Now, that screenshot does look like Jmol. The company's website does not give the answer, though Scott mentions C and Java front end software.

So, let's ask the source: Dear Dr. Rose, is it Jmol that we see in that demo?

Sunday, February 04, 2007

Writing up my PhD introduction chapter...

For the last twelve months or so, I have been doing two jobs (excluding hobbies of mine, such as Chemical blogspace): my postdoc in the group of Christoph Steinbeck on computer aided structure elucidation, and finishing my PhD. The topic of my PhD is the interplay between chemoinformatics and chemometrics: the first being strong in dealing with molecular structures, the latter strong in data analysis and mining, originally on experimental data. Really, I focused on a few existing problems, such as how to represent and analyze large libraries of crystal structures, the use of NMR spectra in QSAR studies, and two more practical problems regarding reproducibility of scientific results, which includes communication of data, and transferability of algorithms. Actually, I also studied fragment mining in QSAR for a set of transfactants, but that has not led to firm results yet.

The diagram below shows how I see the interplay between both fields:

Saturday, February 03, 2007

CDK Workshop - Days #3 and #4

Days #3 and #4 of the CDK Workshop have been quite busy indeed, and I have not been able to summarize them so far. After a rather interesting day #2, the third day was the last one with scheduled presentations. Kai Hartmann showed how he used the CDK in his systems biology research, and contributed the code he wrote to predict Gibbs energies based on fragment contributions. Miguel Rojas showed his MS prediction work, which is based on the CDK too.

Much of the rest of the day and Thursday continued the work started the day before: making the 3D structure builder a singleton class, and applying and testing an optimization for the AllRingsFinder to address molecules like Choloyl-CoA. The trick basically consists of applying the all rings finding algorithm to isolated ring systems only. The effect is considerable: the total computation time for Choloyl-CoA drops 93-fold! We found that the fingerprints used in the template library for the 3D structure builder are outdated, and Christoph worked on updating that, which required searching through old archives to find the tool to do just this.
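The partitioning step behind that trick can be sketched as follows. This is not the CDK code, just a small Python illustration under stated assumptions: the molecule is given as a plain adjacency dictionary, the function name ring_systems is made up, and a bond is treated as a ring bond exactly when removing it does not disconnect the graph (i.e. it is not a bridge). Rings that share only a spiro atom end up in the same system here.

```python
def ring_systems(adjacency):
    """Split a molecular graph into isolated ring systems.

    adjacency maps each atom index to the set of its neighbours
    (simple graph, one entry per bond direction). Returns a list of
    atom sets, one per ring system; acyclic atoms are left out.
    Uses recursion, which is fine for molecule-sized graphs.
    """
    order, low, bridges = {}, {}, set()

    def visit(atom, parent):
        # Depth-first search recording discovery order and low-links.
        order[atom] = low[atom] = len(order)
        for nbr in adjacency[atom]:
            if nbr == parent:
                continue
            if nbr in order:
                low[atom] = min(low[atom], order[nbr])
            else:
                visit(nbr, atom)
                low[atom] = min(low[atom], low[nbr])
                if low[nbr] > order[atom]:
                    # No back edge spans this bond: it is a bridge.
                    bridges.add(frozenset((atom, nbr)))

    for atom in adjacency:
        if atom not in order:
            visit(atom, None)

    # Group atoms that are connected through ring (non-bridge) bonds.
    systems, seen = [], set()
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        component, stack = {start}, [start]
        while stack:
            atom = stack.pop()
            for nbr in adjacency[atom]:
                if nbr not in seen and frozenset((atom, nbr)) not in bridges:
                    seen.add(nbr)
                    component.add(nbr)
                    stack.append(nbr)
        if len(component) > 1:
            systems.append(component)
    return systems
```

An exhaustive ring search would then be run on each returned atom set separately, so that the cost grows with the size of the largest ring system rather than with the whole molecule.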

Because the above performance fix did not fix the currently slow SMILES parsing, Kai looked at the DeduceBondOrderTool, which is the slow component, and optimized the algorithm by reusing previously determined molecular ring systems. Nevertheless, at users' request, a timeout mechanism is now available for SMILES parsing. Additionally, several of the bugs found on the second workshop day have been fixed. Meanwhile, I was distracted by other things, for example fixing Bioclipse bugs for version 1.0.1, released yesterday. The SENECA tool is not forgotten either: last weekend I made some good progress with it, which Christoph blogged about.

Thursday, February 01, 2007

RSC: the first publisher to go semantic!

Just announced: the RSC goes semantic! Colin Batchelor was here at the CUBIC last autumn, where we discussed the issues involved, mostly relating to the experimental sections of organic chemistry syntheses, and NMR and MS spectra in particular, so I knew that this was coming our way. The announcement reads:
RSC Publishing, the publishing arm of the Royal Society of Chemistry, is
pleased to announce a new initiative for its journals. From February
2007 electronic RSC journal papers will be enhanced so that their data
can be read, indexed and intelligently searched by machine, a first step
towards the "semantic web".

Readers will be able to click on named compounds and scientific concepts
in an electronic journal article to download structures, understand
topics, or link through to electronic databases; compounds and ontology
terms will be published as RSS feeds enabling automated discovery of
relevant research.

The initiative, coined 'Project Prospect', is the first of its scope
from a primary research publisher. Developed together with UK academics
based at the Unilever Centre of Molecular Informatics and the Computing
Laboratory at Cambridge University, the Project uses InChIs (IUPAC's
International Chemical Identifier for compounds); OBO ontology terms
(Open Biomedical Ontologies: a hierarchical classification of biomedical
terms) such as the Gene Ontology (GO) and the related Sequence Ontology
(SO); terms from the IUPAC Gold Book; and CML (Chemical Markup Language:
a means to describe molecular information in a structured form).

This is a completely free service for authors and readers of RSC
journals. The enhanced articles have an at a glance HTML view with
additional features accessed by a tool box. Downloadable compound
structures and printer friendly versions will be available via this new

Colin, cheers!