Wednesday, June 27, 2007

Chemical RDFa with Operator in the Firefox toolbar

Last December I proposed the use of microformats and RDFa for simple semantic markup of molecular information. I linked that with the InChI extension of the software behind Chemical blogspace and wrote several tools that work with the new semantic markup.

Of the two, I think RDFa has the better future. Then I discovered Operator, written by Mike. While the Greasemonkey scripts already allow me to link to, for example, PubChem and eMolecules, the Operator Firefox add-on allowed me to open vCards embedded in HTML pages directly in my address book client. Thus, I could open chemistry directly in Bioclipse too!

That was the idea, at least. I contacted Mike, and he asked me to wait until the first 0.8 releases, which he announced earlier this month. This version allows user scripts to be written, which define how RDFa should be handled. And with his patience and help, this was the result:

The HTML is almost as explained before, and looks like:

<html xmlns="">

<h1>Chemical RDFa with Operator</h1>

<div about="#chem_123" xmlns:chem="">
Methane has the following identifier: <span property="chem:inchi">InChI=1/CH4/h1H4</span>
</div>

</html>

It is important here to wrap the statement in a <div> element and to add the @about attribute to it, defining the Subject. Moreover, you need to use the @property attribute instead of @class. The content of this attribute defines the Predicate, and the content of the <span> element is the Object, completing the RDF triple.
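To make the Subject/Predicate/Object reading concrete, here is a minimal sketch in Python of how such a triple could be pulled out of the markup. It is not how Operator does it; the class and function names are made up for illustration, and it assumes well-formed markup in which every element is closed and only @about and @property are used.

```python
from html.parser import HTMLParser


class RDFaTripleParser(HTMLParser):
    """Collect (subject, predicate, object) triples from a tiny RDFa subset."""

    def __init__(self):
        super().__init__()
        self.stack = []      # subject in scope for every open element
        self.capture = None  # (depth, subject, predicate) while inside a @property element
        self.buffer = []     # text collected for the current object
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.stack[-1] if self.stack else None
        # @about sets the subject; otherwise the enclosing subject is inherited
        self.stack.append(attrs.get("about", inherited))
        if "property" in attrs and self.capture is None:
            self.capture = (len(self.stack), self.stack[-1], attrs["property"])
            self.buffer = []

    def handle_data(self, data):
        if self.capture is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # Closing the @property element completes the triple
        if self.capture is not None and self.capture[0] == len(self.stack):
            _, subject, predicate = self.capture
            self.triples.append((subject, predicate, "".join(self.buffer).strip()))
            self.capture = None
        if self.stack:
            self.stack.pop()


def extract_triples(html):
    parser = RDFaTripleParser()
    parser.feed(html)
    return parser.triples


snippet = ('<div about="#chem_123">Methane has the following identifier: '
           '<span property="chem:inchi">InChI=1/CH4/h1H4</span></div>')
for triple in extract_triples(snippet):
    print(triple)
```

The element stack is what makes the enclosing <div> matter: the <span> itself has no @about, so it takes the subject from its parent.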

Operator detects these RDFa statements in the HTML, and creates a new menu item Search in PubChem using this piece of code:
var pubchem_inchi = {
  description: "Search in PubChem",
  short: "PubChem",
  // Only offer this action for RDFa statements with the matching property
  scope: {
    semantic: {
      "RDFa" : {
        property : "",
        defaultNS : ""
      }
    }
  },
  // Return the URL to open for the detected InChI
  doAction: function(semanticObject, semanticObjectType) {
    if (semanticObjectType == "RDFa") {
      return "" + semanticObject.inchi + "%22[InChI]";
    }
  }
};

SemanticActions.add("pubchem_inchi", pubchem_inchi);
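One detail worth keeping in mind with such actions is that an InChI contains characters like / and = that are better escaped when they end up in a query string. A small Python sketch of that step follows; the prefix below is a made-up placeholder, not the PubChem address used by the script, and the function name is hypothetical.

```python
from urllib.parse import quote

# Hypothetical placeholder; the real search address is not given here
SEARCH_PREFIX = "https://example.org/search?term="


def inchi_search_url(inchi, prefix=SEARCH_PREFIX):
    """Build a search URL for a quoted InChI followed by the [InChI] field tag."""
    term = '"' + inchi + '"[InChI]'
    # safe="" also escapes the slashes that separate the InChI layers
    return prefix + quote(term, safe="")


print(inchi_search_url("InChI=1/CH4/h1H4"))
```

The %22 in the JavaScript above is the same escaped double quote that quote() produces here.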

You can reproduce this by installing Operator 0.8a in Firefox, saving the script to a file in your home directory, and reading it via the Operator "Options" dialog. Make sure to also set the Display Style in the General tab of the dialog to Data formats. Only then will the RDFa magic kick in.

Adding support for eMolecules, ChemSpider and whatever else we like is easy now. What I still need to explore (or ask Mike) is how I can trigger the Open With/Save As dialog of Firefox.

QSAR plugin for Bioclipse getting in shape

Over the last few weeks I continued the work on getting (descriptor-based) QSAR/QSPR implemented in Bioclipse. JOELib (GPL) and the CDK (LGPL) are two prominent opensource engines that can calculate molecular descriptors, and AMBIT is a front-end.

To be able to do QSAR/QSPR model building from start to end in Bioclipse, I worked in April on an architecture for selecting descriptors. Being busy with so many things, it took me some time to get around to completing that, but here are the screenshots:

The funny characters and the whitespace are gone. Right now, it still only lists one provider, but I plan to add a JOELib plugin soon. The list of actual descriptors is provided by the extension.

What Bioclipse then does is have the extension calculate the descriptor values for the selected CDKResource in the BioNavigator using the selected descriptors. This creates a new MatrixResource in the Bioclipse workspace (currently called qsarResult.jam), which is opened in the Matrix editor:

There is still enough work left to do. For example, the columns are not yet labeled according to the descriptor name, and selecting more than one CDKResource in the navigator does not give a multirow matrix yet.

Monday, June 25, 2007

Test File Repository and RelaxNG

Last week I started the Blue Obelisk Chemical Test File Repository, a repository of OSI-approved-licenced test files (from various sources) to improve interoperability between chemoinformatics software.

Following an earlier discussion on the mailing list, a directory hierarchy has been set up, and each directory contains an index.xml to describe the content. For a directory with actual test files, it may look like:

<dir name="asn/pubchem/valid" xmlns:dc="">

  <file name="cid1.asn" valid="yes">
    <test by="CDK"/>
  </file>

</dir>

As is clear, Dublin Core is reused for much of the meta data.
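Such an index is easy to consume from a script. The sketch below, in Python with only the standard library, lists the files of one directory together with their valid flag and the tools that tested them. The sample document is my own reconstruction: the dc:description element and its text are hypothetical, and the namespace URI is the standard Dublin Core elements namespace, assumed here rather than taken from the repository.

```python
import xml.etree.ElementTree as ET

# Assumed: the standard Dublin Core elements namespace
DC_NS = "http://purl.org/dc/elements/1.1/"

# Reconstructed sample; the dc:description element is hypothetical
SAMPLE_INDEX = """<dir name="asn/pubchem/valid" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:description>Valid PubChem ASN.1 files</dc:description>
  <file name="cid1.asn" valid="yes">
    <test by="CDK"/>
  </file>
</dir>"""


def list_files(index_xml):
    """Return (path, is_valid, tested_by) for each <file> in an index.xml."""
    root = ET.fromstring(index_xml)
    rows = []
    for entry in root.findall("file"):
        testers = [test.get("by") for test in entry.findall("test")]
        path = root.get("name") + "/" + entry.get("name")
        rows.append((path, entry.get("valid") == "yes", testers))
    return rows


def description(index_xml):
    """Return the text of the (hypothetical) dc:description element, if any."""
    node = ET.fromstring(index_xml).find("{%s}description" % DC_NS)
    return node.text if node is not None else None


print(list_files(SAMPLE_INDEX))
print(description(SAMPLE_INDEX))
```

Note that this only requires the XML to be well-formed; checking it against a schema is a separate step.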

To improve and ensure some quality, the XML must be valid in addition to just well-formed, so that I can set up XSLT stylesheets to create XHTML indices and summaries. Therefore, I wanted to set up a schema for the index.xml files. My first thought was to use XML Schema, which has XML Namespaces support and well-defined (and extensible) data types. I have hacked in it in the past, but the details have slipped my mind. Already in 1998 I worked with DTDs, around the time that the XML specification was declared a recommendation. Originating from the SGML era, the DTD format is not XML based, has no knowledge of namespaces, and offers only a limited number of data types.

Then there is RELAX NG. It is XML based, uses the same data types as XML Schema and has support for namespaces. Since I had to look up the specs for either DTD or XML Schema for the details anyway (e.g. on how to allow the DC namespace in the main namespace), why not try something new? Well, I was amazed. RELAX NG has the syntactic simplicity of DTD, but the functionality of XML Schema. So, in 30 minutes I hacked up an XML spec for the test file repository, including a (too short) list of recognized MIME types. Just a combination of some <element>, <attribute>, <oneOrMore>, etc. elements. The result is available as schema.relaxng in SVN.

Nature should host our Electronic Lab Notebooks

Pedro suggested in the Nature Network What's Next forum that Nature should add a new service for scientists: hosting electronic lab notebooks. I think this will be a killer application. I am rather excited about the idea, and feel ashamed for not putting one and one together myself. We have our chemoinformatics tools and RDF is just around the corner; combine that with semantic wikis, and we have the science of the 21st century. This is my reply posted on Nature Network:

Pedro, that might be an interesting idea: Nature hosting ELNs. I have been maintaining a wiki, with much content, in my previous postdoc as a replacement for the old paper notebook. It allows me to make links etc. I plan to do this in my new postdoc too, maybe even with an RDF-enabled wiki, to have agents automatically verify what I enter for inconsistencies. These things are already possible; it is just a matter of doing it.

If Nature would host such a service (RDF-enabled, and integrated with their other pages), they have a true killer for me: I write my ELN items, and for each page I decide if I want to make it public; since it is a wiki, I can keep it private until happy about the results, or, simply, until the experiment has finished. Then, by clicking a button it would become CC+attribution and automatically end up in Nature Precedings. The full integration of Scintilla/Postgenomic/Connotea comes in when making links to background material.

The RDF is important for validating what I write, and I can imagine that Nature has an extensive set of default agents (of course, in addition to spell checking etc :). These agents check if the chemical reaction equations make sense (conservation of mass, atom count, etc), that NMR/MS spectra and other experimental properties are consistent with that equation, and whatever else we can come up with. The tools for this validation are available, and basically only the glue is missing.
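As an illustration of how simple such an agent can be, here is a minimal Python sketch of the atom count check for a reaction equation. It is only a toy under stated assumptions: formulas are plain element-count strings without parentheses, charges or isotopes, and the function names are made up for this example.

```python
import re
from collections import Counter


def count_atoms(formula):
    """Count atoms in a simple formula such as 'CH4' or 'H2O' (no parentheses)."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_total(side):
    """Sum the atoms over one side, given as (coefficient, formula) pairs."""
    atoms = Counter()
    for coefficient, formula in side:
        for symbol, n in count_atoms(formula).items():
            atoms[symbol] += coefficient * n
    return atoms


def is_balanced(reactants, products):
    """True if every element occurs equally often on both sides."""
    return side_total(reactants) == side_total(products)


# CH4 + 2 O2 -> CO2 + 2 H2O
print(is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")]))
# CH4 + O2 -> CO2 + H2O, with the coefficients forgotten
print(is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (1, "H2O")]))
```

If the atom counts match, mass is conserved as well, so this one check covers both items for simple cases.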

Friday, June 22, 2007

Archiving spectra: use InChI and CML

Ryan blogged in Archive This about some advice from ACD on how to store spectra in your electronic lab notebook.

Use InChI
This reminded me of a discussion I had with Colin when he was at the CUBIC, which was about experimental sections. I proposed that the InChI should have a prominent place in the experimental section. An important argument for this is that it allows well-defined atom numbering to be used when writing down the NMR bits in that section: the InChI gives a unique numbering, so that the numbering used in the experimental section becomes author neutral. Because the InChI puts the carbons up front, the 13C NMR details get numbers from 1-13, or whatever the carbon count is. For proton NMR it is not difficult either: the hydrogens are simply numbered according to the heavy atom to which they are attached. In situations where two hydrogens attached to the same heavy atom have different shifts, a and b can still be used. The numbers are easily added to 2D diagrams anyway.
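The "carbons up front" part can be read straight from the formula layer of the InChI. The Python sketch below derives the range of 13C labels that way. It assumes a single-component InChI whose formula layer starts with the carbon count, and the function name is made up; which carbon gets which number is decided by the InChI software and is not computed here.

```python
import re


def carbon_labels(inchi):
    """Return the labels 1..n for the n carbons named in the InChI formula layer."""
    formula = inchi.split("/")[1]  # e.g. 'C2H6O' from 'InChI=1/C2H6O/...'
    # A leading C that is not part of Cl, Ca, Cu, ... followed by an optional count
    match = re.match(r"C(?![a-z])(\d*)", formula)
    if match is None:
        return []
    count = int(match.group(1)) if match.group(1) else 1
    return list(range(1, count + 1))


print(carbon_labels("InChI=1/CH4/h1H4"))                  # methane
print(carbon_labels("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"))  # ethanol
```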

If software vendors (e.g. ACD and Bioclipse) and publishers (e.g. ACS, RSC, Chemistry Central) adopted this proposal, experimental sections would immediately be more machine parsable and ready for automatic processing, such as discussed in my blog item Chemical Archeology: OSCAR3 to and by Christoph at the ACS meeting, available as PDF and this 18MB MP3.

Even better is to use CML for this, or CMLSpect to be precise (paper is accepted, and should appear soon). This XML-based language allows the full semantic markup of all the experimental details and all the interesting assignments you want to archive. I would like to challenge ACD to follow Bioclipse's lead and provide export as CMLSpect for spectral assignments and markup of experimental details, in addition to the PDF in whatever format they prefer. Cheers for the work by Tobias and Stefan on spectrum support in Bioclipse!

Tuesday, June 19, 2007

A new job: post-doc at the WUR on MS based structure elucidation

On July 1st I will start a post-doc in Wageningen, The Netherlands, at the WUR. More precisely, it is a post-doc in the group of Prof. Van Eeuwijk at Biometris, cooperating with the group of Prof. Hall at Plant Research International (PRI), within the framework of the new Netherlands Metabolomics Center. The topic will be structure elucidation using mass spectral data originating from the experimental department of PRI, and will be a nice follow-up to the work on SENECA I did last year in the group of Dr. Christoph Steinbeck at the CUBIC.

Quality of Chemical Database

Lately, Chemical blogspace has seen an interesting discussion on the quality of opendata and free chemical databases (over 32 free resources now). For example, see Antony's view on the NMRShiftDB and Robien's analysis.

Opendata makes such quality assurance possible, and I am happy that the NMRShiftDB was explored like this; the found problems can be reported and corrected. If correcting them upstream is difficult, opendata allows one to make a better derivative; that's what opendata is about. For example, BioMeta (DOI:10.1186/1471-2105-7-517) took data from KEGG and corrected a lot of molecular problems (like reaction balancing, stereo chemistry, etc).

I have contributed almost 900 spectra to the NMRShiftDB, and I am sure I have made a mistake here and there. But each submission is verified by a reviewer, and furthermore, users of the database can report inconsistencies via the website. Now, I have focused on uncommon NMR nuclei, like 11B, 195Pt and 29Si (see the stats), which tend to have only one peak. Not much can go wrong there; still, one or two errors were caught by the reviewer.

Ensuring data quality
Humans make errors, and not only when entering data; they make mistakes when checking data too. Not much can be done about that, other than using computers to find patterns. This is exactly what Robien did: he used his software, which implements common patterns, to find entries in the database that did not comply with those patterns.

Automated quality assurance requires an easy-to-use, machine-readable interface. For example, CMLRSS (DOI:10.1021/ci034244p) can be used for running new entries in databases against known patterns. But other interfaces are most welcome too. Rich recently discussed the new PUG interface, which offers an interface to PubChem.

German scientists offer an RDF interface to Wikipedia: DBPedia. Informal semantic markup in Wikipedia, such as the Infobox template, is used to create triples. It's a shame that the ChemBox is not used yet, which would make detecting molecules in blogs even easier.

Monday, June 18, 2007

Using Wikipedia to recognize Molecules in Blogspace

Only a few people are using InChIs to indicate the molecules they blog about (prominent exceptions are Useful Chemistry and Molecule of the Day). Consequently, the number of detected molecules (without using OSCAR3) in Chemical blogspace has been low.

Fortunately, many more people use links to Wikipedia to identify the molecules they talk about. And some of these pages use the ChemBox template, which actually might contain a PubChem CID or even an InChI. This has increased the molecular content of Chemical blogspace considerably.

There is also, however, a good list of molecules in Wikipedia for which no CID or InChI is given.

I would really like to start adding InChIs for these molecules to Wikipedia, but someone needs to enlighten me about the state of ChemBox. Can the InChI be added to the template, or should the InChI be given elsewhere on the page? Adding such small bits is easier than writing a full entry.

Saturday, June 16, 2007

Paid summer jobs in chemoinformatics

Last year the Programmeerzomer sponsored one summer student to work on Bioclipse (see the announcement). The Programmeerzomer is much like the Google Summer of Code, where I mentor Alexandr. However, it is much smaller and oriented at just the NL area: both the student and the mentor need to be Dutch, but the opensource project does not.

Rob worked last year on a Ghemical plugin for Bioclipse (see this interview in Dutch). The architecture for doing calculations (the Compute plugin) is still being used within several other plugins. This year I got assigned two students: one for Bioclipse and one for Jmol.

I have no idea at this moment what ideas the students picked from the lists in the wikis (see the Jmol project idea and Bioclipse idea lists). There is a meeting scheduled on the 25th.

The ideas include:

If you have a suggestion, it would be much appreciated if you can add that to the wiki pages linked above. Make sure to leave a comment to this blog item too, announcing the new idea!

Sunday, June 10, 2007

Janocchio: Jmol and CDK based 1H coupling constant prediction

While looking up a reference for FirstGlance in Jmol, I found Janocchio, a CDK and Jmol based tool for prediction of coupling constants, recently published in Magnetic Resonance in Chemistry. It's written by Evans, Bodkin, Baker and Sharman (from Eli Lilly) and licensed LGPL. It is one of those rare contributions from the pharmaceutical industry, and I deeply appreciate it.

A quote from the article:
It was therefore decided to create a Java application and applet,
‘JAva NOe and Coupling Calculator with Handy Interactive Operation’
(Janocchio), using the open source libraries of the molecular viewer Jmol
and the Chemical Development Kit (CDK). It aims to provide a simple and
intuitive way to calculate both the NOEs and couplings.

Release 1.0.1 of last May uses an old Jmol, and the CDK release from 26 August 2005. A bit outdated, and I am wondering if it would be a lot of work to integrate this into Bioclipse. Maybe a summer job?

Saturday, June 09, 2007

Preprint servers: the CPS failed, how will Nature Precedings do?

Some 7 years ago, following successes in physics, the Chemistry Preprint Server (CPS) was launched, and Warr evaluated it in a JCIM article three years later. She wrote about 'lessons learned', but the only one seems to have been that chemistry was not ready for it, as the project shut down in 2004. The archives are still available, fortunately, and you may find it amusing to look up my or some other submission.

Now, Nascent wrote that Nature is setting up Nature Precedings, which was earlier noted by Pedro. The official announcement was published as an editorial in Nature. This being a Nature initiative, and not focused on just chemistry, I am sure it will do better than CPS. BTW, media coverage is tracked in a social way.

I might request a test account; I do have an old half-finished manuscript that I never got around to finishing. While still relevant, it could use some community input; this preprint server would be the perfect tool. That's how my first manuscript ended up on CPS too :)

Friday, June 08, 2007

Scientific Literature: searching, ranking, storage

Dealing with scientific literature has been one important theme in Chemical blogspace. For example, ranking articles and how to store your personal PDF archive have been topics of discussion. In this blog I will summarize bits of the discussion, and my personal view on things.

Searching literature is traditionally done in systems like Chemical Abstracts and Web-of-Science. The open nature of a growing number of repositories (e.g. the Dutch DARE) and indexing facilities like PubMed make these proprietary tools obsolete.

It is incorrect to assume that these paid services are the only trustworthy sources. Even WoS fails to make all the links between entries in the database. For example, I am aware of two missing citations to articles I have written, even though both the cited and the citing articles are available in the system. One of the citing articles was in Angewandte Chemie!

Additionally, some search services, like Google Scholar, have the advantage that they find copies and close variants of articles from proprietary journals on home pages and in open repositories. Today, I learned about Scientific Commons, which indexes and links to a staggering 1.5M publications, using, among others, PubMed and university repositories. Where possible it makes direct links to PDF versions of the article.

Mitch set up ChemRank, to which Peter, the ChemBlog and I replied. Afterwards, I learned that other services are available too, that allow, in addition to setting up an online personal literature database, voting and commenting on articles.

Apparently, CiteULike (CUL) supports this too. In contrast to ChemRank, CUL requires a login, which I personally see as an advantage, because I can browse literature bookmarked by other accounts I trust. There is also Connotea, but I never liked that site that much (e.g. it allows bookmarking any web page); Rich has his comments too. I would also like to mention BioWizard, which is based on the PubMed content, which actually covers a good deal of chemistry literature nowadays too.

Local Storage
The above-mentioned systems can be used as an alternative to offline bibliographic database systems, like EndNote and JabRef. The latter is my favorite: it is based on BibTeX, which I use for my LaTeX based publications, it is opensource, and it contains a few patches from yours truly. Jungfreudlich wondered how people organize their PDF archive, and I commented on how I do it:
  • a directory hierarchy based on journal name and year
  • file names that include last name of the first author and year
  • JabRef for the bibliographic database
  • Strigi for full text search
Jörg and the power of goo replied too.
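The first two points of that scheme fit in a few lines of Python. This is only a sketch: the exact file name pattern (last name directly followed by the year) and the root directory name are assumptions for the example, not a description of my actual archive.

```python
from pathlib import PurePosixPath


def archive_path(root, journal, year, first_author_last_name):
    """Build <root>/<journal>/<year>/<LastName><year>.pdf for one article."""
    file_name = "%s%d.pdf" % (first_author_last_name, year)
    return PurePosixPath(root) / journal / str(year) / file_name


print(archive_path("articles", "Magn. Reson. Chem.", 2007, "Evans"))
```

Two articles by the same first author in the same journal and year would collide under this pattern, so a suffix would be needed in practice.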

I have accounts on several online tools now (with some duplication which I don't like), and I have no idea which of the options will stay around. Time will tell. The good news is that the open character of many of these allows making mashups and, more generally, integrating tools. For example, JabRef allows downloading citations from PubMed, and Noel suggested using Greasemonkey scripts to link to the supplementary information for his articles, instead of using the mechanisms journals have. I can see the advantage of this, as, for example, Wiley takes full copyright of the data in SI material, while Noel's mechanism would keep the data open.

For now, however, I would very much like to see a meta service where I can query rankings and comments for articles using any or all of the above tools.

Tuesday, June 05, 2007

A Blue Obelisk corner in Chemical Blogspace

I just finished setting up a Blue Obelisk section for Chemical blogspace, as future replacement for the current Planet Blue Obelisk (unless someone wants to take over that webpage). The only thing really missing is an RSS feed for recent posts for just the Blue Obelisk member blogs (BTW, just email me if you want to be listed as BO member with your blog too; the BO community is very open!).

For now, you will have to make do with this page:

An additional flaw is that it also shows molecules for other blogs.

Update: the RSS feed for a specific category was already available, but just not from the Firefox URL bar. Instead, it is given on the right side of the posts page when you select a category. Here is a shortcut for the RSS for posts from the Blue Obelisk category.

Sunday, June 03, 2007

Finding email with Strigi in .tar backups

Now that my CUBIC desktop machine is shutting down, I made the necessary backups, among them a mail.tar for my mail correspondence of about a year. About 500MB in size for almost 8700 files. Strigi is a perfect tool to help me find messages in this archive, as it will recurse into the .tar archive, and even into email attachments. I created an index just for the archive with:
strigicmd create -t clucene -d index/ mail.tar
It took Strigi about 30 seconds to index the whole archive. That's good performance!

Now, Strigi indexes content full text, but also uses a controlled vocabulary (among which one specifically for chemistry). So I can search for email messages which have article in the subject with:
strigicmd query -t clucene -d index/ email.subject:article

However, From: and To: content was not yet extracted. That was easily patched. This allows me to find correspondence between me and, for example, Christoph:
strigicmd query -t clucene -d index/ AND email.from:Egon
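For comparison, the same kind of lookup can be approximated without an index, using only the Python standard library. This is explicitly not how Strigi works: the sketch below rescans the whole archive on every query, does not descend into attachments, and assumes that every regular file in the tar is a single message. The function names and the demo archive are made up for the example.

```python
import email
import io
import tarfile


def find_messages(archive, subject_word=None, sender=None):
    """Return names of tar members whose Subject/From headers match (case-insensitive)."""
    hits = []
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            message = email.message_from_bytes(tar.extractfile(member).read())
            subject = str(message.get("Subject", ""))
            from_header = str(message.get("From", ""))
            if subject_word and subject_word.lower() not in subject.lower():
                continue
            if sender and sender.lower() not in from_header.lower():
                continue
            hits.append(member.name)
    return hits


def build_demo_archive(messages):
    """Pack {name: raw message bytes} into an in-memory tar, for trying things out."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, raw in messages.items():
            info = tarfile.TarInfo(name)
            info.size = len(raw)
            tar.addfile(info, io.BytesIO(raw))
    buffer.seek(0)
    return buffer


demo = {
    "inbox/1": b"From: Egon\nSubject: draft of the article\n\nSee attachment.\n",
    "inbox/2": b"From: Someone Else\nSubject: lunch\n\nNoon?\n",
}
print(find_messages(build_demo_archive(demo), subject_word="article"))
```

With 8700 files a full rescan per query gets slow quickly, which is exactly why an index pays off.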