Wednesday, October 31, 2007

Offline CDK development using git-svn

While Subversion is a significant improvement over CVS, they both require a central server. That is, they do not allow me to commit changes when I am not connected to that server. This is annoying on a long train ride, or somewhere else without internet connectivity. I can pile up all my changes, but that would yield one big ugly patch.

Therefore, I tried Mercurial, where each client is a server too. The version I used, however, did not have the move command, so it put me back into the old CVS days where I lost the history of a file when I reorganized my archive.

Then there is Git, the version control system developed by Linus Torvalds when he found that existing tools did not do what he wanted them to do. It seems a rather good product, though with a somewhat steeper learning curve, because of the far more flexible architecture (see this tutorial). Well, it works for the Linux kernel, so it must be good :)

Now, SourceForge does not have Git support yet, so we use Subversion. Flavio of Strigi fame, however, introduced me to git-svn. That was almost two months ago already, but I finally made some time to try it out. I think I like it.

This is what I did to make a commit to CDK's SVN repository:

$ sudo aptitude install git-svn git-core
$ mkdir -p git-svn/cdk-trunk
$ cd git-svn/cdk-trunk
$ git-svn init
$ git-svn fetch -rHEAD
$ nano .classpath
$ git add .classpath
$ git commit
$ git-svn dcommit

The first git-svn command initializes a local Git repository based on the SVN repository. The git-svn fetch command makes a local copy of the SVN repository content defined in the previous command. Local changes are, by default, not committed, unless one explicitly git adds them to a patch. Once a patch is ready you can do all sorts of interesting things with it, among which committing it to the local Git repository with git commit.

Now, these kinds of commits are on the local repository, and I do not require internet access for that. When I am connected again, I can synchronize my local changes with the SVN repository with the git-svn dcommit command.

A final important command is git-svn rebase, which is used to update the local Git repository with changes others made to the SVN repository.

Monday, October 29, 2007

BioSpider: another molecule search engine

I just ran into BioSpider. Unlike ChemSpider, BioSpider crawls the internet (well, this list of sources really) to find information, and depending on what it finds it continues the search. Below is a screenshot of an intermediate point after starting with the InChI of methane:

After the search it generates a long HTML page with all the information it found on the molecule you queried for. This approach is much more scalable than storing all in one database.

This crawling of information is something I was working on myself a bit too, and I think this is a good approach. However, I think the use of a central website is not the right approach. Instead, the search should be distributed too: the crawling should be done on the client machine; it should be done in Taverna or Bioclipse instead.

My conclusion: excellent idea, bad implementation.

Friday, October 26, 2007

My FOAF network #1: the FOAFExplorer

In this series I will introduce the technologies behind my FOAF network. FOAF means Friend-of-a-Friend and
    [t]he Friend of a Friend (FOAF) project is creating a Web of machine-readable pages describing people, the links between them and the things they create and do.

My FOAF file (draft) will give you details on who I am, who I collaborate with (and other types of friends), which conferences I am attending, what I published etc. That is, I'll try to keep it updated. BTW, FOAF is an RDF language.

Pierre has done some excellent FOAF work in the past, and developed the MyFOAFExplorer, and also developed a tool to create a FOAF network based on the PubMed database, called SciFOAF. The latter is neat, but does not allow putting all these personal details in the FOAF files. However, the output could be a starting point.

Back to FOAFExplorer, this is what the FOAFExplorer shows for my network:

I'm a bit lonely, even though I have linked to two friends in my FOAF file, of which one has a FOAF file too (Henry):
<foaf:Person rdf:ID="HenryRzepa">
  <foaf:name>Henry Rzepa</foaf:name>
  <rdfs:seeAlso rdf:resource=""/>
</foaf:Person>
<foaf:Person rdf:ID="PeterMurrayRust">
  <foaf:name>Peter Murray-Rust</foaf:name>
</foaf:Person>

I guess the FOAFExplorer does not browse into my network. More on that in later items in this series.
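To make concrete what "browsing into" a FOAF network would involve, here is a minimal Python sketch that lists the friends in a FOAF file together with their rdfs:seeAlso links, which are the pointers a crawler would have to follow. The sample document, the example.org URL, the foaf:knows wrapper and the function name list_friends are assumptions made for illustration; this is not how the FOAFExplorer itself works.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by FOAF files written as RDF/XML.
NS = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}

# Hypothetical FOAF file; the seeAlso URL is a placeholder, not a real address.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:ID="me">
    <foaf:name>Example Owner</foaf:name>
    <foaf:knows>
      <foaf:Person rdf:ID="HenryRzepa">
        <foaf:name>Henry Rzepa</foaf:name>
        <rdfs:seeAlso rdf:resource="http://example.org/henry/foaf.rdf"/>
      </foaf:Person>
    </foaf:knows>
    <foaf:knows>
      <foaf:Person rdf:ID="PeterMurrayRust">
        <foaf:name>Peter Murray-Rust</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""


def list_friends(foaf_xml):
    """Return (name, seeAlso link or None) for every foaf:knows entry."""
    root = ET.fromstring(foaf_xml)
    friends = []
    for person in root.iterfind(".//foaf:knows/foaf:Person", NS):
        name = person.findtext("foaf:name", default="", namespaces=NS)
        see_also = person.find("rdfs:seeAlso", NS)
        link = None
        if see_also is not None:
            # rdf:resource is a namespaced attribute, so use Clark notation.
            link = see_also.get("{%s}resource" % NS["rdf"])
        friends.append((name, link))
    return friends


if __name__ == "__main__":
    for name, link in list_friends(SAMPLE):
        print(name, "->", link or "(no FOAF file linked)")
```

A crawler would fetch every link that is not None and repeat the same step on the downloaded file; friends without a seeAlso link are dead ends.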

Wednesday, October 24, 2007

One Billion Biochemical RDF Triples!

That must be a record! Eric Jain wrote on public-semweb-lifesci:

    The latest release of the UniProt protein database contains just over a
    billion triples*! PRESS RELEASE :-)

    The data is all available via the (Semantic or otherwise) Web:

    ...or can be bulk-downloaded from:

    * Counting some reification statements, and assuming no overlap between
    "named graphs".

    P.S. This should be the last you'll hear from me on this topic -- I'm off
    to new adventures...

I surely hope this is not the last we hear of this huge RDF collection.

My blog turned 2

A bit over two years ago I posted my first blog item, Chem-bla-ics, introducing the topic of my blog. In January this year I explained why I like blogging.

Friday, October 19, 2007

Bob improved the POV-Ray export of Jmol

Bob has set up a new interface between the data model and the Jmol renderer, which allows him to define other types of export too. One of these is a POV-Ray export, which allows creating high-quality images for papers. Jmol has had POV-Ray export for a long time now, but it never included the secondary structures or other more recent visual features. PyMOL is well-known for its POV-Ray feature, and is often used to create publication-quality protein prints. The script command to create a POV-Ray input file takes the output image size as parameters:
write povray 400 600   # width 400, height 600

Here's a screenshot of a protein with surface:

And here is an MO of water:

Note the shading. More examples are available here.

Thursday, October 18, 2007

More QSAR in Bioclipse: the JOELib extension

I added a Bioclipse plugin for JOELib (GPL, by Joerg) which comes with many QSAR descriptors, several of which are now available in the QSAR feature of Bioclipse:

Meanwhile, the Bioclipse team in Uppsala has set up the obligatory scatter plot functionality, but I leave that screenshot for them to show. Therefore, time for integration with R.

Open Data Misconception #1: you do not get cited for your contributions

The Open Data/ChemSpider debate is continuing, and Noel wondered in the ChemSpider Blog item about the Open Data spectra in ChemSpider. The spectra in ChemSpider come from four persons, two of whom released their data as Open Data (Robert and Jean-Claude) and two as proprietary data.

One of the latter two is Gary, who expressed his concerns in the ChemSpider blog that people would not cite his contributions if he released the data as Open Data:
    In principle, someone could download an assortment of spectra for a given molecule, calculate some other spectra, and then write a paper without ever recording a single NMR spectrum of their own. Would they then include the individual who deposited the spectra as a co-author or even acknowledge the source of the spectra that they used? Who knows.

It is a misconception that releasing your data as Open Data will lead to your scientific work not being acknowledged (citation statistics are the crude mechanism we use for that). First of all, using results without acknowledgment is called plagiarism (which is ethically wrong by any standard). But this is not a feature of Open Data; it is found in any form of science. Recall Herr Schön.

Some months back I advised another chemical database whose owners had similar concerns, and I pointed them, like I commented to Gary, to the CC-BY license, which has an explicit Attribution (BY) clause:
    Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

Using this license, plagiarism would not just be (scientifically) unethical, it would be illegal too, because it would break the license agreement. This even allows one to bring the case to court, if you like. (BTW, I was recently informed that the database had switched to the CC-BY license!)

Tuesday, October 16, 2007

Lunch at Nature HQ (with Euan, Joanna, Ian and Ålf)

On my way back from the Taverna workshop I visited Nature HQ, as Ian reported on Nascent. It was a (too) short meeting, but it was very nice to meet Euan (finally; he wrote the software which I use for Chemical blogspace), Joanna (whom I met in Chicago already, where she had two presentations, and who is responsible for Second Nature), Ian (who works on Connotea, and commented on my tagging molecule blog) and Ålf (who works on Scintilla) and briefly Timo (who rules them all). BTW, I had a simple but delicious pasta.

First, let me note that if I had to name a favorite molecule, it would be acetic acid, not ascorbic acid. The reason why it would be my favorite is that acetic acid was the first organic molecule I put in the Woordenboek Organische Chemie in 1995.

We discussed a number of things, regarding the things we do. One of these was tagging molecules. Ian used instead of The first was not yet picked up by but I fixed that.

We also discussed linking molecular structures with scientific literature. The discussions in blogspace of this week show that doing that by using computer programs is not appreciated by publishers (see here, here, here, here, here, and here). (The publishers seem to prefer sending off a PDF to India or China.)

I proposed that the InChI should be part of the publication, for all molecules mentioned in the article. If a journal can require exact bibliography and experimental section formats, they can certainly require InChIs too. There are few programs left which cannot autogenerate an InChI, and chemists draw the structures anyway. However, the software used in the editorial process does not support linking InChIs with a PDF (if only that software were open source ...).

So, the best current option seems to be social tagging mechanisms, and this is what we talked about. Just use Connotea (or any other service) and tag your molecule with a DOI:


This tagging is done manually. No machines involved in that. Nothing the publishers can do about this. No ChemRefer needed. But this will allow us to start building a database with links between papers and molecules, which we badly need. BTW, we will not have to start from scratch. The NMRShiftDB already contains many links, which is open data!

Now, you might notice the informal semantics of the doi: prefix. That's something I hereby propose, as it allows services to pick up the content more easily. You might also note the incorrect DOI in Connotea. The reason for that is that Connotea does not yet support a '/' in a tag. I reported that problem.
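As an illustration of why such a prefix helps, here is a minimal Python sketch of how a service could pick DOIs out of a list of free-form tags. The function name dois_from_tags and the sample tag list are made up for this example; it assumes the tagging service delivers tags as plain strings and can store a '/' in a tag, which Connotea could not at the time of writing.

```python
def dois_from_tags(tags, prefix="doi:"):
    """Return the DOIs found in tags that start with the given prefix."""
    found = []
    for tag in tags:
        # Match the prefix case-insensitively, so "DOI:..." works too.
        if tag.lower().startswith(prefix):
            doi = tag[len(prefix):].strip()
            if doi:  # skip a bare "doi:" without any content
                found.append(doi)
    return found


if __name__ == "__main__":
    tags = ["acetic-acid", "doi:10.1002/cmdc.200700041", "nmr", "doi:"]
    print(dois_from_tags(tags))
```

Everything without the prefix is simply ignored, so ordinary tags and DOI tags can live side by side on the same bookmark.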

ChemSpider: the SuSE GNU/Linux of chemical databases?

A molecular structure without any properties is meaningless. Structure generators can easily build up a database of molecules of unlimited size. 30 million in CAS, 20 million in ChemSpider or 15 million in PubChem is nothing yet. The value comes in when linking those structures with experimental properties.

Now, chemical industry, academia and publishers have done their best in the past 50 years to maintain such databases, and decided that a commercial model was the best option to do so. This was true 50 years ago, but no longer is. ICT has progressed so much that a 20M database can be stored on a local hard disc, or site repository anyway. Moreover, and more importantly, creating a database like this is much cheaper now. These ICT developments threaten the stone age chemical databases around now. Current approaches can easily build cheap and Open chemical databases, if only we all wanted to.

ChemSpider is attempting to set up the largest free chemical database, by mixing Open data with proprietary data. As such, they are attempting to achieve what SuSE and other commercial GNU/Linux distributions are trying to do: create a valuable product by complementing Open data with proprietary data when that adds value. That is, I think they are doing this. SuSE, for example, includes proprietary video drivers. ChemSpider, for example, contains proprietary molecular properties computed by ACD/Labs software (BTW, some of which can be done with Open tools too, as I will show shortly).

Now, this poses quite a challenge: different licenses, different copyright holders, requirements to provide access to the source (for the Open data), etc, all in one system. Quite a challenge indeed, because ChemSpider is now required to track copyright and license information for each bit of information. GNU/Linux distributions do this by using a package (.deb, .rpm) approach. And, the sheer size of the database poses strong requirements if people start downloading the whole lot.

ChemSpider has had their share of critique, but they are learning, and trying to find a way to set up a sustainable environment for what they want to do. That might involve a revenue stream from clients if there is no governmental organization, academic institute or some society stepping in to provide financial means. A valid question would be why they did not set up a non-profit organization. But neither did SuSE, RedHat and Mandriva, and that has not stopped those from contributing to Open source.

I have no idea where ChemSpider will end up (consider that a request for a copy of the full set of Open Data), but am happy to help them distribute Open data, and even help them replace proprietary bits with open equivalents, which I'm sure they are open to. With respect to the proprietary bits they are redistributing, I understand they can only relay the ODOSOS message to the commercial partners from which they get those proprietary bits, and hope they are doing so. ChemSpider has the great opportunity to show that releasing and contributing chemical data as Open Data does not conflict with a healthy self-sustainable business model.

Sunday, October 14, 2007

CompLife2007, Utrecht/NL. Day 1 and 2

CompLife 2007 was held 1.5 weeks ago in Utrecht, The Netherlands. The number of participants was much lower than last year in Cambridge. Ola and I gave a tutorial on Bioclipse, and Thorsten one on KNIME. Since a visit to Konstanz to meet the KNIME developers, I had not been able to develop a KNIME plugin, but this was a nice opportunity to finally do so. I managed, and wrote up a plugin that takes InChIKeys and then goes off to ChemSpider to download MDL molfiles:

Why ChemSpider? Arbitrary. Done PubChem in the past already. Moreover, ChemSpider has the largest database of molecular structures and is in that sense important to my research.

Why KNIME? Played with Taverna in the past, and expect to do much more work on Taverna in the coming year (see also this and this). Moreover, KNIME has a CDK plugin already, and the KNIME developers contributed valuable feedback to the CDK project in the last year. It was about time that I contributed something back, though the current functionality is quite limited. KNIME has a better architectural design than Taverna1, but will face tough competition from Taverna2, due next year.

The presentations
Heringa gave a presentation on network analysis, and discussed the scale-free network, hub nodes, etc, after which he gave an example on the 14-3-3 PPI family, which has both promoting and inhibiting capabilities. Fraser presented work on improving microarray data analysis, by reducing non-random background noise. Schroeter presented the use of Gaussian process modeling in QSAR studies, which allows estimation of error bars (see DOI:10.1002/cmdc.200700041). I did not feel the results were very convincing, but the method sounds interesting. Larhlimi presented research on network analysis of metabolic networks. His approach finds so-called minimal forward direction cuts, which identify critical parts in the network if one is interested in repressing certain metabolic processes. Hofto presented some work on the use of DFT for proteins, and picked up that one has to do things critically to be able to reproduce binding affinities. Combinations of DFT or MM with QM are becoming popular to model binding sites. Van Lenthe presented such an approach on the second day of CompLife.

By far the most interesting talk at the conference was the insightful presentation by Paulien Hogeweg. She apparently coined the term bioinformatics. Anyway, she had an exciting presentation on feed-forward loops in relation to evolution, and showed correlation between jumps in FFL motifs and biodiversity. She also warned us about the Monster of Loch Ness syndrome, where computational models may indicate large underlying processes which do not really exist. But that should be a problem that most of my readers are aware of. She introduced evolutionary modeling, to put further restrictions on the models, to reduce the chance of finding monsters.

Hussong had an interesting presentation too, if one is interested in analysis of GC/MS or LC/MS data. He introduced a hard-modeling approach for proteomics data using wavelet technology. His angle on this was to use a wavelet that represents the isotopic pattern of a protein mass spectrum. Interestingly, the wavelet had negative intensities, something which one will never find in mass spectra. However, I seem to recall a mathematical restriction on wavelets that would forbid taking the squared version of the function. He indicated that the code is available via OpenMS.

Jensen, finally, presented his work at the UCC on Markov models for protein folding, where he uses the mean first passage time as an observable to analyze processes in folding state space. This allows him to compare different modeling approaches and, for example, to predict how many time steps are needed to reach folding. Being able to measure characteristics of certain modeling methods, one is able to make an objective comparison. Something which allows a fair competition.
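For readers unfamiliar with the mean first passage time (MFPT), here is a small Python sketch of how it can be computed for a discrete-time Markov chain. This is the textbook calculation, not Jensen's actual method or code: the three-state model (unfolded, intermediate, folded) and its transition probabilities are invented for illustration. The expected number of steps t_i to first reach the target from state i satisfies t_i = 1 + sum_j P_ij t_j over the non-target states j, which is a linear system solved here by plain Gauss-Jordan elimination.

```python
def mean_first_passage_times(P, target):
    """Expected number of steps to first reach `target` from each state.

    P is a row-stochastic transition matrix given as a list of lists.
    """
    n = len(P)
    others = [i for i in range(n) if i != target]
    m = len(others)
    # Augmented matrix for (I - Q) t = 1, with Q restricted to non-target states.
    A = [
        [(1.0 if i == j else 0.0) - P[i][j] for j in others] + [1.0]
        for i in others
    ]
    for col in range(m):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("target is not reachable from every state")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    times = {target: 0.0}
    for k, i in enumerate(others):
        times[i] = A[k][m] / A[k][k]
    return times


# Hypothetical toy model: 0 = unfolded, 1 = intermediate, 2 = folded.
TOY_MODEL = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.00, 1.00],
]

if __name__ == "__main__":
    print(mean_first_passage_times(TOY_MODEL, target=2))
```

For this toy model the system can be checked by hand: t_0 = 2 + t_1 and 0.25 t_1 = 1.5, so it takes on average 6 steps from the intermediate and 8 steps from the unfolded state to reach the folded state. Two modeling approaches could be compared by computing this number for each.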

Why ODOSOS is important

I value ODOSOS very highly: they are a key component of science, and scientific research, though not every scientist sees their importance yet. I strongly believe that scientific progress is held back because of scientific results not being open; it's putting us back into the days of alchemy, where experiments were like black boxes and procedures were kept secret. It was not until the alchemists started to properly write down procedures that it, as a science, took off. Now, with chemoinformatics in mind, we have the opportunity to write down our procedures in high detail.

I keep wondering what the state of drug research would be if the previous generation of chemoinformaticians had valued ODOSOS as much as I do. Now, with a close relative being diagnosed last week with a form of cancer with low five-year survival rates, I cannot get more angry about those who want to make (unreasonable) money by selling scientific research. A 1M bonus is unreasonable. I can have 10 post-docs work on chemoinformatics research for the same period; I can have them work on drug design for various kinds of cancer.

Therefore, I will continue to use every opportunity to convince people of ODOSOS, and will continue to develop new methods to improve accurate exchange of scientific data and experimental results. I will help people where I can to distribute open data, even if the whole project is not 100% ODOSOS. For example, the Chemistry Development Kit is open source itself (LGPL), which does allow embedding into proprietary software. This does not mean that I will contribute to the proprietary software, and I am actually proud of not having done so in the last 10 years.

I will continue to advise people how to make their work more ODOSOS, even if they cannot make the full transition. I will also continue to make sure that all my scientific results are ODOSOS, as there is no other kind of science. To set a good example, and, hopefully, to lead the way.

This is why I am a proud member of the Blue Obelisk.

Monday, October 08, 2007

Taverna Workshop, Day 1 Update

The second part of the morning session featured a presentation by Sirisha Gollapudi, who spoke about mining biological graphs, such as protein-protein interaction networks and metabolic pathways: pattern detection for nodes with only one edge, cycles etc, using Taverna. An example data set she worked on is the Palsson human metabolism (doi:10.1073/pnas.0610772104); she mentioned that this metabolite data set contains cocaine :) Neil Chue Hong finished with an introduction on the OMII-UK, which is co-host of this meeting.

After lunch Mark Wilkinson introduced BioMoby, which we actually use in Wageningen already. I have tried to use jMoby to set up services based on the CDK, but failed so far. Will talk with Mark on that. Next was my presentation, and I spoke about CDK-Taverna, Bioclipse and some peculiarities of chemoinformatics workflows, like the importance of intermediate interaction, the need to visualize the data, and complex, information-rich data. Bioclipse is seeing an integration of BioMoby and of Taverna.

After the coffee break Marco Roos spoke about myExperiment and his work on text mining. I unfortunately missed this presentation, as I was meeting with people from the EBI who work on the MACiE database (see this blog item).

A discussion session afterwards introduced a few more Taverna uses, and encountered technical problems. Taverna2 is actually going to be quite interesting, with a data caching system between work processors, and a powerful scheme of annotation of processors, which will allow rating, finding local services, etc. More on that tomorrow. Dinner time now :)

Taverna Workshop, Hinxton, UK

I arrived at the EBI last night for the Taverna workshop, during which the design of Taverna2 is presented and workflow examples are discussed. Several 'colleagues' from Wageningen and the SARA computing center in Amsterdam are present, along with many other interesting people. This afternoon is my presentation.

Paul Fisher just presented his PhD work on using workflows to improve the throughput of QTL matching against pathway information and phenotype. One interesting note was their function in making biological informational studies more reproducible. He retrieves the versions of online databases explicitly in the workflow, so that they get stored in the workflow output.

Monday, October 01, 2007

How the blogosphere changes publishing

Peter is writing up a 1FTE grant proposal for someone to work on the question of how automatic agents and, more interestingly, the blogosphere are changing, no, improving, the dissemination of scientific literature. He wants our input. To make his work easy, I'll tag this item pmrgrantproposal and would ask everyone to do the same (Peter unfortunately did not suggest a tag himself). Here are pointers to blog items I wrote, related to the four themes Peter identifies.

The blogosphere oversees all major Open discussion

The blogosphere cares about data

Important bad science cannot hide
I do not feel much like pointing to bad scientific articles, but want to point to the enormous amount of literature being discussed in Chemical blogspace: 60 active chemical blogs discussed just over 1300 peer-reviewed papers from 213 scientific journals in less than 10 months. The top 5 journals have 133, 78, 68, 57 and 48 papers discussed in 22, 24, 10, 11 and 18 different blogs respectively. (Peter, if you need more in depth statistics, just let me know...)

Two examples where I discuss not-bad-at-all scientific literature:
Open Notebook Science
I regularly blog about the chemoinformatics research I do in my blog. A few examples from the last half year:

Update: after comments I have removed one link, which I need to confirm first.