
Saturday, December 30, 2006

Modern chemistry in the CDK: beyond the two-atom bond

Rich recently blogged about the limitations of the two-atom bond representation often used in chemoinformatics, triggered by the four ferrocene entries in PubChem. In reply to himself, Rich described FlexMol, an XML language that can describe bond systems that involve more than two atoms.

Obviously, the problem originates from chemists' lack of mathematical knowledge: current chemoinformatics depends heavily on graph theory, where each atom is a vertex and each bond an edge. This has the advantage that we can borrow all algorithms that work with graph representations, such as Dijkstra's algorithm to find the shortest path between two vertices. Or, in chemical language, an algorithm to calculate how many bonds apart two atoms are in a molecule.
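To make the "atoms are vertices, bonds are edges" idea concrete, here is a minimal sketch in plain Java that counts how many bonds apart two atoms are. It deliberately does not use the CDK: the class and method names are hypothetical, and because every bond counts as one step it uses a breadth-first search, which for unweighted graphs gives the same answer as Dijkstra's algorithm.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper, not part of the CDK: atoms are vertices, two-atom bonds are edges. */
class BondDistance {

    /** Builds an adjacency list for atomCount atoms from pairs of atom indices. */
    static List<List<Integer>> adjacency(int atomCount, int[][] bonds) {
        List<List<Integer>> neighbours = new ArrayList<>();
        for (int i = 0; i < atomCount; i++) {
            neighbours.add(new ArrayList<>());
        }
        for (int[] bond : bonds) {
            neighbours.get(bond[0]).add(bond[1]);
            neighbours.get(bond[1]).add(bond[0]);
        }
        return neighbours;
    }

    /** Returns the number of bonds on the shortest path, or -1 if the atoms are not connected. */
    static int bondsApart(List<List<Integer>> neighbours, int start, int end) {
        int[] distance = new int[neighbours.size()];
        Arrays.fill(distance, -1);
        distance[start] = 0;
        Deque<Integer> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            int atom = queue.remove();
            if (atom == end) {
                return distance[atom];
            }
            for (int next : neighbours.get(atom)) {
                if (distance[next] < 0) {
                    distance[next] = distance[atom] + 1;
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Six-membered carbon ring with implicit hydrogens, atoms numbered 0 to 5.
        int[][] ring = { {0, 1}, {1, 2}, {2, 3}, {3, 4}, {4, 5}, {0, 5} };
        List<List<Integer>> graph = adjacency(6, ring);
        System.out.println(bondsApart(graph, 0, 3));
    }
}
```

Note that this only works because every bond connects exactly two atoms; a bond spanning six atoms has no obvious place in such an adjacency list.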

When discussing FlexMol, Rich mentions the work by Dietz (DOI:10.1021/ci00027a001), and I would like to add the PhD thesis of S. Bauerschmidt (see DOI: 10.1021/ci9704423), done in Gasteiger's group.
Dropping this 'two-atom bond' representation in favor of something that better describes compounds like ferrocene, like the Dietz and Bauerschmidt approaches, has the unfortunate disadvantage of losing compatibility with graph theory algorithms. Nevertheless, in order to take chemoinformatics to the next level, we have to address these issues. But hope is not lost, and people are working on rewriting our toolkit of chemoinformatics algorithms to match such new representations.

CDK

I will postpone analyzing the CDK for compatibility with these more modern representations (look out for a CDK News article), and for now just describe how the CDK can be used for FlexMol/Dietz/Bauerschmidt representations. Consider the four examples Rich gives in his blog. Here are the CDK ways of doing the same.

For example, 1,3,5-cyclohexatriene:

public IMolecule makeCycloHexaTriene() {
    // 'builder' is assumed to be an IChemObjectBuilder field of the enclosing class
    IMolecule cyclohexatriene = builder.newMolecule();

    // six CH carbons
    IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
    IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
    IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
    IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
    IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
    IAtom atomC5 = builder.newAtom(Elements.CARBON);
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);

    // alternating single (2 electrons) and double (4 electrons) bonds
    IBond bondB0 = builder.newBond(atomC0, atomC1, 1.0);
    bondB0.setElectronCount(2);
    IBond bondB1 = builder.newBond(atomC1, atomC2, 2.0);
    bondB1.setElectronCount(4);
    IBond bondB2 = builder.newBond(atomC2, atomC3, 1.0);
    bondB2.setElectronCount(2);
    IBond bondB3 = builder.newBond(atomC3, atomC4, 2.0);
    bondB3.setElectronCount(4);
    IBond bondB4 = builder.newBond(atomC4, atomC5, 1.0);
    bondB4.setElectronCount(2);
    IBond bondB5 = builder.newBond(atomC0, atomC5, 2.0);
    bondB5.setElectronCount(4);

    cyclohexatriene.addAtom(atomC0); cyclohexatriene.addAtom(atomC1);
    cyclohexatriene.addAtom(atomC2); cyclohexatriene.addAtom(atomC3);
    cyclohexatriene.addAtom(atomC4); cyclohexatriene.addAtom(atomC5);

    cyclohexatriene.addBond(bondB0); cyclohexatriene.addBond(bondB1);
    cyclohexatriene.addBond(bondB2); cyclohexatriene.addBond(bondB3);
    cyclohexatriene.addBond(bondB4); cyclohexatriene.addBond(bondB5);

    return cyclohexatriene;
}


Summarizing, the key thing is to use the IBond.setElectronCount() method. The call is sort of redundant, as the CDK defaults to two electrons if not explicitly given. This compound is, of course, benzene, which we can also represent like this:

public IMolecule makeBenzene() {
    IMolecule benzene = builder.newMolecule();

    // six CH carbons
    IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
    IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
    IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
    IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
    IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
    IAtom atomC5 = builder.newAtom(Elements.CARBON);
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);

    // sigma framework: six two-atom, two-electron bonds
    IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
    IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
    IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
    IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
    IBond bondB4 = builder.newBond(atomC4, atomC5);
    bondB4.setElectronCount(2);
    IBond bondB5 = builder.newBond(atomC0, atomC5);
    bondB5.setElectronCount(2);

    // pi system: one six-atom, six-electron bond
    IBond bondingSystem = builder.newBond();
    bondingSystem.setElectronCount(6);
    bondingSystem.setAtoms(
        new IAtom[] { atomC0, atomC1, atomC2,
                      atomC3, atomC4, atomC5 }
    );

    benzene.addAtom(atomC0); benzene.addAtom(atomC1);
    benzene.addAtom(atomC2); benzene.addAtom(atomC3);
    benzene.addAtom(atomC4); benzene.addAtom(atomC5);

    benzene.addBond(bondB0); benzene.addBond(bondB1);
    benzene.addBond(bondB2); benzene.addBond(bondB3);
    benzene.addBond(bondB4); benzene.addBond(bondB5);
    benzene.addBond(bondingSystem);

    return benzene;
}


This version represents the delocalized aromatic pi-system as one IBond: one with 6 electrons and 6 associated atoms.
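The bookkeeping behind this representation can be checked without the CDK. The sketch below uses a hypothetical MultiAtomBond class (a stand-in, not the CDK's IBond) to build the same benzene bond list and confirm that six two-electron bonds plus one six-electron bond carry the same 18 ring electrons as the alternating single/double bond version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a bond that may span any number of atoms. */
class MultiAtomBond {
    final List<String> atomIds;
    final int electronCount;

    MultiAtomBond(int electronCount, String... atomIds) {
        this.electronCount = electronCount;
        this.atomIds = Arrays.asList(atomIds);
    }
}

class BenzeneSketch {

    /** Six two-atom sigma bonds plus one six-atom pi system. */
    static List<MultiAtomBond> benzeneBonds() {
        String[] ring = { "C0", "C1", "C2", "C3", "C4", "C5" };
        List<MultiAtomBond> bonds = new ArrayList<>();
        for (int i = 0; i < ring.length; i++) {
            bonds.add(new MultiAtomBond(2, ring[i], ring[(i + 1) % ring.length]));
        }
        bonds.add(new MultiAtomBond(6, ring));
        return bonds;
    }

    /** Sums the electrons over all bonds, whatever their number of atoms. */
    static int totalElectrons(List<MultiAtomBond> bonds) {
        int total = 0;
        for (MultiAtomBond bond : bonds) {
            total += bond.electronCount;
        }
        return total;
    }

    /** Counts the bonds that no longer fit the two-atom (graph edge) model. */
    static int multiCenterBondCount(List<MultiAtomBond> bonds) {
        int count = 0;
        for (MultiAtomBond bond : bonds) {
            if (bond.atomIds.size() > 2) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<MultiAtomBond> bonds = benzeneBonds();
        System.out.println("ring electrons: " + totalElectrons(bonds));
        System.out.println("multi-center bonds: " + multiCenterBondCount(bonds));
    }
}
```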

The cyclopentadienyl anion is represented similarly:

public IMolecule makeCycloPentadienylAnion() {
    IMolecule cp = builder.newMolecule();

    // five CH carbons
    IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
    IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
    IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
    IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
    IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);

    // sigma framework: five two-atom, two-electron bonds
    IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
    IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
    IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
    IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
    IBond bondB4 = builder.newBond(atomC4, atomC0);
    bondB4.setElectronCount(2);

    // pi system: one five-atom, six-electron bond
    IBond bondingSystem = builder.newBond();
    bondingSystem.setElectronCount(6);
    bondingSystem.setAtoms(
        new IAtom[] { atomC0, atomC1, atomC2, atomC3, atomC4 }
    );

    cp.addAtom(atomC0); cp.addAtom(atomC1);
    cp.addAtom(atomC2); cp.addAtom(atomC3);
    cp.addAtom(atomC4);

    cp.addBond(bondB0); cp.addBond(bondB1);
    cp.addBond(bondB2); cp.addBond(bondB3);
    cp.addBond(bondB4); cp.addBond(bondingSystem);

    return cp;
}


And the final step in this series is ferrocene:

public IMolecule makeFerrocene() {
    IMolecule ferrocene = builder.newMolecule();

    // first ring: C0-C4, second ring: C5-C9
    IAtom atomC0 = builder.newAtom(Elements.CARBON);
    atomC0.setID("C0"); atomC0.setHydrogenCount(1);
    IAtom atomC1 = builder.newAtom(Elements.CARBON);
    atomC1.setID("C1"); atomC1.setHydrogenCount(1);
    IAtom atomC2 = builder.newAtom(Elements.CARBON);
    atomC2.setID("C2"); atomC2.setHydrogenCount(1);
    IAtom atomC3 = builder.newAtom(Elements.CARBON);
    atomC3.setID("C3"); atomC3.setHydrogenCount(1);
    IAtom atomC4 = builder.newAtom(Elements.CARBON);
    atomC4.setID("C4"); atomC4.setHydrogenCount(1);
    IAtom atomC5 = builder.newAtom(Elements.CARBON);
    atomC5.setID("C5"); atomC5.setHydrogenCount(1);
    IAtom atomC6 = builder.newAtom(Elements.CARBON);
    atomC6.setID("C6"); atomC6.setHydrogenCount(1);
    IAtom atomC7 = builder.newAtom(Elements.CARBON);
    atomC7.setID("C7"); atomC7.setHydrogenCount(1);
    IAtom atomC8 = builder.newAtom(Elements.CARBON);
    atomC8.setID("C8"); atomC8.setHydrogenCount(1);
    IAtom atomC9 = builder.newAtom(Elements.CARBON);
    atomC9.setID("C9"); atomC9.setHydrogenCount(1);
    IAtom iron = builder.newAtom(Elements.IRON);
    iron.setID("Fe10"); iron.setHydrogenCount(0);

    // sigma framework of the first ring
    IBond bondB0 = builder.newBond(atomC0, atomC1);
    bondB0.setElectronCount(2);
    IBond bondB1 = builder.newBond(atomC1, atomC2);
    bondB1.setElectronCount(2);
    IBond bondB2 = builder.newBond(atomC2, atomC3);
    bondB2.setElectronCount(2);
    IBond bondB3 = builder.newBond(atomC3, atomC4);
    bondB3.setElectronCount(2);
    IBond bondB4 = builder.newBond(atomC4, atomC0);
    bondB4.setElectronCount(2);
    // sigma framework of the second ring
    IBond bondB5 = builder.newBond(atomC5, atomC6);
    bondB5.setElectronCount(2);
    IBond bondB6 = builder.newBond(atomC6, atomC7);
    bondB6.setElectronCount(2);
    IBond bondB7 = builder.newBond(atomC7, atomC8);
    bondB7.setElectronCount(2);
    IBond bondB8 = builder.newBond(atomC8, atomC9);
    bondB8.setElectronCount(2);
    IBond bondB9 = builder.newBond(atomC9, atomC5);
    bondB9.setElectronCount(2);

    // first ring plus iron
    IBond bondingSystem1 = builder.newBond();
    bondingSystem1.setElectronCount(6);
    bondingSystem1.setAtoms(
        new IAtom[] {
            atomC0, atomC1, atomC2, atomC3, atomC4, iron
        }
    );
    // second ring plus iron
    IBond bondingSystem2 = builder.newBond();
    bondingSystem2.setElectronCount(6);
    bondingSystem2.setAtoms(
        new IAtom[] {
            atomC5, atomC6, atomC7, atomC8, atomC9, iron
        }
    );
    // both rings plus iron
    IBond bondingSystem3 = builder.newBond();
    bondingSystem3.setElectronCount(6);
    bondingSystem3.setAtoms(
        new IAtom[] {
            atomC0, atomC1, atomC2, atomC3, atomC4,
            atomC5, atomC6, atomC7, atomC8, atomC9,
            iron
        }
    );

    ferrocene.addAtom(atomC0); ferrocene.addAtom(atomC1);
    ferrocene.addAtom(atomC2); ferrocene.addAtom(atomC3);
    ferrocene.addAtom(atomC4); ferrocene.addAtom(atomC5);
    ferrocene.addAtom(atomC6); ferrocene.addAtom(atomC7);
    ferrocene.addAtom(atomC8); ferrocene.addAtom(atomC9);
    ferrocene.addAtom(iron);

    ferrocene.addBond(bondB0); ferrocene.addBond(bondB1);
    ferrocene.addBond(bondB2); ferrocene.addBond(bondB3);
    ferrocene.addBond(bondB4);
    ferrocene.addBond(bondB5); ferrocene.addBond(bondB6);
    ferrocene.addBond(bondB7); ferrocene.addBond(bondB8);
    ferrocene.addBond(bondB9);
    ferrocene.addBond(bondingSystem1);
    ferrocene.addBond(bondingSystem2);
    ferrocene.addBond(bondingSystem3);

    return ferrocene;
}


Now, you will note that this approach does not exactly follow Rich's FlexMol examples: I skipped the atom pair concepts used in the FlexMol version of ferrocene. His example more closely follows what we are likely to draw, while the CDK code above more closely follows the molecular orbital concept. (I have to check to see how Dietz and Bauerschmidt did this.)

As said, the real trick is to have a chemoinformatics toolkit that can work with this representation, but I will save that for later. At least our algorithms to calculate the molecular mass should work ;)
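The remark about molecular mass holds because that calculation only looks at atoms and their hydrogen counts, never at bonds. A minimal sketch in plain Java, again not using the CDK: the class name, the lookup table and the rounded standard atomic weights are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of the CDK: mass from atoms and implicit hydrogens only. */
class FormulaMass {

    // Rounded standard atomic weights, limited to the elements needed here.
    static final Map<String, Double> ATOMIC_WEIGHT = new HashMap<>();
    static {
        ATOMIC_WEIGHT.put("H", 1.008);
        ATOMIC_WEIGHT.put("C", 12.011);
        ATOMIC_WEIGHT.put("Fe", 55.845);
    }

    /** Sums the weight of each heavy atom plus its implicit hydrogens. */
    static double mass(String[] symbols, int[] hydrogenCounts) {
        double total = 0.0;
        for (int i = 0; i < symbols.length; i++) {
            total += ATOMIC_WEIGHT.get(symbols[i]);
            total += hydrogenCounts[i] * ATOMIC_WEIGHT.get("H");
        }
        return total;
    }

    /** Ten CH carbons and one iron, mirroring the atoms of makeFerrocene(). */
    static double ferroceneMass() {
        String[] symbols = new String[11];
        int[] hydrogens = new int[11];
        for (int i = 0; i < 10; i++) {
            symbols[i] = "C";
            hydrogens[i] = 1;
        }
        symbols[10] = "Fe";
        hydrogens[10] = 0;
        return mass(symbols, hydrogens);
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT, "%.3f", ferroceneMass()));
    }
}
```

No bond, two-atom or otherwise, appears anywhere in the calculation, which is why it survives the change of representation.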

Thursday, December 21, 2006

Updated Chemical Blogspace Layout and Software

Last night I upgraded the software behind Chemical blogspace to the version available on Google Code, though I needed help from Euan to get paper titles picked up correctly for ACS journals. The number of working blogs is a bit down, now at 68, with on average 30 active blogs posting more than 100 blog items each day (see Zeitgeist). The new design looks quite nice compared to the old one:


Tuesday, December 19, 2006

Chemistry in HTML: Greasemonkey again

Here's a quick update on my blog about SMILES, CAS and InChI in blogs: Greasemonkey last Sunday. The original download was messed up :( You can download a new version at userscripts.org.

This new version also supports "chem:compound", for any chemical. For example:
  • isopropyl alcohol

Remember that it only works for properly marked up content, as described in Including SMILES, CML and InChI in blogs. The HTML source code of the above example looks like this (in RDFa):

<ul><li>
<span xmlns:chem="http://www.blueobelisk.org/chemistryblogs/"
class="chem:compound">isopropyl alcohol</span>
</li></ul>

The current script only adds search links to PubChem and Google, but the possibilities are endless, and potentially very powerful. Here are some future ideas.

A link to predict NMR spectra using NMRShiftDB.org:

Making a link to the NMRShiftDB.org website to predict 13C or 1H NMR from a SMILES, and InChI likely too, is easy, if the website provides a URL to do this. (I will discuss this with Stefan.)

A popup window with the 3D structure in Jmol:

This would involve some more work, but this is most certainly possible too, given that we actually have a website around which allows downloading 3D coordinates given a SMILES or InChI. While a simple approach would be to make a popup with Jmol that takes the URL to that 3D coordinate website, it could be extended using Ajax to query the 3D structure first, and depending on success, show Jmol or a message "Could not find 3D coordinates".

Summarize molecular details hidden in CML:

This is likely the most exciting possibility. I blogged about CMLRSS many times now (check the AVI, the article, etc), and combining these two technologies will take the semantic, chemistry internet to the next level. CMLRSS describes how CML can be embedded in blog items (e.g. Blogging chemistry on blogspot.com), but really works for any XHTML.

Consider this mockup: add CML content to your blog item, containing molecular properties, such as its NMR peaks, elemental analysis, etc. This will not show up in your blog item, so that the user is not bothered with implementation details. A userscript, however, will know about the CML content, as it has access to the whole content of the page. The visible text will mention the molecule for which the CML contains experimental or other details. Using the <span class="chem:compound"/> technology shown above, it is possible to link that compound to this CML bit (details to follow in this blog in January 2007). The userscript will then create, on the fly, a popup for the compound name in the visible text to show those experimental details.

How about that? Comments and other ideas are more than welcome!

Server side scripts:

Greasemonkey allows users to decide which scripts to run on a website, and which not. If you, as a blogger or XHTML editor, want to force a script like the above to run, that should be possible as well: Greasemonkey scripts are written in JavaScript, so including them on the server side should work. I might explore this option soon.

Sunday, December 17, 2006

SMILES, CAS and InChI in blogs: Greasemonkey

As a follow-up to my Including SMILES, CML and InChI in blogs blog of last week, I had a go at Greasemonkey. Some time ago already, Flags and Lollipops and Nodalpoint showed with two cool mashups (one Connotea/Postgenomic and one PubMed/Postgenomic) that user scripts are rather useful in science too. I can very much recommend the PubMed/Postgenomic mashup, as PubMed has several organic chemistry journals indexed too!

So, how does this relate to my blog of last week? Well, would it not be nice if, when your blog uses the markup suggested in that blog, you automatically got links to PubChem and Google? That is now possible with a small GPL-ed Greasemonkey script called blogchemistry.user.js.

The Greasemonkey plugin requires Firefox to be installed. Once it is ready, install the script by clicking the link above, and Greasemonkey will ask you if you want to install the script. Afterwards, check the output for this RDFa markup content:
  • a SMILES: CCO
  • a CAS registry number: 50-00-0
  • and an InChI: InChI=1/CH4/h1H4

It should look like the output for this blog item:

Note the superscript PubChem and Google links.

Update: there was something wrong with the download, which I just fixed (19th, at 8:45 CET). Please download once more to get it working properly.

Counting constitutional isomers from the molecular formula

We all know the combinatorial explosion when calculating the number of possible constitutional isomers (see wp:structural isomorphism) of a certain molecular formula. For example, C2H6 has only one constitutional isomer (ethane, InChI=1/C2H6/c1-2/h1-2H3), and C4H10 has only two. Breaking symmetry, in particular, by replacing one carbon by another element or replacing a single by a double bond, increases the number sharply. For example, C7H16 has only nine constitutional isomers, while replacing two single bonds by two double bonds, creating C7H10, increases this number to 499! Then, replacing one carbon by an oxygen in the last formula adds another few, totaling 747 isomers.

Now, C8H8NBr has at least 649 thousand constitutional isomers, and I am quite interested in being able to know the number of isomers beforehand, without having to generate the structures themselves (for example, using CDK's GENMDeterministicGenerator). InChI=1/C8H8BrN/c9-7-1-2-8-6(5-7)3-4-10-8/h1-2,5,10H,3-4H2 is one of the isomers.

So, my question: is anyone aware of free code (in order of preference: 1. LGPL, 2. BSD/MIT, 3. opensource, 4. free) to calculate or estimate the number of constitutional isomers for a certain molecular formula? An estimate would already be nice. Ideally, I would implement this bit of code into the CDK, but otherwise, just knowing the number of isomers for C8H8NBr would be nice :)

Additionally, any relevant, recent literature recommendations are most welcome. I am aware of the use of polynomials, but the literature I have seen so far just focuses on molecules of a certain architecture, and is not able to come up with a guess based on the molecular formula alone.

Tuesday, December 12, 2006

Molecular Chemometrics

I just found out that a review article that I wrote earlier this year got printed: Molecular Chemometrics (DOI: 10.1080/10408340600969601), with my personal view on the interplay between chemoinformatics and chemometrics. The review discusses interesting developments in the last five years, and was fun writing (reading too, I think :). It has four major topics:
  • molecular representation (with 'molecular descriptors' and 'beyond the molecule')
  • chemical space, similarity and diversity
  • activity and property modeling (with 'dimension reduction' and 'model validation')
  • library searching, which mostly focuses on semantic web developments

Comments most welcome; just leave them below this blog item, or blog about the article yourself :)

Sunday, December 10, 2006

Including SMILES, CML and InChI in blogs

The blogs ChemBark and KinasePro have been discussing the use of SMILES, CML and InChI in Chemical Blogspace (with 70 chemistry blogs now!). Chemists seem to prefer SMILES over InChI, while there is interest in moving towards CML too. Peter commented.

Any incorporation of content other than images and free text requires some HTML knowledge, but this can be rather limited. It is up to us chemoinformaticians to write good documentation on how to do things; so here is a first go.

Including CML in blogs and other RSS feeds

I blogged about including CML in blogs last February, and can generally refer to this article published last year: Chemical markup, XML, and the World Wide Web. 5. Applications of chemical metadata in RSS aggregators (PMID:15032525, DOI:10.1021/ci034244p). Basically, it just comes down to putting the CML code into the HTML version of your blog content, though I appreciate the need for plugins.

Including SMILES, CAS and InChI in blogs

Including SMILES is much easier as it is plain text, and has the advantage over InChI that it is much more readable. Chris wondered in the KinasePro blog how to tag SMILES, while Paul did the same on ChemBark about CAS numbers.

Now, users of PostGenomic.com know how to add markup to their blogs to get PostGenomic to index discussed literature, websites and conferences. Something similar is easily done for chemistry things too, as I showed in Hacking InChI support into postgenomic.com (which was put on lower priority because of finishing my PhD). PostGenomic.com basically uses microformats, which I blogged about just a few days ago in Chemo::Blogs #2, where I suggested the use of <span class="chemicalcompound">aspirin</span>.

And this is the way SMILES, CAS numbers and InChIs can be tagged on blogs. The <span> element is HTML code to indicate a bit of similar content in HTML, and can, among many other things, be formatted differently than other text. However, it can also be used to add semantics in a relatively cheap, but accepted, way. Microformats are formalized just by use, so whatever we, as chemistry bloggers, use will become the de facto standard. Here are my suggestions:
  • for SMILES: <span class="smiles">CCO</span>
  • for CAS registry numbers: <span class="casnumber">50-00-0</span>
  • for InChI: <span class="inchi">InChI=1/CH4/h1H4</span>
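To show how little a consuming tool needs in order to pick up these three classes, here is a minimal sketch in plain Java. The class name is hypothetical, and the regular expression is an assumption that only matches the exact markup suggested above (one class, double quotes, no nested tags); a real aggregator would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls tagged chemistry out of a blog item's HTML. */
class ChemSpanExtractor {

    // Matches <span class="smiles|casnumber|inchi">...</span> without nested markup.
    private static final Pattern CHEM_SPAN = Pattern.compile(
        "<span\\s+class=\"(smiles|casnumber|inchi)\"\\s*>([^<]*)</span>");

    /** Returns one "class=value" entry per tagged span, in document order. */
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher matcher = CHEM_SPAN.matcher(html);
        while (matcher.find()) {
            found.add(matcher.group(1) + "=" + matcher.group(2).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Ethanol is <span class=\"smiles\">CCO</span>, "
            + "and methane is <span class=\"inchi\">InChI=1/CH4/h1H4</span>.</p>";
        for (String entry : extract(html)) {
            System.out.println(entry);
        }
    }
}
```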

The RDFa alternative

The future, however, might use RDFa over microformats, so here are the RDFa equivalents:
  • for SMILES: <span class="chem:smiles">CCO</span>
  • for CAS registry numbers: <span class="chem:casnumber">50-00-0</span>
  • for InChI: <span class="chem:inchi">InChI=1/CH4/h1H4</span>

which requires you to register the namespace xmlns:chem="http://www.blueobelisk.org/chemistryblogs/" somewhere though. Formally, the URN for this namespace needs to be formalized; Peter, would the Blue Obelisk be the platform to do this? BTW, this is more advanced, and currently does not have practical advantages over the use of microformats.

Saturday, December 09, 2006

H-index in chemoinformatics

Peter blogged about the h-index, which is a measure of one's scientific impact. He used Google Scholar, but I do not feel that that database is clean enough. I believe a better source would be the ISI Web-of-Science.

Therefore, I composed a list of h-indices of my own, ordered by value. The choice of authors is biased to the Blue Obelisk and the CDK, has some personal touches (Buydens and Wehrens are my PhD supervisors) and some names that put the rest into perspective:


query            h-index   #pubs
BENDER A              41     222
WILLETT P             37     302
GASTEIGER J           33     212
RZEPA HS              25     236
BUYDENS LMC           18     108
GLEN RC               18      78
WEHRENS R             11      47
MURRAY-RUST P*         9      41
STEINBECK C            9      29
FECHNER U              6      12
GUHA R                 4      24
WILLIGHAGEN E*         4       9
WEGNER JK              3       9
LUTTMANN E             2       4


Of course, there are many comments on this. Like any measurement, take into account the error. Sources of error include, but are not limited to, ambiguity in the query. The most notable example of this, I think, is Andreas Bender; I don't think he has been that successful :) Also, Rajarshi Guha's h-index was reported as 6, but the list included two articles from the 70s and 80s, which I do not think are actually his.

Feel free to suggest other names, query corrections, tips, and I will add or work on those too.
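For readers who want to recompute such numbers from their own query results: the h-index is the largest h for which an author has h papers with at least h citations each. A minimal sketch in plain Java follows; the class name and the citation counts in main are made up for illustration and do not belong to anyone in the table.

```java
import java.util.Arrays;

/** Hypothetical helper: h-index from a list of per-paper citation counts. */
class HIndex {

    /** Returns the largest h such that h papers have at least h citations each. */
    static int hIndex(int[] citations) {
        int[] sorted = citations.clone();
        Arrays.sort(sorted); // ascending, so walk from the end for the most cited papers
        int h = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            int rank = sorted.length - i; // 1 for the most cited paper
            if (sorted[i] >= rank) {
                h = rank;
            } else {
                break;
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Made-up citation counts for five papers.
        int[] citations = { 10, 8, 5, 4, 3 };
        System.out.println(hIndex(citations));
    }
}
```

This also shows why a misattributed, highly cited paper from a namesake can raise the reported value: every extra paper above the current rank threshold pushes h up.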

Wednesday, December 06, 2006

Chemo::Blogs #2

Because no one picked up my Chemo::Blogs suggestion, I will now officially claim the blog series title. However, unlike the original Bio::Blogs series, I will not summarize interesting blogs, but just spam you with websites I recently marked as toblog on del.icio.us.

Semantics and Text Mining

Evan Prodromou wrote about RDFa vs microformats. The latter are commonly used in enhancing blog semantics, and are for example used by PostGenomic.com. While RDFa is more explicit, e.g. by using namespaced markup, we have to wait until XHTML2 to see it working. I do not think chemists are using tags a lot yet, but let me propose the following microformats: <span class="inchi">1/CH4/h1H4</span> and <span class="chemicalcompound">methane</span>. Standard JavaScripts and CSS scripts will then do the rest. (Think: addressing newlines, auto googling-for-inchi, etc).

The reason why using microformats is interesting is text mining, of various kinds. Whether it is setting up a molecule-article link database, or finding hot molecules in blogspace, adding semantics will help tools like OSCAR3 to mine chemistry. Some time ago OTMI was proposed by Nature, and they have now set up a dedicated web site to explain their view on text mining. Zack Rosen has a good idea why RDF Semantic web research isn't working.

Blogspace

There are a few new chemistry blogs I want to mention (and already added to Chemical blogspace): ChemBark, lirico which has an interesting chemoinformatics section, and The Curious Wavefunction. Worth reading indeed.

Pierre's YOKOFAKUN deserves a paragraph of his own. He recently blogged about bio2rdf which provides an RDF interface to biochemical knowledge via Life Science Identifiers (LSID), OBOEdit which is a Java based ontology editor, and Amadea which is a Taverna and KNIME like tool for setting up UNIX pipes.

Online EMBL Symposium

A few EMBL PhD students are having the First Online EMBL PhD Symposium (catchy name, or ... ;) Anyway, discussions are held on IRC, and it has a rather interesting Web2.0 session. All media is available on the website but requires registration right now. After the conference it will become open access to all. Jean-Claude contributed The UsefulChem Project: Open Source Chemistry Research using Blogs and Wikis to the Participants' Contributions section, and I did have a poster on Distributing molecular information over the Internet, discussing CMLRSS, blog aggregators, CML and other things. The IRC session was logged and is available here.

Literature

Finally, I want to mention three recent articles. First one is a recent write up by Bourne and Friedberg about Ten Simple Rules for Selecting a Postdoctoral Position (DOI: 10.1371/journal.pcbi.0020121). With the end of my current postdoc position nearing, rather useful reading. Some time ago I blogged about a New open access journal Source Code for Biology and Medicine, and the journal is now up and running. Details can be read in the first editorial (DOI: 10.1186/1751-0473-1-1). The third article I would like to mention is Scientific Software Development Is Not an Oxymoron by Baxter (DOI: 10.1371/journal.pcbi.0020087), though I do not think it has new insights.

OK, this was a rather lengthy write up, but really needed to clean up my toblog section :)

The power of big numbers

Contributions to open data do not have to be large, as long as many people are doing it. The Wikipedia is a good example, and PubChem accepts contributions of small databases too (I think). The result can still be large and rather useful, even scientifically.

The latter was recently written down in the paper Internet-based monitoring of influenza-like illness (ILI) in the general population of the Netherlands during the 2003–2004 influenza season by Marquet et al. (DOI: 1471-2458/6/242). The data was provided by Internet users via The Great Influenza Survey website. The article states that the sum of all those small contributions (anonymous website users are asked to fill out a weekly form), yields reliable data. The user is rewarded by colorful pictures, such as:

If all chemists and biochemists added information about, or properties of, one molecule or metabolite to the Wikipedia each month, one or more commercial database companies would soon have to change their business model. Oh, you can already start doing this here.