Thursday, April 30, 2009

New FriendFeed layout, but there is a fix...

FriendFeed is the missing link between social [bookmarking|news|...] and IRC (#cdk); I quite like it. Anyway, as of today, they have a new layout, and that I do not like. No more icons for feed types, and big avatar photos. Really, I *know* what my fellow bloggers look like (even met many of them in London last year). The rest of the layout is a bit too colourful for my taste.

Fortunately, Neil posted three very useful GreaseMonkey scripts to clean things up (I use such scripts to link science databases and resources anyway; see Christmas Presents and DOI:10.1186/1471-2105-8-487): FriendFeed Service Icons, Cleaner FriendFeed, and Remove avatars from Friendfeed beta. The last may require changing the script's target websites so that they no longer point to the beta server, but to the real thing. However, by the time you read this, the script may already be updated.

After installing these, my FriendFeed page looks better again:

Wednesday, April 29, 2009

Things to do...

I know I am lagging behind on things... been busy and did not have time to reply to everyone yet. Some TOREPLYs go back more than a month. Sorry about that!

Some of the things on my TODO list (in random order): Bioclipse2 bug fixes, CDK patch reviewing (e.g. vflib), look at the Jmol-CDK bridge and bring it into action in Bioclipse2, RDF for PubChem, convert the Woordenboek Organische Chemie data into RDF, RDF for NMRShiftDB, align with ChemAxiom, publish about the Bioclipse2 RDF feature, finish the MetWare paper, write a metabolomics feature for Bioclipse2, finish the pKa prediction in the CDK, write 100% coverage CDK 2 CML 2 CDK, implement atom parity stereochemistry from SMILES and/or MDL molfiles, use supervised SOMs in QSAR, use supervised SOMs in proteochemometrics, study variable influence on supervised SOM models, make my thesis Open in the Radboud University library repository (excluding the papers I no longer have copyright on), update the QSAR and algorithm ontologies in OWL, create a web page with live ONS solubility RDF, create an ONS solubility Bioclipse2 feature, study the CDK fingerprint performance compared to the new PubChem fingerprint, make Chemical blogspace aware of the ChemSpider widget, interest people in an unconference in Stockholm or Uppsala, move house, learn Swedish, get a driver's license, implement a more memory-efficient implementation of the CDK interfaces, promote XMPP services which are better than SOAP, write more papers, work on CMLSpect for metabolomics, finish the CDK book, finally get a grant application approved, read up on the literature and summarize it in the blog, port the Jmol UFF force field code to the CDK, analyze atom typing in the CDK against PubChem and StarLite, compile strigi-chemistry against KDE 4.2.2, finish homepage, ... you know, the regular list of things to do.

If you happen to be a master's student interested in doing an internship/practical here in Uppsala (unpaid, but you will learn so much), just email me.

Friday, April 24, 2009

CDK Workshop 2009 #3

Last of my writing on the CDK Workshop. It was great fun meeting all the CDK developers and users, and thanx to everyone for all that they contributed, in particular during the unconference part! Yesterday, I had a travel day, and slept 12 hours in one go last nite. This leaves me with a long list of follow-up emails, CDK patches and many other things to catch up with. But it was more than worth it.

In the next months, we will see which conversations during the workshop will lead to fruitful collaborations and new CDK contributions. I already have a patch around for @cdk.threadsafe and @cdk.threadnonsafe in reply to the Threading session at the unconference, which I'll ask Rajarshi to review.

Wednesday, April 22, 2009

My CDK Workshop 2009 Course Material #2

I wrote about my course material, and now complement that with the (three) slides:
CDK Workshop 2009 Slides, by Egon Willighagen: my (three) slides for CDK Workshop 2009.

CDK Workshop 2009 #2

The second CDK Workshop day started off with application presentations, which are well covered by Chris' blog posts. You can also find coverage on Twitter by Jim and Nico.

In the afternoon we had a more development-oriented unconference. It was a bit exciting for me and Chris, as previous such sessions at CDK workshops involved 5 to 10 participants instead of some 20+ now, but things nicely self-organized into 5 sessions.

Tuesday, April 21, 2009

CDK Workshop 2009 #1

Coverage on Planet CDK, Twitter #cdkws2009, FriendFeed's CDK room, and on this Wiki page. Chris blogged too.

Monday, April 20, 2009

My CDK Workshop 2009 Course Material

CDK Workshop 2009 Intro Course Material, by Egon Willighagen: course material for my part of the CDK Workshop 2009.

Friday, April 17, 2009

Downloading Domoic Acid from PubChem

The identity of domoic acid has been under discussion (see here, here and here). (And I very much like the ChemSpider service that makes it easy to copy data from ChemSpider into Wikipedia ChemBoxes; cheers!)

Now, my practical in next week's CDK Workshop will use Groovy (please install it on your laptop!), and I am hacking up example scripts for the course material, and came up with this script to download the structure of domoic acid from PubChem (CID:5282253):
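The Groovy script itself is not included in this text. As a stand-in, here is a minimal sketch in Java (not Groovy, and not the original script), assuming PubChem's current PUG REST interface and Java 11 or later. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the CDK or of the original course material.
class PubChemDownload {

    // Builds the PUG REST URL that returns the SD file for one compound ID.
    static String sdfUrl(int cid) {
        return "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + cid + "/SDF";
    }

    // Fetches the SD file as text; fails loudly on any non-200 response.
    static String download(int cid) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(sdfUrl(cid))).GET().build();
        HttpResponse<String> response =
            client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("PubChem returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // CID 5282253 is the domoic acid entry mentioned above.
        System.out.println(download(5282253));
    }
}
```

The returned text could then be handed to an SD file reader; that step is left out here.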

CDK 1.2.1 Released

I just released CDK 1.2.1 (aka The CDK Workshop 2009 Release), which is now available for download from SourceForge. The source can be found in our Git repository. The changes since 1.2.0 are mostly bug fixing, new unit tests, and minor clean up here and there:
Fixed bug 2714283, which properly throws an exception when rings are not closed properly. If a ring is not closed with the appropriate ring number, InvalidSmilesException is thrown. Matches Daylight behavior
Fixed bug 2729120 and added unit test
Updated comment to fix bug 2768643.
Partial fix for bug 2719237. Made getBondOrderSum static, added unit test for it
Typo: proteinl -> protein
Made class public, to unbreak adding it to the build/*.javafiles
Partially fixed SMARTS matching for R0. Updated target molecule initialization to explicitly indicate atoms not in a ring and also updated RingMembership atom to do an explicit check when R0 is specified. Partially fixes bug 2587204
Fixed dubious equality test. A private method was checking Double objects via reference. Worked fine when they were null. Fails when we need to compare by value. Code is updated to take it into account. Added unit test (and made the method protected so that it can be tested)
Added test method annotation. Completes coverage for data module
Refactored ChiIndexUtils to make it package private. Cleans up public API, since it is only used by chi descriptor code. Updated all dependent classes. Moved test code (which needs to be filled in!) as well
Code cleanup of ChiIndexUtils. Converted to 1.5 idioms
Clean up of PathTools and added test method annotation, so that core is completely covered
Fixed the previous commit to edit the cdk.keyword line, not the cdk.module line
More consistent keywords used
Added a test to ensure that Integer objects are compared by value rather than reference
Added a test case to check that atom container diffs are correct when using deserialized objects
Fixed IntegerDifference so that it actually checks the integer value rather than references of the Integer object. Fixes the problem whereby an object serialized to disk and then deserialized does not match the original object (i.e., non empty diff string)
Applied patch #2675819 (Stefan Kuhn): Patch to add a removeReaction to reactionSet
Use interface instead of implementation
Removed an unused import
Use IAtomContainer instead of IMolecule, as the actual matching is using IAtomContainers already (fixes #2686249)
Fixed a ClassCastException (fixes #2685134)
Added source attrib to fix building the Ubuntu .deb
Fixed Help build system: use doclet jars in develjar/; updated for new src folder src/main; removed very outdated use of rt.jar
Removed libdepends include for test-ioformats, which does not actually have libdepends
Updated so that if a target atom has no symbol (such as pseudo atoms) the match returns false (rather than an NPE)
Fixed proper handling of #n SMARTS querys
Added test case for bug 2686473
Added note on Ant 1.7.1 required
Fixed a NPE source: 'null == 2' causes an exception, so first test for nullness
Fixed copyright notice for 2009
Fixed duplicate storage of layout templates, which only belong in the sdg module, not extra module too
Merge branch 'local1.2' of ../../git-svn/cdk
Thanx to all who reported bugs!
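Several of the fixes above (the Double equality test, IntegerDifference, and the 'null == 2' NPE) come down to the same Java pitfall: boxed numbers compared by reference, or unboxed while null. A minimal sketch of the pattern follows; this is an illustration, not the actual CDK code, and the class and method names are made up.

```java
import java.util.Objects;

// Hypothetical illustration class, not taken from the CDK sources.
class BoxedCompare {

    // Reference comparison: true only when both variables point to the same object.
    // Two equal values stored in different objects (say, after deserialization)
    // are not guaranteed to pass this check.
    static boolean sameReference(Integer a, Integer b) {
        return a == b;
    }

    // Value comparison that also tolerates null on either side.
    static boolean sameValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    // Comparing a boxed value with a primitive unboxes it first, which throws a
    // NullPointerException for null, so test for null before comparing.
    static boolean isTwo(Integer count) {
        return count != null && count == 2;
    }

    public static void main(String[] args) {
        Integer original = Integer.valueOf(1000);
        Integer copy = Integer.valueOf(1000); // stands in for a deserialized copy
        System.out.println("same reference: " + sameReference(original, copy));
        System.out.println("same value:     " + sameValue(original, copy));
        System.out.println("null is two:    " + isTwo(null));
    }
}
```

The same reasoning applies to Double: Objects.equals (or equals after a null check) compares by value, while == on two boxed objects does not.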

Wednesday, April 15, 2009

Bioclipse2 Scripting #3: XLogP calculation using an XMPP CDK cloud service

In preparation for the CDK workshop next week, here is a small Bioclipse2 script to calculate the XLogP value for a given SMILES, using a CDK-based XMPP service:

Multiple inheritance for content types?

Bioclipse is an environment for handling and processing life sciences data. This data is present in files with a wide variety of formats, each of which can contain a particular data type. For example, we can have a single molecule in an MDL molfile and in CML.

The latter is particularly interesting, as I do not know how to work that out... Firstly, I want the CML (Single Molecule) content type to extend the CML content type, so that a validating CML editor can open it with the proper schema, but at the same time I would like it to extend a content type representing a Single Molecule. Hence, the multiple inheritance.

This is what the plugin.xml currently looks like:

name="CML (Single 2D Molecule)"
<describer class="net.bioclipse.cml.contenttypes.CmlFileDescriber">

Very clearly, a single base-type. Is there any option of multiple inheritance?

Monday, April 13, 2009

Rednael, CDK Git for Rajarshi's patches, PubChem SDF

Short blog item about some CDK Git updates. Could not get to sleep, so might as well spend that time on CDK hacking, no? The reason why I actually could not catch sleep was the news that PubChem SD files are not regular MDL SD files, but use custom extensions, for example, for dative bonds (see this PDF). This surely explains the weird things I have seen, but, unfortunately, the big SDF button on PubChem does not warn about that. Anyway, thanx to Wolfgang for informing me about that customization!

So, instead I hacked a bit on the CDK, which was about time. The last two weeks have been really busy with finding a new house (which we did), and writing two big grant applications (about done). Finally time for cleaning up my TOREPLY list on Gmail. I picked the request of Rajarshi to put some of his patches online; four of them are now ready for review: fp2d, pcore, pubchemfp and cleanpt. These are really interesting patches!

That brings me to the last thing for today: Rednael. Leander (a nickname already reserved, so reverse used) is an IRC bot for the #cdk channel which reports commits to our main Git repository. Back in the old SVN days (time goes so fast :), we had the CIA (Langley?) use their equipment to monitor SVN commits, and report those online and on IRC, but Git is too advanced for them, apparently. So, I wrote my own little bot to do it (see earlier link to GitHub). It can monitor multiple channels and report about multiple RSS feeds per channel. Thus, it is actually not restricted to Git commits alone.
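The bookkeeping behind "multiple channels, multiple feeds per channel" can be sketched roughly as below. This is not Rednael's actual code: the class, the method names and the use of item titles as identifiers are all assumptions, and the RSS fetching and IRC parts are left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a feed-to-channel reporter; not the real bot.
class FeedWatcher {
    // channel name -> feeds reported in that channel (insertion order kept)
    private final Map<String, List<String>> feedsByChannel = new LinkedHashMap<>();
    // feed URL -> item titles that were already reported
    private final Map<String, Set<String>> seenByFeed = new HashMap<>();

    // Registers a feed for a channel; one channel can watch many feeds.
    void watch(String channel, String feedUrl) {
        feedsByChannel.computeIfAbsent(channel, c -> new ArrayList<>()).add(feedUrl);
    }

    // Takes the item titles currently in a feed and returns one message per
    // new item and per channel that watches this feed.
    List<String> update(String feedUrl, List<String> itemTitles) {
        Set<String> seen = seenByFeed.computeIfAbsent(feedUrl, f -> new HashSet<>());
        List<String> messages = new ArrayList<>();
        for (String title : itemTitles) {
            if (!seen.add(title)) {
                continue; // already reported earlier
            }
            for (Map.Entry<String, List<String>> entry : feedsByChannel.entrySet()) {
                if (entry.getValue().contains(feedUrl)) {
                    messages.add(entry.getKey() + ": " + title);
                }
            }
        }
        return messages;
    }
}
```

Because the watcher only sees feed URLs and item titles, nothing in it is specific to Git commits.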

Friday, April 10, 2009

Groovy CDK and the Keyword List

Today I have been hacking a bit more on the CDK material for the CDK workshop (see CDK - The Documentation). Below are two previews, one with a LaTeX-ified keyword list (here as HTML):

CDK Keyword List

And here about Groovy, which indeed is groovy:
Groovy CDK

Wednesday, April 08, 2009

"Open Knowledge: Reproducibility in Cheminformatics with Open Data, Open Source and Open Standards"

I have submitted today the abstract of my talk at the GDCh-Wissenschaftsforum Chemie 2009 in Frankfurt in August as part of the Open Notebook Science/Open Drug Discovery session:
    "Open Knowledge: Reproducibility in Cheminformatics with Open Data,
    Open Source and Open Standards"

    The Open paradigms in science have been met with strong criticism.
    Nevertheless, support and use of Open models among scientists is
    growing. While the Open model is certainly only one approach to doing
    science, it has a few aspects that make propagation of knowledge more
    transparent. Indeed, Open Data, Open Source and Open Standards
    (ODOSOS) make it easier to reproduce knowledge and promote peer
    review. Various ODOSOS projects will be introduced which improve
    reproducibility in cheminformatics, the underlying science of
    exchanging chemical knowledge. Recent contributions of the Chemistry
    Development Kit, Bioclipse, chemical ontologies and others will be
    discussed that add to the repertoire of Open Cheminformatics, and how
    these contribute to Open Knowledge.
The exact details I do not know yet, and likely not before the weekend before the meeting :) But this blog gives a good impression of what you can expect.

Tuesday, April 07, 2009

Sunday, April 05, 2009

CDK - The Documentation

In preparation for the CDK workshop later this month, I am writing up my material for my kick-off presentation of the workshop. So, I better make it good. Using LaTeX at least overcomes my laziness which always made Word documents look stupid. Even default LaTeX looks good:

Clearly, any such documentation quickly becomes outdated, in particular when source code fragments are involved. Yes, CDK 1.2 is API stable, but only for the core classes. Moreover, I hope that the documentation will survive CDK 1.4 or 2.0 or whatever the next stable version is.

Therefore, I need the source code fragments to be compilable. R has the magnificent Sweave, and I have wanted something similar for a long time. While I do not have something that powerful yet, at least my current setup allows me to have code that both compiles and embeds in the LaTeX. The system allows me to write both Java application code and BeanShell scripts. No clue yet what I will use in the workshop, maybe even both. Like Sweave, it even saves output, and I can include that in the LaTeX source too. The code fragments can either go in as a verbatim section, or as a listing, depending on what I find more appropriate.
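To give an idea of how such a setup can work, here is a minimal sketch in Java. It is not my actual system: the marker comments, the class and the method names are assumptions made for this example. The idea is that the fragment lives in a source file that is compiled as usual, and a small tool copies the marked lines into a verbatim block for the LaTeX source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical snippet extractor; the marker comments are invented for this sketch.
class SnippetToLatex {
    static final String START = "// snippet:start";
    static final String END = "// snippet:end";

    // Collects the lines between the start and end markers of a source file.
    static List<String> extract(String source) {
        List<String> snippet = new ArrayList<>();
        boolean inside = false;
        for (String line : source.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equals(START)) {
                inside = true;
            } else if (trimmed.equals(END)) {
                inside = false;
            } else if (inside) {
                snippet.add(line);
            }
        }
        return snippet;
    }

    // Wraps the lines in a LaTeX verbatim environment. A snippet that itself
    // contains the text \end{verbatim} would break this; the sketch ignores that.
    static String toVerbatim(List<String> lines) {
        StringBuilder latex = new StringBuilder("\\begin{verbatim}\n");
        for (String line : lines) {
            latex.append(line).append('\n');
        }
        return latex.append("\\end{verbatim}\n").toString();
    }
}
```

Swapping the verbatim environment for a listing environment would only change the two wrapper lines in toVerbatim.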