Saturday, May 28, 2011

CDK 1.2.10: the changes, the authors, and the reviewers

This may very well be the last bug fix release for the 1.2.x series. Or, at least that is what I am aiming for. This release is particularly aimed at (and comes from) Debian: it fixes the puredist Ant target, adds a new atom type (for Te), and includes a fix in a molecular descriptor class.
  • Copy data files into the right folder of the puredist bdb4e0e
  • Added a test to demonstrate the ClassPathException in bug #3305581 7947f08
  • Test if the descriptor results are the same for two implementations (unfortunately, the nonotify IMolecule extends the data IMolecule, so it does not catch bug #3305581) 7a795a5
  • Factored out a method to test whether two descriptor calculations give the same results 2945d44
  • Parameterized method to create water to allow alternative implementations 9b99f11
  • Updated code to avoid dependency on specific implementation of molecule object. Now use IAtomContainer aee368a
  • Added detection of a Te atom type, found in ChEMBL 2f74024
  • Added unit tests for Te.3 atom type detection 1b9afc9

The Authors
6  Egon Willighagen
1  Onkar Shinde
1  Rajarshi Guha

The Reviewers
5  Rajarshi Guha
2  Jonathan Alvarsson 
1  Egon Willighagen 

(The Git commands used are given here)

Wednesday, May 25, 2011

My slides for this afternoon's Bioclipse-OpenTox demo

If you're quick, you might still be able to sign up for this afternoon's Bioclipse-OpenTox demo. Because of unfortunate and unforeseen circumstances, I will not be able to talk about Bioclipse-OpenTox scripting myself, and have asked Ola to cover that, on top of his own (cool) part.

The demo software can be downloaded here, and the demo scripts by clicking the Download button on this page.

Let me know how the demo was!

Monday, May 23, 2011

Online Seminar demoing Bioclipse-OpenTox interoperability using RDF technologies

Update: due to personal circumstances, very likely only Ola will give the demo :(

This Wednesday, Ola and I will demo our work with Nina on interoperability between Bioclipse and OpenTox:
    OpenTox recently initiated a series of online seminars and interactive tutorials on topics related to predictive toxicology using a virtual conferencing system supporting desktop sharing and voice. The tutorials are part of the EU FP7 project OpenTox, which aims at developing an interoperable open source predictive toxicology framework which may be used as an enabling platform for the creation of predictive toxicology applications.

    The next tutorial will be held Wednesday, 25 May, at 15:00 CEST, and will describe the approach for integrating OpenTox in Bioclipse, complemented with hands-on tutorials showing how to take advantage of OpenTox functionality from within Bioclipse via the user interface and scripts.

    The intended audience is people interested in learning how to use OpenTox from within Bioclipse, and people interested in how the integration was technically implemented. A bleeding-edge release of Bioclipse will be made available for the participants before the tutorial.

    Participation in these online events involves no registration fee. To register for the seminar, visit

    If you have already registered for a previous OpenTox tutorial this spring you don't need to register again.

Sunday, May 22, 2011

CDK 1.3.9: the changes, the authors, and the reviewers

Hot on the heels of the stable update 1.2.9, here is a new developers release. The 1.4.0 release is still waiting for the basic (molecule) rendering of the CDK-JChemPaint patch to be submitted, but I think I will just submit it and see what happens. If rejected, I will release 1.4.0 without it. Mind you, I would rather get that renderbasic patch into shape (fix some regressions, add missing docs and unit tests where possible, etc.), but there is too little movement in this area.

Like many before it, this release extends atom type perception to new atom types, making CDK algorithms applicable to more types of chemistry, and also includes fixes for perception when certain bits of information are missing from the input. Otherwise, it includes an improvement for the SMARTS query tool, a new AtomContainerManipulator helper method extractSubstructure(IAtomContainer atomContainer, int... atomIndices), aromatic bond support for the HINReader, fixes in the PDB atom type table and improved PDB file IO, a faster, aromaticity-independent fingerprinter, and a set of general bug fixes and performance improvements.
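To give an idea of what a substructure extraction helper like this does, here is a minimal sketch in Python rather than Java. It is not the CDK implementation: the Molecule class, the extract_substructure name, and the choice to keep only bonds whose two atoms are both selected are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Molecule:
    """Toy stand-in for an atom container (hypothetical, not a CDK class)."""
    atoms: List[str]                                   # element symbols
    bonds: List[Tuple[int, int]] = field(default_factory=list)  # atom index pairs


def extract_substructure(mol: Molecule, *atom_indices: int) -> Molecule:
    """Return a new molecule with only the selected atoms and the bonds between them."""
    keep = sorted(set(atom_indices))
    for i in keep:
        if not 0 <= i < len(mol.atoms):
            raise IndexError(f"atom index {i} out of range")
    # Map old atom indices to their position in the extracted molecule.
    new_index = {old: new for new, old in enumerate(keep)}
    atoms = [mol.atoms[i] for i in keep]
    # Assumption: a bond survives only if both of its atoms were selected.
    bonds = [(new_index[a], new_index[b])
             for a, b in mol.bonds
             if a in new_index and b in new_index]
    return Molecule(atoms, bonds)


# Ethanol heavy atoms: C-C-O. Extract the C-O fragment.
ethanol = Molecule(["C", "C", "O"], [(0, 1), (1, 2)])
fragment = extract_substructure(ethanol, 1, 2)
```

The original molecule is left untouched, and the atom indices in the fragment are renumbered from zero.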

The Authors
  34  Egon Willighagen
   8  Rajarshi Guha
   4  Gilleain Torrance
   3  Jules Kerssemakers
   2  Dmitry Katsubo
   2  Jonathan Alvarsson
   1  Julio Peironcely
   1  Onkar Shinde
   1  Stephan Beisken
   1  Mark Rijnbeek
This reflects contributions from Sweden, USA, UK, The Netherlands, and, I think, Pakistan. In terms of institutes, this includes patch authors working at (not implying the work was done as part of the day job): Karolinska Institutet, NIH, EBI (more than one group), Radboud University Nijmegen, EPO, Uppsala University, and the Debian project. That's pretty impressive for one release :)

The Reviewers
 20  Rajarshi Guha 
 13  Egon Willighagen 
  4  Gilleain Torrance
  1  Jonathan Alvarsson
  1  Jules Kerssemakers 

And of course kudos to all who reviewed patches!

The Changes
  • Fixed detection of PH3 with implicit hydrogens 82197ac
  • Added a unit test for a failing sulphur AT detection + fix: 3 coordinate sulphur without double bonds has one implicit hydrogen and is S.anyl 14d41c8
  • Fixed detection of P.ate atom types with one implicit hydrogen 4ded3e4
  • Added unit tests for bug #3141611 359b0b0
  • Added author tags 7158552
  • Make CONNECT records work with non-protein PDB files. Rewrote CONNECT parsing. Do not write CONECT records for atoms that have no bonds 50f79bf
  • Use wildcards (*) in jar file references. This fixes the problem while packaging the library for Debian as jar versions in Debian may not match with upstream. c79bcdc
  • Fixed JavaDoc in accordance to current API 03cfd36
  • Added a fingerprinter which does not take into account aromaticity, but looks at SP2 hybridization as replacement. 87d562c
  • Clarified what is read from what file type b9a3cd6
  • Fixed propagating of the error handler and reading mode. 997908d
  • catch right exception and use the logger eed4717
  • Use IAtomContainer instead of AtomContainer in AtomContainerSet 74112c4
  • Escape /'s too 6941321
  • fixed pdb_atomtypes.xml errors mentioned in, including the nigly vdwRadius for GLN.CD 9a07280
  • whitespace-only: alignment of TYR-section 81ba913
  • Typo fix -> Constructs 2a334d6
  • Fixed classpath to solve those annoying error messages about unknown packages 263c7e2
  • Added missing @Test annotation (idiot) df302b3
  • Fixed typo in PDBReader constructor 92609d2
  • Added a few missing tests c143185
  • Added a missing reference for Floyd's algorithm 49eb772
  • Fixed second half of patch a3e25e69419e53291e093d9365b4187843f03736 :( 74bfbc9
  • Added missing dependencies a3e25e6
  • Updated the author list 595cc9f
  • Fixed a concurrency error, caused by the use of a static field which was not supposed to be static, as the classes are instantiated just to allow customization c0ede3f
  • Added two new authors 833a9ac
  • Updated to the 1.4.x API c5443a7
  • Fixed the use of the proper Convertor 2f19e98
  • Added the Co(3+) atom type (fixes #3093644) 67d5625
  • Added the unit test phosphine for bug #3190151 740e4a5
  • Added a unit test for a phosphor without explicit or implicit hydrogens 5e980dd
  • Fixed proper syntax for ignored tests 95a59f7
  • Added unit test to see if descriptor specifications refer to existing BODO entries 7c33f88
  • Fixed descriptor identifier 0ffd85b
  • Added unit testing of the dict module 4e47a4e
  • Added code testing for the dict module b2d4f85
  • fixed spelling error in JavaDoc 3dcf2fe
  • Enabled PMD and OJDC for the tautomer module b73a662
  • Create tautomers based on InChI 68d21b7
  • Updated OrderQueryBond in SMARTS & isomorphism matching so that we correctly match when faced with aromatic bonds (rather than just looking at the bond order) cb4335e
  • Updated HIN reader to parse aromaticring keywords and appropriately mark atoms as aromatic. Updated test and added test case 265dbba
  • Added test HIN file for aromaticrings keyword 8a565e9
  • Patch fixing isTheSame method in MolecularFormulaRange c86b52e
  • Converted the use of HashMap to store SD (and other) properties to LinkedHashMap so that the order of properties read in from a SD file is maintained. Performance hit appears to be minimal b36a3e0
  • Removed nonexisting dependency 2b3f842
  • Removed nonexisting dependency dfe8cd9
  • Added some more elements to the valence table 0629d8e
  • Fixed NPE in AtomContainerComparator when container has pseudo atoms. fe8538f
  • Implemented simple LRU caching mechanism to avoid reparsing previously used SMARTS queries. The cache size is set to 20 by default but can be set by the user c3a768d
  • Implemented simple LRU caching mechanism to avoid reparsing previously used SMARTS queries b90c252
  • Added a not-so-unit test for reading elemental data from XML using the reader and the underlying SAX handler 0c565e4
  • Method to extract substructures cfa5c93
  • StructureDiagramGenerator methods now throw CDKException instead of catch-all java.lang.Exception. 7c6a0c5
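The LRU caching of parsed SMARTS queries mentioned in the list above can be sketched as follows. This is a Python illustration of the general technique, not the CDK code: the QueryCache class and the fake_parse function are hypothetical, and only the default size of 20 is taken from the commit message.

```python
from collections import OrderedDict


class QueryCache:
    """Least-recently-used cache for parsed queries (illustrative sketch)."""

    def __init__(self, parse, max_size=20):
        self._parse = parse            # function that turns a SMARTS string into a query
        self._max_size = max_size      # 20 is the default mentioned in the commit
        self._entries = OrderedDict()  # oldest entry first, most recent last

    def get(self, smarts):
        if smarts in self._entries:
            # Cache hit: mark as most recently used and skip parsing.
            self._entries.move_to_end(smarts)
            return self._entries[smarts]
        parsed = self._parse(smarts)
        self._entries[smarts] = parsed
        if len(self._entries) > self._max_size:
            # Drop the least recently used entry.
            self._entries.popitem(last=False)
        return parsed

    def keys(self):
        return list(self._entries)


# Stand-in parser that records how often it is called.
calls = []

def fake_parse(smarts):
    calls.append(smarts)
    return ("parsed", smarts)


cache = QueryCache(fake_parse, max_size=2)
cache.get("[OH]")
cache.get("c1ccccc1")
cache.get("[OH]")        # hit, no reparse
cache.get("C=O")         # evicts c1ccccc1, the least recently used
cache.get("c1ccccc1")    # has to be parsed again
```

The gain depends on how often the same query string comes back, which is typical when one SMARTS pattern is matched against many molecules.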

Saturday, May 21, 2011

CDK 1.2.9: the changes, the authors, and the reviewers

The CDK 1.2.9 release does not contain overly many patches, as you'd expect from a bug-fix-only series. It has a few atom type perception patches, and a patch by Onkar simplifying the building of the Debian package:
  • Added Onkar Shinde, a Debian developer who recently patched the build system 95ad780
  • Fixed detection of P.ate atom types with one implicit hydrogen d619db2
  • Added unit tests for bug #3141611 ea248f3
  • Fixed detection of PH3 with implicit hydrogens 989b151
  • Added a unit test for a failing sulphur AT detection + fix: 3 coordinate sulphur without double bonds has one implicit hydrogen and is S.anyl 32f9aaa
  • Use wildcards (*) in jar file references. This fixes the problem while packaging the library for Debian as jar versions in Debian may not match with upstream. 82d76f2
  • Use IAtomContainer instead of AtomContainer in AtomContainerSet 4e2e66c
(Made with git log --oneline cdk-1.2.8..cdk-1.2.9 | sed 's/\([a-f0-9]*\)\s\(.*\).*/<li>\2 <a href="http:\/\/\/git\/gitweb.cgi?p=cdk\/cdk;a=commit;h=\1">\1<\/a><\/li>/'.)
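For those who prefer not to read sed, roughly the same transformation can be sketched in Python. The to_list_item function is a hypothetical helper, and because the host name is missing from the command above, the URL prefix below is a placeholder, not the real gitweb address.

```python
import html
import re

# One line of `git log --oneline`: abbreviated hash, whitespace, subject.
ONELINE = re.compile(r"^([a-f0-9]+)\s(.*)$")


def to_list_item(line, commit_url_prefix):
    """Turn one `git log --oneline` line into an HTML list item linking to the commit."""
    match = ONELINE.match(line)
    if match is None:
        raise ValueError(f"not a oneline log entry: {line!r}")
    sha, subject = match.groups()
    # Escape <, > and & so commit subjects cannot break the HTML.
    safe_subject = html.escape(subject, quote=False)
    return f'<li>{safe_subject} <a href="{commit_url_prefix}{sha}">{sha}</a></li>'


# Placeholder prefix (assumption): replace with the real gitweb URL.
PREFIX = "https://example.org/git/gitweb.cgi?p=cdk/cdk;a=commit;h="
item = to_list_item(
    "bdb4e0e Copy data files into the right folder of the puredist", PREFIX
)
```

Unlike the sed one-liner, this version escapes HTML special characters in the commit subject.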

The Authors
  6  Egon Willighagen
  1  Onkar Shinde
  1  Gilleain Torrance
(Made with git shortlog -s -n cdk-1.2.8..cdk-1.2.9.)

The Reviewers
  2  Rajarshi Guha
  1  Jonathan Alvarsson
  1  Egon Willighagen 
(Made with git log cdk-1.2.8..cdk-1.2.9 | grep Signed-off | cut -d':' -f2 | cut -d'<' -f1 | sort | uniq -c.)
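The reviewer count is just a tally of Signed-off-by lines. A minimal Python sketch of the same idea is given below. The count_signoffs function is a hypothetical helper, and the log text, names, and email addresses are made up for the example.

```python
from collections import Counter


def count_signoffs(log_text):
    """Count how often each name appears in a Signed-off-by line of `git log` output."""
    counts = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("Signed-off-by:"):
            # Keep what is between the colon and the email address,
            # like the two `cut` calls in the shell pipeline.
            name = line.split(":", 1)[1].split("<", 1)[0].strip()
            counts[name] += 1
    return counts


# Made-up log excerpt (assumption): not taken from the CDK repository.
SAMPLE_LOG = """\
commit aaaaaaa
Author: Carol Example <carol@example.org>

    First change

    Signed-off-by: Alice Example <alice@example.org>

commit bbbbbbb
Author: Carol Example <carol@example.org>

    Second change

    Signed-off-by: Alice Example <alice@example.org>
    Signed-off-by: Bob Example <bob@example.org>
"""

reviewers = count_signoffs(SAMPLE_LOG)
```

Stripping the whitespace around the name means that small spacing differences in the sign-off lines do not split one reviewer into several entries.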

Friday, May 20, 2011

A few Virtuoso commands I learned today

Virtuoso is not new to me (e.g. I use it for the ChEMBL-RDF SPARQL end point). The Debian package is, though. The only big problem here was figuring out where the db ended up, though that did not really matter in the end. What did matter is that Virtuoso is not happy about long resource URIs, so I had to cut down on outgoing links. Another difference is that the isql command is called isql-vt, and the default values are not really useful for large numbers of triples. But in the end I had a working test system on my laptop.

And I learned two new isql commands, on top of the two (DB.DBA.RDF_LOAD_RDFXML_MT and DB.DBA.TTLP) I learned before. First of all, how to empty a database (which I did so far by killing the running instance, deleting the .db files, and restarting :):


And the other is the multithreaded version for loading N3 files (and, yes, I just realized that I was already using that for reading RDF/XML :):


That saves a lot of time on this four core laptop.

And for the cheminformatician this means that I am an important step closer to a next release of the ChEMBL-RDF export, which will include InChIs and SMILESes.

Thursday, May 19, 2011

This blog is tracking you: cookies and EU regulation

Just to inform you, parts of this website are very likely tracking you using HTTP cookies, including at least Google Adsense, and likely the Topsy extension to show how often a post was tweeted, and the LinkedIn button to show the same for their network. Possibly (I am not sure about that) the service hosting my blog also tracks you. Personally, I care about your comments, and those I track manually. Down at the bottom of the blog, I have added a short text on cookies, hoping to fulfill the new EU regulation's requirements (thanx to John for the ping!).

Thematic Series around #RDF in chemistry now online

Thanx to the great support from the people at BioMed Central / Chemistry Central (not sure how apart / the same these two are), and Jan and Bailey in particular, the Thematic Series around the ACSRDF2010 meeting of last August in Boston is now online (and of course thanx to Martin for co-organizing the meeting and this series, to Pfizer, Inc. for sponsoring the series, see also our editorial, and to all authors who contributed their papers!):

BTW, I'm also quite happy with this funny functionality in ksnapshot I never noticed: it can take free-hand screen region snapshots :)

Monday, May 16, 2011

Editorial ACSRDF2010 online (on RDF in chemistry)

The Thematic Series in the JChemInf is close to getting online with the first papers, indicated by Martin's and my editorial (doi:10.1186/1758-2946-3-15):

The other papers should be added to the journal this week, and I will keep you posted on that. The wordle shows one of the three sessions on RDF in chemistry of the meeting in Boston last year.

Friday, May 06, 2011

ChEMBL-RDF now linking to CrossRef

CrossRef reported recently they started supporting RDF, which is great news indeed! The RDF version of ChEMBL I created already contained the DOI, so the patch to link out was simple.

I have updated the docs.n3 for the downloads I prepared, which can simply be loaded into your triple store. A typical change looks like this, so loading the new triples will only add triples, and the URIs of papers are not changed:

@@ -742,6 +802,7 @@
  dc:isPartOf jrn:jaf567f18f5617b0fe77d704fe8f61a3e .
res:r91 a bibo:Article ;
  bibo:doi "10.1016/S0960-894X(01)80794-0" ;
+ owl:sameAs <> ;
  dc:date "1991" ;
  bibo:volume "1" ;
  bibo:issue "3" ;

Info on the license (CC-BY-SA), and which two papers you are asked to cite if you use this data, can be found here. Oh, and have a peek at the series on database cheminformatics by John (part1, part2, part3)!

Wednesday, May 04, 2011

The costs of VR fund applications. Tell me I'm wrong...

One of the main Swedish funding agencies is called the Vetenskapsrådet (VR). They just reported the number of applications in their big funding round of this year: 4606! That is a staggering number. I decided to do some math here. Let's assume each proposal took about one week of effort, possibly shared between two or more scientists. Let's assume the rate of one week of scientist at a Swedish university is about 1000 euro (so, including common overhead). That means that it cost the scientific community more than 4.6 million euro to apply for funding this year! Wow! Considering that about 12% gets awarded (10% at any university, and 20% at KI where I work), that means that about 3.5 million euro was wasted making scientists learn to write grant applications... well spent indeed!

Now, I truly hope I am making some stupid mistake in my argumentation here... I don't like to think what that amount of funding in one year could do for Open Source cheminformatics... (Oh, and this excludes the money needed to actually read these proposals, though they do not actually read all.)
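The back-of-the-envelope estimate can be written out as a few lines of Python. The inputs are the assumptions from the paragraph above (one week per proposal, 1000 euro per week, success rates of 10%, 12% and 20%); the function name is made up, and integer arithmetic is used to keep the numbers exact.

```python
APPLICATIONS = 4606               # number of applications reported by VR
COST_PER_APPLICATION_EUR = 1000   # assumed: one week of scientist time


def unfunded_cost(total_eur, success_percent):
    """Cost of the applications that do not get funded, in whole euros."""
    return total_eur * (100 - success_percent) // 100


total_cost = APPLICATIONS * COST_PER_APPLICATION_EUR

for percent in (10, 12, 20):
    wasted = unfunded_cost(total_cost, percent)
    print(f"success rate {percent}%: {wasted} of {total_cost} euro not funded")
```

With these inputs the total is 4,606,000 euro. The unfunded share is about 4.05 million euro at a 12% success rate, and still about 3.68 million at 20%, so the rough 3.5 million quoted above is, if anything, on the low side.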

Tuesday, May 03, 2011

Online SEURAT workshop: Open Data: What, Why, How?

An announcement.

Open Data: What, Why, How?
Monday May 9th 2011, 16:00 CEST, 7:00 PDT

(organized by ToxBank & Seurat DAWG)

Open Data is one approach to making data more easily re-usable, including specifications by the data creator on the terms of the type of re-use.

The goal of this interactive virtual meeting is to address misconceptions around Open Data and to answer questions experimental biologists, chemists, and toxicologists from any of the SEURAT-1 projects have with respect to how Open Data can help their research. Over 90 minutes the invited speakers Rufus Pollock (Open Knowledge Foundation) and John Wilbanks (Creative Commons) will reply to questions from the audience, complemented by a discussion by Barry Hardy (Douglas Connect) on how Open Data could contribute to the success of EU FP7 projects including OpenTox, ToxBank and SEURAT-1.

The meeting will be held online and will be organized via GotoMeeting. There are only 25 seats, so early registration is important, as we will fill seats on a first-come basis. However, one ‘seat’ can host multiple scientists behind a single computer.

Please send an email to Egon Willighagen listing your name, SEURAT-1 project, and email address. Details on how to log in will be sent to that address shortly before the meeting.

16:00-16:05 Egon Willighagen - Why this meeting?
16:05-16:25 Rufus Pollock - Open Data: What, Why, How?
16:25-16:50 John Wilbanks - Open Biological Data
16:50-17:10 Barry Hardy - Potential impact of Open Data on our Research
17:10-17:30 General Q&A session