Saturday, January 30, 2010

Validating MDL SD files and Symyx molfiles with the CDK

Bioclipse 2.0 introduced new, powerful molecular table support, and we have been eager to test it on large SD files. A recent ChEBI SD file failed to open, and all eyes immediately turned to the CDK, the cheminformatics library used in Bioclipse.

After careful investigation, it turned out that the ChEBI file contained a few entries which were not MDL molfiles, but queries for the ISISBase system. Those cannot be read by the CDK MDLV2000Reader. However, the reader crashed on them instead of failing more safely. That's not nice, and has been fixed. But the underlying problem, invalid input, is rather recurrent, and it is the reason why I like CML so much. CML, being based on XML, has several general validation approaches that give in-depth error messages about what is wrong with the file.

So, I asked on the BOx what the Open Source cheminformatics community had to offer for this. Turns out that several tools find problems in the files, but none could report where the error occurred.

Now, some time ago, I played with two reading modes, RELAXED and STRICT, as faulty files are core cheminformatics material, and the software is blamed if the QSAR model resulting from them is not good (seriously). Anyway, a small API change in the CDK would bring a validating MDLV2000Reader quite a step closer, but I had not followed up on it until last Friday, when a patch I was reviewing caused 6 new unit test failures. The new failures were caused by an assumption which turned out to be false for the test files used in those 6 unit tests.

The MDL (or Symyx) molfile specification (not an Open Specification) defines an atom block line as:
xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee
but does not specify which fields are optional. And indeed, many tools around save MDL molfiles with one or more fields missing, leading to shorter than expected lines. And, as you might have expected, the failing unit tests used files with lines missing the field introduced by the patch, causing Exceptions to be thrown around. I have yet to make up my mind whether the lack of those fields is a problem in the file, or allowed by the format. In either case, the information from that field is not available, and the reader could safely ignore the missing information, if the user asks for that.
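To make the short-line problem concrete, here is a minimal Python sketch (not CDK code) that slices an atom block line at the column positions implied by the template line above. The field names and the choice to return None for fields that a short line never reaches are my own assumptions for illustration; the specification does not say which fields may be omitted.

```python
# Column layout derived from the template line quoted in the post.
# The field names are illustrative labels, not taken from the
# specification or from the CDK.
TEMPLATE = "xxxxx.xxxxyyyyy.yyyyzzzzz.zzzz aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee"

# (name, start, end) with 0-based start and exclusive end.
FIELDS = [
    ("x", 0, 10),
    ("y", 10, 20),
    ("z", 20, 30),
    ("symbol", 31, 34),           # "aaa", after one separating space
    ("mass_difference", 34, 36),  # "dd"
]
# The remaining eleven fields are three characters wide each.
_REST = ["ccc", "sss", "hhh", "bbb", "vvv", "HHH", "rrr", "iii", "mmm", "nnn", "eee"]
FIELDS += [(name, 36 + 3 * i, 39 + 3 * i) for i, name in enumerate(_REST)]


def split_atom_line(line):
    """Slice an atom block line into its fixed-width fields.

    A field the line is too short to reach is returned as None, so the
    caller can tell "missing" apart from "present but blank" (returned
    as an empty string).
    """
    line = line.rstrip("\n")
    fields = {}
    for name, start, end in FIELDS:
        if len(line) <= start:
            fields[name] = None  # the line ends before this field begins
        else:
            fields[name] = line[start:end].strip()
    return fields
```

Calling split_atom_line() on a line that stops after the "ccc" field gives None for "sss" and everything after it, which is exactly the situation a reader has to decide about: report it, or quietly ignore it.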

Now, personally, I would rather send the file back to the user with a proper error report and show them what is wrong with the file. Or better, provide them with an MDL V2000 text editor (e.g. in Bioclipse) which would graphically highlight errors, as many of us are used to with Eclipse:

CDK Patch
So, I am hacking up a patch for CDK master to allow error reporting by IChemObjectReaders. The initial versions of the API update and its use in the MDLV2000Reader are available as Gist 290659. They are not final yet, as I realized when making the above screenshot that a mere int col is not enough, and that I actually need the startCol and endCol positions instead. Also, there is only an error level at this moment, and no warning level as in the screenshot.
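The idea of reporting a row plus a column range to a handler can be sketched independently of the CDK. The following is a minimal Python sketch and not the API from the Gist: the names ReadError, CollectingErrorHandler and parse_int_field, the strict flag, and the choice to return None in relaxed mode are all hypothetical and only illustrate the design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadError:
    """One problem found in the input, with enough detail to highlight it."""
    message: str
    row: int        # 1-based line number
    start_col: int  # 1-based, inclusive
    end_col: int    # 1-based, inclusive
    cause: Optional[Exception] = None


class CollectingErrorHandler:
    """Hypothetical handler that simply remembers every reported error."""

    def __init__(self):
        self.errors: List[ReadError] = []

    def handle_error(self, message, row, start_col, end_col, cause=None):
        self.errors.append(ReadError(message, row, start_col, end_col, cause))


def parse_int_field(line, row, start, end, label, handler, strict=False):
    """Parse line[start:end] as an integer and report failures.

    start is 0-based and end is exclusive, as in Python slicing; the
    handler receives 1-based inclusive columns. In relaxed mode (the
    default) the error is reported and None is returned; in strict mode
    the error is reported and the exception is raised again.
    """
    text = line[start:end]
    try:
        return int(text)
    except ValueError as exc:
        handler.handle_error("Could not parse %s field." % label,
                             row, start + 1, end, exc)
        if strict:
            raise
        return None
```

With a start and end column in the error object, an editor can underline exactly the offending field, and a separate severity field could be added later to distinguish warnings from errors.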

That said, I created a jar (ant dist-large) and saved it as mdlCheck.jar, and wrote a bit of Groovy:

which defines a class implementing the new IChemObjectReaderErrorHandler and then reads an MDL molfile. And the output looks like it fulfills my needs:
$ CLASSPATH=mdlCheck.jar groovy mdlCheck.groovy src/test/data/mdl/test6.sdf
location: 5, 35: Could not parse mass difference field.
  -> For input string: ""
location: 6, 35: Could not parse mass difference field.
  -> For input string: ""
Note to myself: that atom block does not look like an MDL molfile atom block at all! Every second line shows the Exception passed to the error handler. I have to say, those messages are rather cryptic, but they result from a NumberFormatException, if I am not mistaken.

Or, another commonly found issue (using D and T as element symbols):
$ CLASSPATH=mdlCheck.jar groovy mdlCheck.groovy src/test/data/mdl/hisotopes.mol
location: 6, 32: Invalid element type. Must be an existing element, or one in: A, Q, L, LP, *.
location: 7, 32: Invalid element type. Must be an existing element, or one in: A, Q, L, LP, *.
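The check behind this message can be sketched in a few lines of Python. This is an illustration of the rule stated in the error message, not the CDK implementation: the function name check_symbol is hypothetical, and the element list is simply the current periodic table typed out by hand.

```python
# Symbols of the 118 named elements, in order of atomic number.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar
K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

# The non-element symbols listed in the error message above.
PSEUDO_ATOMS = {"A", "Q", "L", "LP", "*"}


def check_symbol(symbol):
    """Return None if the symbol is acceptable, otherwise an error message."""
    symbol = symbol.strip()
    if symbol in ELEMENTS or symbol in PSEUDO_ATOMS:
        return None
    return ("Invalid element type. Must be an existing element, "
            "or one in: A, Q, L, LP, *.")
```

Under this rule D and T are rejected, because deuterium and tritium are isotopes of hydrogen and not elements of their own; whether a reader in a relaxed mode should translate them to H with a mass number is a separate design decision.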

Enough for now... dinner time.

CDK 1.2.5: the changes

Mostly license statement fixes (thanx to Andrew for caring about it!), and a bug fix in the UniversalIsomorphismTester:
  • Removed bit which explain how to apply the LGPL to source (fixes #2926775) 12e8e4f
  • Attached are some more license files. 47a226a
  • The log4j.jar is version 1.2.15. 834ade8
  • More completed files attached. 9e88243
  • They were incomplete, as many other files still are. 261795f
  • Added a QA target ae661aa
  • Use local PMD and JUnit reports if available 35550bd
  • Added option to run it on just one module fcdad41
  • Added info for dependencies e4a90b1
  • Created a list, to be able to add license information 5b5e54d
  • Added missing copyright/license header 8dee40d
  • Catch a SocketException when there is no internet 5737371
  • Output where it is working on 7693308
  • Removed empty lines 53e60f9
  • Added initial license information, based on the information sent by Stefan 17b3c0c
  • Update code example in JavaDoc reflecting the current API (fixes #2914791) 75b4457
  • Updated UIT matching for the single atom case so that it correctly handles queries that are plain atom containers bbc8f60
  • Package fixing release: fixed building JavaDoc from source dist e151326
  • Added missing references file to the source dist (full and pure) 1cd2124
  • Removed source folders of Doclets, which are not part of the release, and should not be compiled for JavaDoc generation anyway dc8e5e7
  • Added a test to MoleculeSetTest, which tests that the clone() does not change the MoleculeSEt 42915c4
  • Added some extra lines, hopefully fixing the conflicts all the time b867fb2

Thursday, January 28, 2010

Semantic Web features in Bioclipse 2.2

Ola is releasing Bioclipse 2.2.0 today, and asked me to showcase the semantic web functionality in Bioclipse. I realized that I do not have a nice page giving an overview of the semantic web features. But I did blog a lot about RDF functionality, so here's a list of pointers:
Or check this screenshot from a Posterous post about a MyExperiment workflow:

One thing I have not blogged about yet (I think), is that the Bioclipse RDF manager also understands RDFa now. Well, sort of... it relies on a webservice, but this is what the script looks like:
model = rdf.createStore()
rdf.importRDFa(model, "")
rdf.saveRDFN3(model, "/Virtual/egonw.n3")
With support for SPARQL endpoints, and for reading RDF from web resources directly (RDF/XML, N3, RDFa), Bioclipse is ready for the chemical semantic web.

Monday, January 25, 2010

Semantic Chemistry with the Resource Description Framework

First Call for Papers
Semantic Chemistry with the Resource Description Framework
240th ACS National Meeting & Exposition
Boston, Massachusetts, August 22-26, 2010
CINF Division

We now invite papers for our symposium on the use of the Resource Description Framework (RDF) technologies in semantic knowledge representation and data exchange in chemistry at the 240th National Meeting & Exposition of the American Chemical Society (ACS) in Boston this fall.

Semantic Chemistry has been around for a while, but is seeing a revival with the adoption of the Resource Description Framework (RDF) and matching technologies in chemistry. RDF triples provide a simple structure that allows data and knowledge alike to be represented in a single framework. Derived technologies include the capturing of ontologies with the Web Ontology Language (OWL) and performing queries with SPARQL. A wide variety of free and open source products make it easy to set up servers with large amounts of RDF data, while integration with HTML is available too with RDFa.

The RDF symposium at the 240th ACS national meeting in Boston invites submissions of talks about the use of RDF in chemistry and cheminformatics. Topics could include the use of OWL ontologies, OWL axioms, reasoning and inference, RDF in user interfaces, such as RDFa in web front ends, visualization, querying systems, and applications thereof, such as linking data sets, compound classification, cloud computing, web services, data aggregation, semantic publishing, and literature mining.

Abstracts may be submitted via You’ll find the RDF session as part of the CINF division symposiums. Submissions open January 25, 2010, and the deadline is March 28, 2010. In case of questions, please email Egon Willighagen at or Martin Braendle at

Thursday, January 21, 2010

Extracting RDF from Chem4Word documents

Joe has released the first Chem4Word demo file, and has written about how to extract the CML with Java and with C#.

I haven't actually gotten around to fiddling with Java, but ran Strigi against it to extract RDF, while having the Strigi-Chemistry plugins installed. This is part of the RDF that came out:
    "acetic acid",
    "(8R,9S,10R,13S,14S,17S)- 17-hydroxy-10,13-dimethyl- 1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a] phenanthren-3-one",
I believe there is quite some room for improvement, but it's a start :) Thanx to Joe for posting the public domain test file, so that other projects can start playing with the exciting new technology. I should note, however, that I am running neither a Microsoft OS nor MS-Word, and the source of the saved document is the only way I have access to the CML right now.

Sunday, January 17, 2010

Installation HOWTO for CDK-Taverna in Taverna 1.7.2

Thomas made a new release of CDK-Taverna for the Taverna 1.7.2 release, which is great news as the previous release was for Taverna 1.7.1.

He asked me to test it, so I set up a fresh Taverna install and added the new plugin. After that, I used the MyExperiment plugin to download one of the CDK-Taverna workflows Thomas has on MyExperiment, and tuned it a bit to use some local input instead of the database. I took some screenshots while at it, and will use those now to talk you through the installation of Taverna and the CDK-Taverna plugin.

Download Taverna
Taverna 1.7.2 can be downloaded from this download page, but I took the Linux version from the SourceForge download site. I cannot detail the OS/X or Windows installation, but on Linux you simply unzip the downloaded file, and you're ready to go:
$ cd taverna-1.7.2/
$ sh
Plugin Installation
Plugins can be installed with the Plugin Manager, which can be accessed via the Tools menu:

Clicking the Find New Plugins button takes you to a second dialog listing known plugin sites, and the default download has several already:

The CDK-Taverna update site is available at, and we can make Taverna aware of this update site by clicking the Add Plugin Site button:

After filling out these values and approving them with the OK button, the site will show up in the dialog showing all available plugins, where you need to tick the check box in front of the CDK-Taverna plugin name, as done in this screenshot:

You can then hit the Install button after which the plugin will be downloaded:

After it is done downloading the plugin, you can close the Plugin Sites and Plugin Manager dialogs. I shut down and restarted Taverna with sh, but I am not entirely sure this is needed. After that, the CDK nodes showed up in the list of Taverna processors:

MyExperiment Plugin
Using the same Taverna Plugin Manager you can also install the MyExperiment plugin that allows you to search, browse, preview and download Taverna workflows from the MyExperiment website from within Taverna itself. I installed the plugin, and then used it to search for CDK workflows (and downloaded a QSAR workflow):

This is about everything you need to get going. It's not exactly rocket science, but I guess this howto is useful as you get to see what you should expect when setting up a CDK-Taverna environment. If you have further questions, please leave those in the comments section, and I'll try to merge in answers where possible, or otherwise answer in the comments too.

Friday, January 15, 2010

Warren DeLano and the future of PyMOL

This post brings old and new news. The old news is that Warren passed away at the end of last year, after having successfully shown how Open Source cheminformatics (and/or bioinformatics) software can be developed in a commercial setting (DeLano Scientific); PyMol was a huge success. Warren had a SourceForge account (wdelano) for almost 10 years:

I had not blogged about it before as the news hit me hard. Surely, Warren knew a lot of people and I was only one of many, but the memory of Warren stuck well. I knew Warren from the Jmol project, where we talked in the past about coming to an Open Specification for exchanging scenes between Jmol and PyMol. Around the end of my PhD contract we even briefly, but seriously, explored me doing a post-doc in his group.

Anyway, lots of people wrote blog posts (in arbitrary order: Rich, P212121, MacResearch, Jörg, MMB, Shirley, Derek, Wavefunction, Dan, Barry, and probably many more). A memorial fund has been set up which will focus on promoting Warren's Open Source ideas, including an Award.

Yesterday, I was pinged about Schrödinger acquiring PyMol. The press release is, as usual, short on details, but those have become clearer during the day. Schrödinger is not new to Open Source cheminformatics, and has a product based on KNIME, which is now GPL but is also available under a proprietary license for those who prefer that.

But, unless I missed any other Open Source (-oriented) product, the acquisition of PyMol significantly changes the game for them: PyMol is a major Open Source product, bigger than KNIME at the moment, I'd guess. My immediate question about the acquisition was whether they acquired the copyrights, and they did, according to this commit:

This is important as it puts Schrödinger in charge of license changes. Fortunately, they seem rather serious about the Open Source thing, and hired an active PyMol developer (Jason), and kept the existing Open Source license:

Therefore, congratulations to Schrödinger for getting seriously into the Open Source community, making them the next Dr Who of PyMol, and congratulations to the family of Warren on ensuring continued development of the PyMol project! It's heart-warming to see that despite the bad times they are going through, and all the options they had with the PyMol code base, they find time for and strength in supporting Warren's ideas about the future of cheminformatics. My thoughts are with them!