Saturday, December 31, 2011

CDK 1.4.7: the changes, the authors, and the reviewers

In preparation for the next (4th) edition of my Groovy Cheminformatics book on the CDK, I found a show-stopper bug, fixed it, and sent in the patch, which Rajarshi quickly reviewed and applied to the cdk-1.4.x branch. This particular bug was a null pointer exception that was fixed not so long ago in the log4j implementation, but turned out to be present in the logger to STDOUT too.
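To illustrate the kind of guard such a fix involves, here is a minimal sketch. It is not the CDK logging code: the class `StdoutLoggerSketch` and its methods are hypothetical stand-ins, and the only assumption is that the logger can be handed a null object or an exception without a message.

```java
import java.io.PrintStream;

/** Hypothetical stand-in for a logger that writes to STDOUT; not the CDK class. */
final class StdoutLoggerSketch {
    private final PrintStream out;

    StdoutLoggerSketch(PrintStream out) {
        this.out = out;
    }

    /** Formats one log line; a null message is rendered as the text "null". */
    static String format(String level, Object message) {
        // String.valueOf(Object) returns "null" for a null argument instead of throwing
        return level + ": " + String.valueOf(message);
    }

    /** Logs an exception, guarding against a null exception and a null message. */
    void debug(Throwable problem) {
        if (problem == null) {
            out.println(format("DEBUG", null));
            return;
        }
        // getMessage() may legitimately return null, e.g. for new Exception()
        out.println(format("DEBUG", problem.getClass().getName() + ": " + problem.getMessage()));
    }

    public static void main(String[] args) {
        StdoutLoggerSketch logger = new StdoutLoggerSketch(System.out);
        logger.debug(new Exception());   // exception without a message
        logger.debug((Throwable) null);  // no exception at all
    }
}
```

The point is only that every path that touches the message or the exception tolerates null before calling a method on it.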

This release also fixes the reading of aliased atoms in MDL V2000 molfiles, thanx to another bug fix patch from John May (thanx!), and formally deprecates the nonotify implementation, which has already been removed from the master branch. The silent module should be used instead; it has the same functionality but has cleaner code and is faster.

However, one important change you should take notice of is an API change in the IIteratingChemObjectReader interface. The change is minor, but useful. The interface is now typed, and implementing classes implement IIteratingChemObjectReader<IChemModel> (IteratingPCSubstancesXMLReader) or IIteratingChemObjectReader<IAtomContainer> (IteratingMDLReader, IteratingPCCompoundASNReader, IteratingPCCompoundXMLReader, IteratingSMILESReader). This means that the iterator's next() method now returns an IChemModel or an IAtomContainer, and that casting in the calling code is no longer needed.
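What typing the iterator buys you can be sketched without the CDK on the classpath. The types below (`MoleculeStub`, `TypedIteratingReader`, `ListBackedReader`) are hypothetical stand-ins that only mirror the idea; they are not the CDK interfaces.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a molecule type; not the CDK IAtomContainer. */
final class MoleculeStub {
    final String title;
    MoleculeStub(String title) { this.title = title; }
}

/** Hypothetical typed reader interface, mirroring the idea of a typed iterating reader. */
interface TypedIteratingReader<T> extends Iterator<T> {
}

/** Toy reader that iterates over an in-memory list instead of a file. */
final class ListBackedReader implements TypedIteratingReader<MoleculeStub> {
    private final Iterator<MoleculeStub> source;

    ListBackedReader(List<MoleculeStub> molecules) {
        this.source = molecules.iterator();
    }

    @Override public boolean hasNext() { return source.hasNext(); }

    // next() is declared to return MoleculeStub, so callers need no cast
    @Override public MoleculeStub next() { return source.next(); }
}

final class TypedReaderDemo {
    /** Joins the titles of all molecules the reader yields with commas. */
    static String titles(TypedIteratingReader<MoleculeStub> reader) {
        StringBuilder joined = new StringBuilder();
        while (reader.hasNext()) {
            MoleculeStub molecule = reader.next(); // typed: no cast
            // with an untyped Iterator this would have to read:
            // MoleculeStub molecule = (MoleculeStub) reader.next();
            if (joined.length() > 0) joined.append(',');
            joined.append(molecule.title);
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        List<MoleculeStub> input =
            Arrays.asList(new MoleculeStub("methane"), new MoleculeStub("ethanol"));
        System.out.println(titles(new ListBackedReader(input)));
    }
}
```

With the type parameter in place, a mistake such as treating the result as the wrong type is caught by the compiler rather than by a ClassCastException at run time.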

The changes
  • Another hot fix: use @link with the full qualified class name, and removed the import, to fix a dependency issue 0e71cba
  • Added a @deprecated tag on the nonotify data classes, pointing to the silent implementation d283686
  • Fixed dependencies 5ef20b1
  • Extend the abstract suite, so to run the test for the null pointer exception 269c84c
  • Work with the interface 106e5ec
  • Check for a null input fb35047
  • Removed unneeded deps on CMLXOM for JNI-InChI (thanx to Dmitry Katsubo). 8524891
  • Added missing imports of IAtomContainer, needed by the last two patches, but which were not needed in master because we did all that IMolecule/IAtomContainer refactoring already 856f83c
  • Proper typing of the DefaultIteratingChemObjectReader, so that other classes can safely extend it (thanx to Nina) 6de90d3
  • Typed the iterator, removing the need for casting when used 44b7e76
  • Added John May as author 1142dc6
  • Also check that there are two such R1 atoms 962b7d2
  • Added modifications and unit test for alias atom naming patch bd4b094
  • Corrected alias atom naming in MDLV2000Reader and added test 23132a0
The authors

13  Egon Willighagen
  2  John May

The reviewers

6  Rajarshi Guha
2  Egon Willighagen
1  Nina Jeliazkova

CDK 1.4.6: the changes, the authors, and the reviewers

OK, I forgot to write those up again :(

Release 1.4.6, from about a month ago, fixes a few bugs, including broken JavaDoc, atom type perception when SMILES are parsed while keeping the lower case formalism as aromaticity indicators (I will not discuss the pros and cons of that here), and the Chi index descriptors for sulphurs. This release also introduces a new fingerprint, based on an extensive list of biologically-relevant substructures identified by Klekota and Roth in 2008 (doi:10.1093/bioinformatics/btn479). This functionality was backported by Jonathan from the PaDEL software by Yap Chun Wei. The rest is a bunch of small code and dependency cleanups as well as new unit tests.

The changes
  • Added missing unit tests 9119aa2
  • Added get-methods for information needed for extensions 4525cbe
  • A few missing unit tests in the 'qsar' module 2356a10
  • Added further methods needed for CDK-JChemPaint 804a6f5
  • Added missing JavaDoc. e316210
  • No longer complain about missing testing for abstract classes 284ff84
  • Typo: there → their 640b6e6
  • Added unit testing 05216f0
  • Throw a descriptive exception when 2D coordinates are missing (fixes #3355921) cdc4cbd
  • Fixed the cheminf.bibx well-formedness (fixes #3435367) 7bc0772
  • Added missing @cdk.githash. fdd3d22
  • Updaetd chi index util to correctly evaluate deltav for sulphurs. Fixes bug 3434741. Added unit test 8225175
  • Use interfaces instead of implementations 434c9b1
  • Use interfaces instead of implementations 76dcdf7
  • Use interfaces instead of implementations 5bef796
  • Use interfaces instead of implementations b6ed6a7
  • Moved the pi-contact descriptor (atom-pair) to qsarmolecular, removing the depedency of qsar on reaction d902312
  • Added a missing dependency; it now finds PDBPolymer f03eb3c
  • Fixed test method names f6562cb
  • Added a missing test a853c91
  • Fixed TestClass annotation d37961d
  • Added tests for the isomorphism module to the proper suite 657c3a7
  • fixed dependency for fingerprint tests 135edeb
  • added test for getSubstructure d1eb951
  • lookup SMARTS at index in Substructurefingerprint 65602db
  • Wrote test for KlekotaRothFingerprinter 2b1288b
  • adapted to CDK 84aefixesd70
  • Import from the source code of PaDEL-descriptor (doi:10.1002/jcc.21707) 291def4
  • Fix to use interfaces as argument instead of classes 5e828f5
  • Perceive atom types also when aromaticity from the SMILES is kept 2e76ff6
  • Added unit test to make sure atom types are also perceived when aromaticity from SMILES is kept b315ee6
The authors

25  Egon Willighagen
  5  Jonathan Alvarsson
  2  Rajarshi Guha
  1  Nina Jeliazkova
  1  Yap Chun Wei

The reviewers

13  Rajarshi Guha
  8  Egon Willighagen
  1  Nina Jeliazkova

Monday, December 19, 2011

A Google+ page for the CDK

A week or two ago I created a Google+ page for the CDK, which can be found here and it looks like:

I will use this page to post interesting stories around the CDK. It is not supposed to replace the Planet CDK, which aggregates blog posts from CDK developers and users, but is meant for other, perhaps shorter posts. For example, as can be seen in the above screenshot, I have started to use it as a CDK Literature replacement, which originally was a series of articles in CDK News (the archives), and was later hosted in my blog (see CDK Literature #5 for links to the other four).

But, I will (re)share anything I find useful for the CDK community.

Friday, December 16, 2011

Open Source cancer research (must see video)

Four years ago I wrote up a passionate post about the importance of ODOSOS, and about a year ago I wrote about how research can be open sourced, and how the Open Source Chemistry Development Kit (CDK) fits in. I am proud that the CDK is enabling so many researchers to do novel research! Google Scholar reports over 200 documents and Web of Science gets to over 150 for just the first CDK paper (see also GS vs WoS). That's serious impact. We're far from being fully comparable with existing commercial tools, which have a 40 year head start. But with important functionality still missing (e.g. E/Z stereochemistry), and about 150 open bug reports which need looking into, we surely could use funding and help!

Anyway, this Monday we had the last of this year's Stockholm Open Science meetings, and Carl Bärstad joined, who is organizing TEDx talks here in Stockholm, and he pointed me to this must-see video on open source cancer research:

I can highly recommend watching it, as it is very insightful about cancer (it's the third mechanism I now know for how cells remember state: after DNA methylation and microRNAs, now plain, boring protein; I wonder when metabolite-sized molecules will show up as cell-division surviving state preservatives; they will).

But, it also puts Open Science ideas to practice in drug discovery. Yeah, they even give the structure (and SMILES) of JQ1 (see PubChem):

Now, the cancer they started off with is pancreatic cancer, which is what my mother died of 3 years ago, which provides a third reason for me to love this video!

Update: I have uploaded JQ1 and the charged variant on ChemSpider to Ambit2 as dataset 976496, which means that you can look up toxicity predictions with ToxPredict for JQ1 on this page. This is what ToxPredict looks like, just before I hit Run all:

Wednesday, December 14, 2011

Google Scholar versus Web of Science

Web of Science (WoS) is the de facto standard for citation information. Its citation counts are used for many purposes, among which to decide whether I am a good scientist. Web of Science, however, is really expensive, and Joe the Plumber does not have access. No wonder he doesn't know which scientist to trust (...).

Recently, Google made their Scholar product open to all, allowing you to list your publications (about my list), which Google will augment with citation counts. If you search the web, you'll find much being said about the two, in particular compared with each other. One aspect is the accuracy of the citation counts, as people are afraid of gaming, and of random noise found on the web. Others would (counter)argue that Google captures a wider range of literature.

So, I was wondering how this would reflect on my impact. I know that WoS is not errorless either, and I have been making various support requests over the years (my WoS records still have errors). So, do they complement or overlap? Are citation counts comparable? In fact, they turn out to be:

I would be drooling if I got this kind of regression in my nanoQSAR studies! :) There is a very strong correlation, indeed. One of the advantages of Google Scholar is that it does not select an elite group of journals (of course, WoS has to, because their data analysis process involves much more human curation), so Scholar captures newer Open Access journals, like the J. Cheminformatics, too. While I may be a bit of a non-typical scientist (some even argue I am not even doing science...), the overall outcome is that Google Scholar is actually more accurate about my impact than Web of Science is right now.

Thursday, December 08, 2011

Open Science and Non-Commercial licenses (a personal reflection to the Oscar/RSC controversy)

Peter has started a new line of discussion in his blog, referring to a correspondence with representatives from RSC last year, about an annotated literature corpus to (re)train the Oscar3/4 text miner. There are very many sides, and after I reread this post for a second time, I was still not 100% happy about all words: I can only try to express the complexity of the matter and how it started, but do hope to be clear that non-commercial licenses are not useful in Open Science.

I took part in parts of the correspondence Peter refers to, and I would not have written things up the way Peter wrote up his impression of the outcome of that discussion. At some point I seem to no longer have been included in the email correspondence, as I at least did not know the final outcome (see below), so I cannot fully comment on the accuracy of Peter's coverage of it. But my impression of the outcome, limited as it was, is not that far from what Peter wrote up: Oscar4 needs training (doi:10.1186/1758-2946-3-41), and the RSC was unwilling to contribute the full text training corpus to the project without a non-commercial (NC) clause (and I explain below why I think this is bad). Oscar without a training corpus is useless; Oscar with an NC-licensed training corpus is not Open Source (see below). As detailed below, the corpus at sentence level is licensed without the NC clause, and a lot of training can be done that way. Sufficient?

Peter wrote:

"I pointed out very clearly that CC-NC would mean we couldn’t redistribute the corpus as a training resource (and that this was essential since others would wish to recalibrate OSCAR). Yes, they understood the implications. No they wouldn’t change. They realised the problems it would cause downstream. So we cannot redistribute the corpus with OSCAR3. The science of textmining suffers again."
I do not know if it is factually correct that the RSC would not change (below we read they attempted), or whether the organisation really understood the problems. But, it certainly is a fact that we cannot redistribute Oscar4 as an Open Science project with a NC-licensed clause.

And, I want to add and stress here, that blog posts sometimes are just like press releases: things have the highest impact if written down in a black-and-white fashion; and getting things factually wrong happens to all of us now and then.

One of the outcomes I learned about this week is that the RSC released the corpus in some form without the NC clause. The full text paper corpus retained the NC clause of the CC license, but there is also a version where all sentences are released, and this has a CC license without the NC clause. I think this is not optimal, but I still very much appreciate the gesture the RSC is making here, and would like to kindly thank them for that! And I do want to make this clear too (thanx to Cameron for phrasing it so well in his comment): it is, in principle, the RSC's freedom to decide what they want to do, and I fully respect that.

Well, with that out of the way, and I wanted to say something about it, having been involved in the discussion, and feeling a bit in between Peter and the RSC here, appreciating both their view points, and having a third one myself, let's focus on this non-commercial clause a bit more.

If we enlarge our scope a bit, away from written material, to Open Science, it is clear that the non-commercial clause is bad. In the Open Source world, organisations like the Debian project clearly state that non-commercial clauses violate basic freedoms. From an Open Standard perspective, this is pretty much the same. The reason is that, whether you like it or not, we live in a commercial world. Society expects us to be commercial, and any serious business is legally required to make making profit a company goal. Now, this effectively means that any science made available as non-commercial is not Open: you are effectively not giving people the freedom they need to advance science.

In short, a CC license with the NC clause is in fact quite like "yes, we love to be Open, but we are too scared". Now really, I understand this scare. I am a scientist, post-hopping around Europe, not tenured, and not being an experimental scientist, unlikely to become one. Don't tell me about risk and scare of making things Open. Yet, I did, and it paid off (not enough yet; still looking for a fixed academic position, as I already indicated). But in the more than 15 years I have been working now in Open Science, I have yet to find a compelling (or any) argument to back up this fear: the perceived risk of the NC clause has so far not proved any different than a fear of ghosts.

On the other hand, if I had not been involved in Open Science, I would not have worked at the top European institutes I have been working at for the past ten years.

So, what are the arguments for using the NC clause? The fear I understand, but I do not see arguments that support an NC clause being useful in an Open Science setting.

Further reading:

Saturday, December 03, 2011

CDK-JChemPaint #9: implicit hydrogens and isotopes

Next in this series (after #1, #2, #3, #4, #5, #6, #7, #8), I'll show how to add implicit hydrogens to a drawing. I actually think the BasicAtomGenerator should cover implicit hydrogens, and the ExtendedAtomGenerator anything that requires more CDK modules than just the interfaces, like isotopes. But I discovered too late that implicit hydrogens currently also require the ExtendedAtomGenerator. In fact, there are other things I would like to see changed, but I do not have the resources for that right now. So, you will need the CDK-JChemPaint jar (which is not the JChemPaint code!).

In fact, besides these points, it basically just comes down to replacing the BasicAtomGenerator with the ExtendedAtomGenerator, except for a bug I found. I'll fix that in the next release, but right now, the extended atom generator requires the AtomNumberGenerator to be loaded as well, and thus we also must turn atom numbering off. Therefore, we basically get this code snippet (here's the full code):

// generators make the image elements
List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new AtomNumberGenerator());
generators.add(new ExtendedAtomGenerator());

// the renderer needs to have a toolkit-specific font manager
AtomContainerRenderer renderer =
  new AtomContainerRenderer(generators, new AWTFontManager());

// disable atom number rendering
model = renderer.getRenderer2DModel()
model.set(WillDrawAtomNumbers.class, Boolean.FALSE)

As said, this code will be simpler in the next CDK-JChemPaint release. The result looks like:
As you can see by the amount of whitespace around the carbon, the scaling issue has not been resolved yet :(

Drawing isotope information works pretty much in the same way. In fact, we do not even have to change the rendering code, and the ExtendedAtomGenerator automatically adds the isotope information (and no, indeed, not in the expected superscript fashion; so, another thing to fix):

But alas, there are always things to fix. I'm personally not aesthetically pleased with the kerning of just CH4 either.

Thursday, December 01, 2011

CDK-JChemPaint #8: rendering of aromatic rings

CDK can render aromatic rings in two ways: with localized double bonds and with a circle reflecting the delocalized nature of the π electrons. Or, graphically:

The following two code snippets are part of full scripts available from my Groovy-JChemPaint repository, and these two drawings are created with CDK 1.4.6.

To draw aromatic rings with localized double bonds, use this code:

// localized double bonds: the BasicBondGenerator draws the bonds
List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new BasicAtomGenerator());

However, if you like the right aromatic ring style more, you replace the BasicBondGenerator by the RingGenerator, and use this set of IGenerators:

// circle style: the RingGenerator takes the place of the BasicBondGenerator
List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new RingGenerator());
generators.add(new BasicAtomGenerator());

That's it. Here's the full script.

Wednesday, November 23, 2011

Keeping it cool... tracking CPU temperatures on Debian GNU/Linux with a 3.1 kernel

My laptop is getting old (almost 12 months now), and is starting to show the first symptoms of age. Probably just dust piling up, but I have been experiencing CPU overheating. A week or two ago I found a nice one-liner to keep an eye on the CPU temperature, but a kernel upgrade to 3.1 broke that. Here is a script that works on my Linux laptop:

( cd /sys/class/thermal && while :; do line="`date`:`cat */temp | cut -c1-2 | awk '{ printf(\" %03d\", $1) }'`";   echo "$line";   sleep 5; done ) | tee LOG
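The formatting step of that one-liner can also be sketched in Java. This assumes the temp files report millidegrees Celsius (as the sysfs thermal interface does); the class and method names are made up for the illustration, and the values in main are sample inputs, not measurements. Dividing by 1000 instead of cutting the first two characters also copes with readings below 10 or above 99 degrees.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Illustrative helper; mirrors the "date: 045 047" line format of the one-liner. */
final class ThermalLineSketch {
    /** Converts raw sysfs readings (millidegrees Celsius) into one log line. */
    static String formatLine(String timestamp, List<String> rawReadings) {
        StringBuilder line = new StringBuilder(timestamp).append(':');
        for (String raw : rawReadings) {
            int milliDegrees = Integer.parseInt(raw.trim());
            // integer division rather than taking the first two characters,
            // so "9500" and "101000" are handled as well
            line.append(String.format(Locale.ROOT, " %03d", milliDegrees / 1000));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // sample values only; on a real system these would be read from */temp
        System.out.println(formatLine("sample", Arrays.asList("45000", "47000")));
    }
}
```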

Wednesday, November 09, 2011

The simplest way to make CDK commits

Every now and then I hear from people who show interest in working on the CDK. I reply to them with what is involved, and I rarely hear back from them. I know this is common for most open source projects (see also Community development), and for the CDK this is likely caused by the cumbersome process of getting a full development environment set up. Over the next months, I will make an effort to extend my Groovy Cheminformatics book to include detail after detail on how to do this. But what would also be welcome is a VM (OVF) image that has everything set up as well.

Anyway, the road to CDK commit fame does nowadays not require a full-fledged development environment. Instead, we have GitHub. Their web interface makes a lot of things easy, including source code peer review.

But in this post I would like to show how easy it is to fix small things in the CDK, by using the GitHub GUI. Of course, this post can be used for any project hosted on GitHub.

Step 1
Get a free GitHub account. (And log in.)

Step 2
Find a problem in the CDK. Start with something dead easy, like JavaDoc errors. For example, check the Nightly report for OpenJavaDocCheck errors here. These pages will return a lot of errors about missing documentation, but skip those. Pick something really simple, like this report:

There is no period to end the first sentence: 'Sums up the columns in a 2D int matrix'

JavaDoc has a special purpose for the first sentence in any JavaDoc: it serves as a summary. To detect the first sentence, it must properly end with a period.

That patch cannot get any easier. It just requires a missing period to be added.
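For readers who have not written JavaDoc before, here is what a comment with a properly terminated first sentence looks like. The class and method below are hypothetical and written for this illustration; they are not the CDK source the report refers to.

```java
/** Illustration only; not the CDK class from the report. */
final class ColumnSumSketch {
    /**
     * Sums up the columns in a 2D int matrix.
     *
     * The period after "matrix" ends the summary sentence; everything after it
     * is treated as the longer description.
     *
     * @param matrix rectangular matrix, indexed as matrix[row][column]
     * @return one sum per column
     */
    static int[] columnSums(int[][] matrix) {
        if (matrix.length == 0) return new int[0];
        int[] sums = new int[matrix[0].length];
        for (int[] row : matrix) {
            for (int column = 0; column < sums.length; column++) {
                sums[column] += row[column];
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        int[] sums = columnSums(new int[][] {{1, 2}, {3, 4}});
        System.out.println(java.util.Arrays.toString(sums));
    }
}
```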

Step 3
Identify the source file that contains the error. This has the added value that you automatically learn your way around the directory/folder hierarchy of the CDK project source. The above error refers to this class:


Now, all functional CDK code (that is, everything but the unit test suite) can be found in the source distribution under src/main, but we need the GitHub URL for that, and that is here (note that the linked OpenJavaDocCheck report is for the stable cdk-1.4.x branch, so our GitHub page for the PathTools source is too):

Check this URL carefully, and note where it keeps the branch name, the src/main folder, and the path to the source. That makes finding other source code pages later easier. This particular page looks like:

Step 4
Now, this source code page has (when logged in) an 'Edit this file' icon right of the file name line. Click this icon, and GitHub will present you with a basic, in-browser editor:

I already scrolled down a bit, to the line with the missing period from this example. Make the modification, and scroll down to the lower part of the page, and read step 5.

Step 5
With the small fix done, it is time to make the actual commit. Below the editor there is a text field to enter a commit message (important: describe what you did, even if this takes more time than the fix itself! Reason: when browsing commits in changelogs, you only see those messages!):

If you have multiple JavaDoc fixes, put them in one commit. But, preferably do not mix them with other fixes, so as to keep the commit message as well as the peer-review simple. That speeds up the reviewing process, and makes it easier for me and Rajarshi to apply to the main source tree, but more about that in the next steps.

Of course, this online editing can also be used for fixing PMD warnings, as reported by this Nightly report. However, keep in mind that you cannot recompile the code this way, and for code changes, this online approach is discouraged.

When done, press 'Propose File Change' (Rajarshi and I see a 'Commit Changes' button instead). After a new page is opened, the commit has been created, and it is time to inform us of your commit. This is done via a so-called 'pull request', as outlined in the next step.

Step 6
The last step in the process is to send out a pull request. A page to do this is normally the immediate result from hitting that 'Propose File Change' button, and should look something like the following (note that I could not make a screenshot based on the running CDK example, because I have commit rights, and the patch goes directly into the repository; I discovered that in this patch :):

So, while this one is for another GitHub project (Total-Impact is worth checking out), your page should look similar. The top grey bar shows the project name and the 'Send a pull request', confirming that this page does what we are expecting. In the blue box a comment is given on where your commit is stored, which is in your own fork of the CDK for your own GitHub account, in a branch called patch-x.

Below that blue box, reference is made to your newly-made commit, and a bit further below two text fields, a single line text box for a message 'subject' prefilled with the commit message, and a text box where you can leave a message to accompany the pull request. This message is used to put the pull request in perspective, and can be used to introduce yourself briefly, refer to a set of patches, or whatever. This message will not end up in the git repository. The more requests you make, the smaller this message will get. "Yeah, another JavaDoc fix."

Hit the green 'Send pull request' button, and you're done.

Saturday, November 05, 2011

Online SEURAT workshop: Omics data analysis for Toxicology

Another online meeting announcement (previously on Open Data and LarKC). BTW, DAWG is short for Data Analysis Working Group:

Omics data analysis for Toxicology
Tuesday November 15th 2011, 15:00 CEST, 6:00 PDT

(organized by ToxBank & SEURAT-1 DAWG)

Storing data in the ToxBank data warehouse and sharing it among SEURAT-1 is not the only goal of an omics data integration effort. Combining omics data sets available from the data warehouse will provide knowledge not visible from a single data set. The ToxBank platform aims at making these kinds of omics analysis possible.

We invited Prof. Roland Grafström to speak in a seminar on his group's recent omics data analysis work in cancer genomics. Prof. Grafström is a partner in the ToxBank project and is associated with the Karolinska Institutet medical university in Sweden, and the governmental research institute VTT in Finland.

The presentation will highlight the interpretation of gene expression data from the application of a combination of bioinformatics tools, including the Ingenuity Pathway Analysis software and the Gene Ontology. Basic concepts and terminologies will be dealt with, including for the integration of omics data, in vitro to in vivo extrapolations, as well as the retrieval and validation of biomarker genes in large data sets. Work aimed at tumor biomarker discovery in head and neck cancer will be presented, but the results will be discussed in the context of the work planned in the SEURAT cluster.

The meeting will be held online and will be organized via GoToMeeting. There are only 25 seats, so registration is important, as we will fill seats on a first-come basis. However, one ‘seat’ can host multiple scientists behind a single computer, if needed.

Please send an email to Egon Willighagen listing your name, SEURAT-1 project, and email address. Details on how to log in will be sent to that address shortly before the meeting.

Wednesday, November 02, 2011

Going to Maastricht to work on Open PHACTS

Some two months ago we decided to go back to the Netherlands, after having lived for more than three years here in Sweden. We have had a great time in our three houses, but feel a need to settle down, closer to family.

A week later Chris Evelo contacted me about a position to work on Open PHACTS. Chris and I only met for the first time in March, when I visited the group when I had a conference in Maastricht. His bioinformatics group is very much into Open Science and has a good track record in metabolism and transcriptomics analyses (something we do here at KI too), and Open PHACTS is an interesting EU project on the application of semantic web technologies to the life sciences, something I have worked a lot on in the past two years.

In the next two months here at KI, I'll be working hard on finishing my work for the other great EU project, ToxBank, on which I am working now. I personally see clearly how these projects complement each other, but have no clue if such can be given shape at an EU level, where there are intentions and consortium agreements :)

And, of course, I got a bit of funding awarded here at KI, that I will use next year for two or three visits back to Stockholm, because there is some low-hanging fruit that remains to be picked.

All in all, I am very much looking forward to my next post-doc position and definitely my last. What's after that, I won't have to care about for at least the next two years :)

Tuesday, November 01, 2011

CDK 1.4.5: the changes, the authors, and the reviewers

CDK 1.4.5 just got uploaded to SourceForge, about a month after the 1.4.4, though mere minutes after the 1.4.4 release notes. CDK 1.4.5 is the fifth bug fix release of the 1.4 series and brings another few bug fixes.

The changes include fixes to the JavaDoc generation, now outputting proper citations of PhD theses and books, a fix in the SDFWriter to inherit the IO options from the underlying (used) MDLV2000Writer, restored atom type perception in SMILES parsing if aromaticity is not actively perceived (in line with earlier 1.4.x behavior that unfortunately got broken due to another fix), a fix in the MDLV2000Reader to deal with pseudo atoms with numbers greater than 9 (thanx to John May!), a fix in the sorting of IAtomContainers, and a fix for the elusive bug in the AWTRenderer causing thin bonds (e.g. due to zooming) to become grey.
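The sorting fix concerns lists that contain null entries (see the change list). How a comparator can tolerate those is sketched here with plain Java; `ContainerStub` is a hypothetical stand-in, ordering by atom count is an assumption made for the example, and this is not the code that went into the CDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NullSafeSortSketch {
    /** Hypothetical stand-in for a container; only its atom count matters here. */
    static final class ContainerStub {
        final int atomCount;
        ContainerStub(int atomCount) { this.atomCount = atomCount; }
    }

    /** Orders containers by atom count and moves null entries to the end. */
    static final Comparator<ContainerStub> BY_ATOM_COUNT_NULLS_LAST =
        Comparator.nullsLast(Comparator.comparingInt((ContainerStub c) -> c.atomCount));

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<ContainerStub> sorted(List<ContainerStub> containers) {
        List<ContainerStub> copy = new ArrayList<>(containers);
        copy.sort(BY_ATOM_COUNT_NULLS_LAST);
        return copy;
    }

    public static void main(String[] args) {
        List<ContainerStub> input =
            Arrays.asList(new ContainerStub(6), null, new ContainerStub(1));
        for (ContainerStub container : sorted(input)) {
            System.out.println(container == null ? "null" : String.valueOf(container.atomCount));
        }
    }
}
```

Without the nullsLast wrapper, the inner comparator would dereference the null entry and throw a NullPointerException during the sort.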

The changes
  • Use a minimal stroke width in the AWT output (fixes #3295256) 1387b7b
  • Changed how data files are copied: copy those specified in the src/META-INF/*.datafiles (fixes #3430342) 68536e1
  • Perceive atom types also when aromaticity from the SMILES is kept 2413197
  • Added unit test to make sure atom types are also perceived when aromaticity from SMILES is kept d6b8c8e
  • Removed broken link and fixed syntax of @cdk.cites. 22c59fc
  • The SDFWriter now accepts all MDLV2000Writer's IOSettings too (fixes #3392485) aa6ca5f
  • Unit test for bug #3392485: SDFWriter not accepting MDLV2000Writer IO settings ab54ed5
  • Fixed @TestClass annotation to point to the correct class a7cbe48
  • Updated unit test to be independent of atom index and just match the coordinates 8d3eed0
  • Fixed bugged when reading MDL V2000 files. If the atom number of a pseudo atom was greater then 9 it would not be read correctly fd90ed5
  • Depend on standard to, to have access to AtomContainerComparator 76db769
  • Create IMolecule's instead of IAtomContainer's because IMoleculeSet can only contain the former; fixed the exception currently thrown by e.g. MoleculeSetTest 1aae7fa
  • Fixed sorting with null IAtomContainers (based on a suggestion by Mark in the bug report #3093241) 1bf194b
  • Missing unit test for AtomContainerSet.sort(Comparator) 0585bcd
  • Added a unit test to check that molecular descriptors do not throw Exceptions when disconnected structures are passed 853ca50
  • Added support for book and phdthesis reference types 277dd03
The authors
17  Egon Willighagen
 1  John May
The reviewers
10  Rajarshi Guha 

CDK 1.4.4: the changes, the authors, and the reviewers

CDK 1.4.5 just got uploaded to SourceForge, and when I looked up the link to the notes for 1.4.4 I noted I had not released those notes yet. So, here are those first.

CDK 1.4.4 is the fourth bug fix release of the 1.4 series and brings another few bug fixes. The changes include new cobalt atom types, a fix to ensure that all molecular descriptors are properly recognized by the build system, and a fix for the Log4J-based LoggingTool, to properly handle nulls. So, not so many changes, which probably explains why I forgot to blog about it earlier.
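The descriptor fix boils down to skipping the right number of characters after a JavaDoc tag (see the change list: "@cdk.set" is 8 characters long, not 11). A minimal sketch of the idea follows; the class, the method, and the example value are illustrative and not taken from the CDK build code.

```java
/** Illustration of deriving the offset from the tag itself instead of hard-coding it. */
final class TagValueSketch {
    static final String TAG = "@cdk.set";

    /** Returns the text after the tag on a JavaDoc line, or null if the tag is absent. */
    static String valueAfterTag(String line) {
        int start = line.indexOf(TAG);
        if (start < 0) return null;
        // skip exactly TAG.length() characters (8 for "@cdk.set"); a hard-coded
        // offset that is too large would drop the first characters of the value
        return line.substring(start + TAG.length()).trim();
    }

    public static void main(String[] args) {
        // example line made up for the illustration
        System.out.println(valueAfterTag(" * @cdk.set qsar-descriptors"));
    }
}
```

Deriving the offset from TAG.length() means the number cannot drift out of sync with the tag name.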

The changes
  • Co final abf6b33
  • Fixed adding all descriptors to the qsar-descriptors.set file by using the correct number of chars to skip (@cdk.set length is 8 not 11). a75eece
  • Provide debug info on the classpath in which will be searched 88070ac
  • Check for a null input 5cd07d0
  • Unit test for a NullPointerException in the LoggingTool caused by a null message in an Exception. The error only shows up when debugging is turned on. 1e07adc
  • Split up to test the LoggingTool also with debugging turned on; so far, only with debugging turned of was tested 58bb243
The authors
 7  Egon Willighagen
 1  Gilleain Torrance
The reviewers

3  Rajarshi Guha 
1  Egon Willighagen

Oscar4 paper: text mining in Bioclipse (and everywhere else, of course)

The Oscar4 paper (CC-BY, just like the screenshots of the paper below) has been out for some days already, and now the formatting has finished:

I spotted a rogue 'http://' in the code example b) in Appendix B:

I'll see what I can do about that, but the API might evolve a bit anyway.

That leaves me to mention that Bioclipse has an Oscar extension (Bioclipse has a lot of functionality nowadays, in fact), and that I blogged several times on Oscar4 when I was working with the other authors on the refactoring last year.

Tuesday, October 25, 2011

"Post-doc with experience in CDK programming wanted"

SMARTCyp (see papers below) is an integrated computational approach that mixes cheminformatics with molecular modeling approaches to predict the metabolic fate of molecules. This fate is important to various biological aspects of small molecules: metabolism can activate a prodrug into a drug, make a toxic compound non-toxic, or make a non-toxic compound risky.

The tool has been well received by the community, complementing other approaches. Now, the reason why I blog about these papers is that the tool uses the CDK for the cheminformatics parts, which I find really cool. In fact, the project has resulted in good feedback on the CDK. Moreover, the project received further funding, creating a short-term open position to continue research on the enzymatic reactivity of molecules in the cytochrome P450 family, as outlined below.

So, this peer-review is a bit more on the impact of the CDK and SMARTCyp on the academic landscape than on the content of the paper.

Patrik Rydberg wrote on the CDK LinkedIn group and the CDK user mailing list about an open post-doc position where CDK expertise is welcomed:
    I'm seeking candidates for a post-doc position in applied cheminformatics at the University of Copenhagen. We are working on drug metabolism prediction models, and our results are as far as possible made into open source software based on the CDK. The group has previously developed the SMARTCyp site-of-metabolism prediction software which is based on CDK, and this project aims to extend the scope of our cytochrome P450 project.

    Employer: department of medicinal chemistry, faculty of pharmaceutical sciences, University of Copenhagen
    Location: Copenhagen, Denmark
    Position: post-doc
    Duration: 7 months, starting january 2012
    Project: drug metabolism by cytochromes P450

    Experiences required:
    Machine learning methods
    java programming

    Bonus for experience in java programming using the Chemistry Development Kit (CDK), experience in development of ligand based virtual screening methods, and experience of work on the cytochrome P450 enzyme family.
The SMARTCyp paper can be found linked to below, with the details of how the CDK and SMARTCyp interoperate and make P450 predictions.

Rydberg, P., Gloriam, D., & Olsen, L. (2010). The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics, 26 (23), 2988-2989. DOI: 10.1093/bioinformatics/btq584

Rydberg, P., Gloriam, D., Zaretzki, J., Breneman, C., & Olsen, L. (2010). SMARTCyp: A 2D Method for Prediction of Cytochrome P450-Mediated Drug Metabolism. ACS Medicinal Chemistry Letters, 1 (3), 96-100. DOI: 10.1021/ml100016x

Saturday, October 22, 2011

ChEMBL-RDF: Uploading data to Kasabi with pytassium

I reported earlier how I uploaded the ChemPedia (RIP) data onto Kasabi. But for ChEMBL-RDF I have used the pytassium tool, not just because it has a cool name :) I discovered yesterday, however, that I did not write down in this lab notebook what steps I needed to take to reproduce it. And I just wanted to upload new triples to the ChEMBL-RDF data set on Kasabi.

The new triples I wanted to upload, link the new public CHEMBL identifiers (like CHEMBL25 for aspirin) to the internal ChEMBL database identifier I used for ChEMBL 09 for the URIs. So, I am adding a lot of triples like:

<> <>

And the pytassium code I use to upload this to Kasabi looks like:

import pytassium

# 'XXX' stands in for the Kasabi API key
dataset = pytassium.Dataset('chembl-rdf','XXX')

# Store the contents of an N-Triples file
dataset.store_file('chemblids.nt', media_type='text/plain')

So, that omission in my log book has been corrected now.
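
For the lab notebook, here is also a minimal Python sketch of how a file like chemblids.nt could be generated before uploading it. Mind you, everything specific in it is an assumption made for illustration: the example.org URIs, the use of owl:sameAs as the linking predicate, the internal identifier 1, and the helper names ntriple() and link_triples() are hypothetical, and are not the URIs or the predicate used in ChEMBL-RDF.

```python
def ntriple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples statement whose three terms are all URIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Hypothetical placeholders: the real ChEMBL-RDF URIs are not shown in this post.
PUBLIC_BASE = "http://example.org/chemblid/"
INTERNAL_BASE = "http://example.org/molecule/m"
# Assumption: owl:sameAs is used here only as an example of a linking predicate.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def link_triples(id_map):
    """Create one triple per (public CHEMBL id, internal database id) pair."""
    return [
        ntriple(PUBLIC_BASE + public_id, SAME_AS, INTERNAL_BASE + str(internal_id))
        for public_id, internal_id in sorted(id_map.items())
    ]


# The internal identifier 1 is made up for this example.
lines = link_triples({"CHEMBL25": 1})
print("\n".join(lines))
```

Writing these lines to a file, one statement per line, gives something that can be stored with the text/plain media type, as done above.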

Thursday, October 20, 2011

CDK & File Formats #1: MDL molfiles and bond order 4

I just had a conference call on one of the translational cheminformatics projects I am involved in: Bioclipse-OpenTox. A paper about this project has been submitted, and we are writing up a more practice-oriented book chapter (almost done). In writing up a use case, we ran into a recurrent problem: proper cheminformatics handling of input files. Ola suggested that I start writing more extensive documentation on what users of the CDK are supposed to do when reading a file. So, here I start a new series.

But before I start writing up how to work with MDL molfiles in the CDK, I would like to stress three key design principles in the CDK:
  1. there is no single solution to everything (aka there are multiple solutions to the same problem)
  2. algorithms must be modular (the LEGO building block approach)
  3. the user is responsible for using the right blocks at the right time
These principles have a profound effect on the usability of the CDK. And here I must stress the point I made recently on usability: what happens if you neglect less abundant personas. Well, in this case, the abundance is actually somewhat different. But the CDK libraries target a few personas. One of these is the scientist who uses cheminformatics as a mere tool, and who doesn't know graph theory, let alone file formats; this persona works in the field of translational cheminformatics. Let's call him Tony. The other persona has a more extensive education in cheminformatics (e.g. former Gasteiger lab, Sheffield, former CAOS/CAMM (like me), etc) or is actively involved in cheminformatics research (like that of Christoph, and many, many more). This persona we will call Carry.

The CDK, with its limited resources, must target both Tony as well as Carry; both have widely different needs, and somewhere, in our free hours, we must work on solutions for both personas. Carry and Tony are, of course, exaggerations, and all readers (and me too) are linear combinations of these personas. In fact, you can even be a Carry in some parts of the CDK, while being a Tony in others. The CDK has so much functionality nowadays, that even I have unexplored corners.

And, practically, the building block approach is important to Carry (she might want to plug in her own aromaticity model), but a killer for Tony.

Missing information
A prominent, recurrent problem for the first persona is dealing with missing information. And input files have missing information. Some formats are more explicit than others. This series will focus on these aspects, and discuss how the CDK can be used to add that missing information.

Another important aspect is that the CDK data model cannot hold all information. For example, I have no clue how the CDK should read MDL files with a muonium (browse the cdk-devel mailing list archives of this month). Here too, this is not always clear to Tony, who does not have the time to read file format specifications, nor CDK interfaces. He just wants the CDK to do its thing which it is supposed to be good at.

MDL formats
So, here we are. We have an MDL .mol file. These files are pretty much the community standard, and even pretty Open too. You can now find the specs in the ctfile.pdf, readily available on the web. Actually, they are no longer called MDL formats, but Symyx formats, umm, Accelrys formats. This family includes a number of file formats, including the aforementioned Accelrys molfile, the Symyx SD file, but also query formats, used to store queries against their database software.

Like any file format, they support a limited number of features. For example, MDL molfiles cannot represent a bond order 4, a quadruple bond. Organic chemistry doesn't need them. Moreover, hydrogens are often implicit, as they can easily be added later, and both memory and disk space were expensive (think of the 1980s). Stereochemistry is encoded with wedge bonds, among other mechanisms.

Well, more recent MDL formats have become more powerful. The V3000 format can do much more than the V2000 format, or even the pre-V2000 format.

Now, the trigger for the start of this series is the bond order 4 in MDL files. Strictly speaking this is not part of the molfile format, nor of the SD file format; instead, it's part of the query format. However, the community ignored that part of the specification, and molfiles and SD files commonly use this query type to represent aromatic bonds. Well, or bonds that can be either single or double.

Now, the CDK does not have a structure to represent a single OR a double bond order. The CDK only has SINGLE, DOUBLE, TRIPLE, and QUADRUPLE (see IBond.Order). Moreover, it has a separate mechanism to indicate if a bond is aromatic. A flag is used for that. This allows the CDK to store both kekule-ized bond order localization and aromaticity perception information separately.

So, if an MDL molfile (or SD file) is read with the CDK in RELAXED mode with the MDLV2000Reader, order 4 bonds are read as SINGLE bonds with a flag indicating that the bond is aromatic. If only the CDK had IBond.Order.UNKNOWN. This is scheduled for master. Mind you, this is a complicated patch, because a lot of algorithms inspect the bond order information, and these will have to be updated to handle UNKNOWN bond orders.
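
To illustrate the idea of keeping the bond order and the aromaticity flag apart, here is a toy model in Python. It is a sketch of the concept only, not the CDK API: the names Order, Bond and bond_from_molfile() are made up for this example, and the only thing taken from the text above is that a molfile bond type 4 ends up as a SINGLE bond with an aromatic flag.

```python
from dataclasses import dataclass
from enum import Enum


class Order(Enum):
    # The four bond orders the post lists for IBond.Order.
    SINGLE = 1
    DOUBLE = 2
    TRIPLE = 3
    QUADRUPLE = 4


@dataclass
class Bond:
    begin: int
    end: int
    order: Order
    aromatic: bool = False  # kept separately from the bond order


def bond_from_molfile(begin: int, end: int, molfile_type: int) -> Bond:
    """Translate a V2000 bond type into the toy model (relaxed reading)."""
    if molfile_type == 4:
        # Query type 4 ("aromatic"): there is no "single or double" order,
        # so store SINGLE as a placeholder and set the aromatic flag.
        return Bond(begin, end, Order.SINGLE, aromatic=True)
    if molfile_type in (1, 2, 3):
        return Bond(begin, end, Order(molfile_type))
    raise ValueError(f"unsupported bond type: {molfile_type}")


bond = bond_from_molfile(1, 2, 4)
print(bond.order.name, bond.aromatic)  # SINGLE True
```

The placeholder is exactly the weak spot: after reading, a flagged SINGLE bond can no longer be told apart from a real single bond in an aromatic ring, which is why the bond orders need to be deduced afterwards.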

This is a nasty coincidence (interaction effect) killing Tony's use case: the MDL molfiles in the wild have information that the CDK cannot represent, sadly. Now, if users had been paying for CDK releases, we could have assigned a developer to it. We have the mechanisms in place to buy a copy of the CDK, so that's not really an excuse (price negotiable; maybe 4999 SEK, but you are welcome to buy a campus-wide CDK version for more). If enough people / companies would do this, we could hire a developer to implement this particular use case. This would be a great way to support the project!

Anyway, the purpose of this series is not to rant about these things, but to describe how file formats might be read with the CDK. Here's a recipe with inline comments (in the Groovy syntax):

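import org.openscience.cdk.Molecule;
import org.openscience.cdk.io.MDLV2000Reader;
import org.openscience.cdk.io.IChemObjectReader.Mode;
import org.openscience.cdk.smiles.DeduceBondSystemTool;
import org.openscience.cdk.tools.CDKHydrogenAdder;
import org.openscience.cdk.tools.manipulator.AtomContainerManipulator;

// read the molfile in RELAXED mode, so that bond order 4 is accepted
reader = new MDLV2000Reader(
  new File("data/azulene4.mol").newReader(),
  Mode.RELAXED
);
azulene = reader.read(new Molecule());

// perceive atom types (the method name is misspelled in the CDK API)
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(azulene);

// add missing hydrogens
adder = CDKHydrogenAdder.getInstance(azulene.getBuilder());
adder.addImplicitHydrogens(azulene);

// if bond order 4 was present,
// deduce bond orders
dbst = new DeduceBondSystemTool();
azulene = dbst.fixAromaticBondOrders(azulene);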

Carry would know this. In fact, Carry would probably have some comments on this recipe. Well, Carry can leave those in the comments of this post.

Tuesday, October 18, 2011

The Blue Obelisk Shoulders for Translational Cheminformatics

I guess readers of my blog have already heard about it via other channels (e.g. via Noel's blog post), but our second Blue Obelisk paper is out. In the past five-ish years since Peter instantiated this initiative, it has created a solid set of shoulders on which to develop Open Source-based cheminformatics solutions. I created the following diagram for the paper, showing how various Blue Obelisk projects interoperate (image is CC-BY, from the paper):

It shows a number of Open Standards (diamonds), one Open Data set (rectangles), and Open Source projects (ovals). What the diagram does not show is the huge number of further Open Source cheminformatics projects around that use one or more of the components listed here, but which do not link themselves to the Blue Obelisk directly. And there are many indeed, both proprietary and Open.

I am proud of this diagram: it really shows that the interoperability we set out in the first paper worked out very well! This makes the Blue Obelisk an excellent set of shoulders to do translational cheminformatics.

Translational cheminformatics?? Well, I have been looking for a while for a good term for my research regarding all that hacking on the CDK, Bioclipse, etc. Now, that's the translation of my core molecular chemometrics research to other scientific fields, like metabolomics, toxicology, etc.

Guha, R., Howard, M., Hutchison, G., Murray-Rust, P., Rzepa, H., Steinbeck, C., Wegner, J., & Willighagen, E. (2006). The Blue Obelisk - Interoperability in Chemical Informatics. Journal of Chemical Information and Modeling, 46 (3), 991-998. DOI: 10.1021/ci050400b

O'Boyle NM, Guha R, Willighagen EL, Adams SE, Alvarsson J, Bradley JC, Filippov IV, Hanson RM, Hanwell MD, Hutchison GR, James CA, Jeliazkova N, Lang AS, Langner KM, Lonie DC, Lowe DM, Pansanel J, Pavlov D, Spjuth O, Steinbeck C, Tenderholt AL, Theisen KJ, & Murray-Rust P (2011). Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on. Journal of Cheminformatics, 3 (1), 37. PMID: 21999342. DOI: 10.1186/1758-2946-3-37

Wednesday, October 12, 2011

Tricks I learned today #2: importing ontologies into a Semantic MediaWiki

I learned a second trick today (see also this first); this one is about the Semantic MediaWiki (SMW). I was using a trick I learned from RDFIO before, setting Equivalent and Original URIs (though the difference between those, I lost). But I ran into the problem that these equivalent URIs cannot contain hashes (#), or not always it seems.

After some googling, I did not find an answer, and turned to the SMW IRC channel. Saruman was helping out and pointed me to the Equivalent URI wiki page. I had looked at that earlier, but now turned to the Import vocabulary page. While this was not what Saruman wanted me to look at, it did turn out to be a nice workaround. The wiki page shows how to import external ontologies. This trick requires you to craft a wiki page with a special name, starting with MediaWiki:Smw_import_ followed by a namespace. It exemplifies this with the FOAF ontology, which is in fact an ontology I was also using equivalent URIs for.

So, I created a new MediaWiki:Smw_import_foaf page, with this content:|
  [ Friend Of A Friend]

And another page, MediaWiki:Smw_import_rdfs, with this:|
  [ RDF Schema]

Then, in a Property page, which I want to make equivalent with a property in FOAF or RDF Schema, I can now simply use [[imported from::foaf:homepage]] or [[imported from::rdfs:seeAlso]], in Property:Has_homepage and Property:See_also respectively.

That's a neat trick, /me thinks!

Tricks I learned today #1: as.integer() on factor levels

I normally work with fully numerical data, not categorical data. R, when using read.csv(), seems to recognize such categories and marks the column as having factor levels. This is useful indeed. However, I wanted to make a PCA biplot on this data, so was looking for ways to convert this to class numbers. After some googling, Anna and I ran into as.integer(), which can be used on the factor levels. So, today I learned this trick:

> a = as.factor(c("A", "B", "A", "C"))
> b = as.integer(factor(a))

Well, probably basic to many, it was new to me :)

Now, wondering if it is equally easy to convert it into a multi-column matrix where each column indicates class membership (thus, resulting in three columns for the above...). That's another trick I need to learn...
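
As a note to self, the logic of both steps is simple enough to sketch. The following is written in Python rather than R, and the helper names factor_codes() and indicator_matrix() are made up for this example; it assumes the levels are the sorted unique values, numbered from 1, which is what R does for the character vector above.

```python
def factor_codes(values):
    """Return (levels, codes): sorted unique levels and 1-based integer codes."""
    levels = sorted(set(values))
    index = {level: i + 1 for i, level in enumerate(levels)}
    return levels, [index[value] for value in values]


def indicator_matrix(values):
    """Return (levels, rows) with one 0/1 column per level, one row per value."""
    levels, codes = factor_codes(values)
    rows = [
        [1 if code == column else 0 for column in range(1, len(levels) + 1)]
        for code in codes
    ]
    return levels, rows


levels, codes = factor_codes(["A", "B", "A", "C"])
print(levels, codes)  # ['A', 'B', 'C'] [1, 2, 1, 3]

_, matrix = indicator_matrix(["A", "B", "A", "C"])
for row in matrix:
    print(row)
```

Each row of the matrix has exactly one 1, in the column of the class that the sample belongs to, so the four samples above give a four-by-three matrix.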

Tuesday, October 11, 2011

Blogs I Follow: Henry Rzepa

Just in case you have not run into Henry's blog yet, check it out. His blog makes me so jealous I did not follow up on my basic quantum chemistry education. Implementing Hartree-Fock in Fortran is not nearly as interesting or useful as the stuff he has been blogging about. A second reason you should, is his brilliant use of Jmol (Henry is one of those using Jmol for more than 10 years). This is his blog in action:

His double gravatar is not vanity, but a bug in one of the blog extensions he is using, which creates those blue z icons. I think the French in the blog title is deliberate, although, as I always understood, unusual for the British.

Henry, if you're reading this... what about adding permalinks to those Jmol visualizations (or DOIs, like data cites)? Second, I'd love to be able to download a visualization, and open it in Jmol or Bioclipse. Would that be possible in CML or JVXL (think datument)?

Useless statistics: blog visitor by OS

The market is seriously changing now. Another year, and Microsoft Windows is no longer the majority OS. Of course, my blog is very specific, and these statistics do not map well to global market shares.

Monday, October 10, 2011

Call for Help: categorizing Open Data repositories for chemistry

Where to host chemistry data? This was the question two people asked a few weeks ago:
I had these two blog posts open in my browser since about the time they were posted, intending to reply. But I could not come up with a good answer, despite hoping to do so. For RDF-based data there are a few options now, such as Kasabi and Science 3.0. Also, for crystallography data there is the Crystallography Open Database, and for quantum chemical calculations there is Quixote. And, of course, annotated NMR spectra can go into the NMRShiftDB.

But for chemistry data in general I do not know a solution. What to do with JDX files, with images of chromatograms? BioTorrents perhaps? But that is mostly for large data sets, and does not have a clear indexing approach. ChemSpider, as Jean-Claude has been doing for spectra (see this YouTube video)? ChemSpider does not have a solution to extracting the Open Data.

These are features such a repository must have:
  1. allows you to specify who is the owner, creator, or similar
  2. allows you to license the data, or, to waive your rights, per Panton Principles
  3. allows users to bulk download Open Data
  4. allows users to automate data extraction
  5. data should be indexed, at least by InChI (which just got a 1.04 release)
  6. support any format
Optionally, these extras are welcome:
  1. semantic annotation of repository content
  2. provide CMLRSS feeds of new content
So, hereby this call for help: let's categorize what repositories are around that fulfill the 6 required features (or come very close). We can start off using regular blogging practices, by blogging solutions, ideas, comments, etc in reply, or use the commenting facilities here. Any activity in this area is appreciated and most welcomed by the community.

Saturday, October 08, 2011

An ontology for QSAR and cheminformatics

QSAR and QSPR are the fields that statistically correlate chemical substance features with (biological) activities (QSAR) or properties (QSPR). The chemical substances can be molecular structures, drugs (which are not uncommonly mixtures), and true mixtures like nanomaterials (NanoQSAR). Readers of this blog know I have been working towards making these kinds of studies more reproducible for many years now.

Parts of this full story include the Blue Obelisk Data Repository (BODR), QSAR-ML, the CDK for descriptor calculation, the Blue Obelisk Descriptor Ontology (BODO, doi:), still used by the CDK, and in the past by JOELib too, and much, much more. Really, I still feel that the statistics is by far the easiest bit in QSAR modeling.

New in this list of tools to make QSAR more reproducible is the CHEMINF ontology, which further formalizes cheminformatics computation. In a collaboration with Janna and Christoph (EBI), Michel and Leonid (Carleton University), and Nico (formerly at Cambridge, now at CSIRO), we have cooked up an ontology, and the computational bits of it are captured by the figure below, from the paper that just appeared in PLoS ONE.

Both the paper and the ontology have a Creative Commons license. The ontology has already been used by Leonid in other papers, and I have been using it already in the RDF-ed version of ChEMBL.

Next steps for me regarding this ontology are to convert BODO to be based on CHEMINF, but highly interesting too is a reformulation of QSAR-ML to be based on CHEMINF. The QSAR markup language was started long before RDF came into the picture, so please forgive us for not using RDF from the start there.

One particularly interesting aspect this ontology captures is the difference between molecular entities and mixtures. Not uncommonly, QSAR studies correlate drugs to their binding affinities, even if those drugs are in fact mixtures of stereoisomers. While 0D, 1D, and 2D descriptors are not affected, geometrical descriptors most certainly are. Moreover, the modeled endpoint is very possibly the property of only one of the stereoisomers, most certainly for binding affinities. Yet, many QSAR study reports in the literature do not record such details. The CHEMINF ontology defines the terms you need to publish such details.

Hastings, J., Chepelev, L., Willighagen, E., Adams, N., Steinbeck, C., & Dumontier, M. (2011). The Chemical Information Ontology: Provenance and Disambiguation for Chemical Data on the Biological Semantic Web. PLoS ONE, 6 (10). DOI: 10.1371/journal.pone.0025513

Sunday, October 02, 2011

CDK's getNaturalExactMass(): isotopic and atomic weights

A recurrent question for the CDK project is about AtomContainerManipulator.getNaturalExactMass() and why it returns NaN for some elements. There are various incarnations of the issue, but the key here is the difference between the various weights a molecule can have.

Monoisotopic molecular weight
The weight of a single molecule is well-defined, but requires you to know which isotopes are present in the molecule, which you typically do not. Each isotope of an element has a natural abundance. For example, carbon isotopes found on earth are mostly 12C, and a bit of 13C. In fact, about 99% of all carbon is 12C and only about 1% is 13C. The same goes for 1H and 2H: the first is way more abundant than the latter. This means that the molecular weight of a single methane molecule can only be calculated if you know which isotopes are present in the molecule. But the most frequently occurring combination is the molecule with just the major isotopes, SMILES: [1H][12C]([1H])([1H])[1H]. A few elements have two isotopes which are almost equally abundant, such as bromine.

Natural molecular weight
However, in experimental chemistry we are not looking at individual molecules (well, we can nowadays), but typically at substances of those molecules. Substances contain very many single molecules (remember Avogadro's constant). So, we have a mixture of molecules with different isotope ratios. As indicated earlier, the majority of those molecules contain only major isotopes, but for a substance we rather take the natural mass of the molecule, which uses an average weight for each element (called the atomic weight), given the natural abundances of the element's isotopes. Mind you, these atomic weights are not constant, as recently made the news. Only isotopic weights are constant; the atomic weights depend on the ratio of isotopes, which varies around the world.
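
The difference between the two weights is easy to show for methane. The Python sketch below is an illustration only and not how the CDK implements it: the function names are made up, and the isotopic masses and abundances are rounded, commonly tabulated values that I typed in for this example.

```python
# (isotopic mass in u, natural abundance) per isotope; rounded, commonly
# tabulated values, entered by hand for this illustration.
ISOTOPES = {
    "C": [(12.0, 0.9893), (13.0033548, 0.0107)],
    "H": [(1.00782503, 0.999885), (2.01410178, 0.000115)],
}


def monoisotopic_mass(formula):
    """Mass of the molecule built from the most abundant isotope of each element."""
    return sum(
        count * max(ISOTOPES[element], key=lambda isotope: isotope[1])[0]
        for element, count in formula.items()
    )


def atomic_weight(element):
    """Abundance-weighted average of the isotopic masses of one element."""
    return sum(mass * abundance for mass, abundance in ISOTOPES[element])


def natural_mass(formula):
    """Mass based on atomic weights, as seen for a substance."""
    return sum(count * atomic_weight(element) for element, count in formula.items())


methane = {"C": 1, "H": 4}
print(round(monoisotopic_mass(methane), 4))  # 16.0313
print(round(natural_mass(methane), 4))       # 16.0425
```

The two numbers differ by only about 0.01 u for methane, but the gap grows with the size of the molecule, and it becomes large for elements like bromine with two almost equally abundant isotopes.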

But, returning to the AtomContainerManipulator.getNaturalExactMass() method, it should be noted that atomic weights are only defined for elements which are in fact found on earth: elements which occur naturally, and for which the isotopic abundances can be determined. Thus, atomic weights are not defined for synthetic elements, of which we have more and more.

So, if you call this getNaturalExactMass() method for a molecular structure which has one or more elements that do not occur naturally, you will get NaN as the answer. In fact, I guess the method should throw a CDKException with a message like "Hey dude, you have non-natural elements in this molecule!".

Getting practical: technetium
Technetium (Tc) is such an element: it does not have any naturally occurring isotopes, so the CDK cannot calculate an atomic weight. And that is the answer to this bug report.
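
To make the two behaviors concrete, here is a small Python sketch of a natural mass calculation that returns NaN for technetium, next to a stricter variant that complains instead. This is not the CDK code: the function names and the exception class are hypothetical, and the table holds just a few rounded atomic weights for the example.

```python
import math

# Rounded atomic weights for a few elements; Tc is deliberately absent,
# because no atomic weight is defined for it.
ATOMIC_WEIGHTS = {"H": 1.00794, "C": 12.0107, "O": 15.9994}


def natural_mass(elements):
    """Sum the atomic weights; the result is NaN if any element has none."""
    return sum(ATOMIC_WEIGHTS.get(element, math.nan) for element in elements)


class NoNaturalAbundanceError(ValueError):
    """Raised when an element has no defined atomic weight (hypothetical)."""


def natural_mass_strict(elements):
    """Like natural_mass(), but fail loudly instead of returning NaN."""
    missing = sorted({e for e in elements if e not in ATOMIC_WEIGHTS})
    if missing:
        raise NoNaturalAbundanceError(
            "no atomic weight defined for: " + ", ".join(missing)
        )
    return sum(ATOMIC_WEIGHTS[element] for element in elements)


print(natural_mass(["C", "H", "H", "H", "H"]))    # an ordinary number
print(natural_mass(["Tc", "O", "O", "O", "O"]))   # nan
```

The NaN variant quietly propagates into whatever is calculated next, while the strict variant tells the user right away which element is the problem.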