Saturday, December 31, 2011

CDK 1.4.7: the changes, the authors, and the reviewers

In preparation for the next (4th) edition of my Groovy Cheminformatics book on cheminformatics with the CDK, I found a show-stopper bug, fixed it, sent in the patch, and Rajarshi quickly reviewed and applied it to the cdk-1.4.x branch. This particular bug was a null pointer exception that was fixed not so long ago in the log4j implementation, but turned out to be present in the logger to STDOUT too.
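To illustrate the kind of fix involved (the change list has a "Check for a null input" commit), here is a minimal sketch of a null-safe logger to STDOUT. The class and method names are hypothetical stand-ins, not the actual CDK logging code, and the exact shape of the original bug is an assumption.

```java
public class NullSafeStdoutLoggerSketch {

    // Turns any message, including null, into printable text.
    // Hypothetical helper; the real CDK logger may be structured differently.
    static String format(Object message) {
        if (message == null) {
            // Guard: calling message.toString() on null would throw
            // a NullPointerException.
            return "null";
        }
        return message.toString();
    }

    // Writes a debug line to STDOUT.
    static void debug(Object message) {
        System.out.println("DEBUG: " + format(message));
    }

    public static void main(String[] args) {
        debug("reading molfile");
        debug(null); // handled by the guard instead of failing
    }
}
```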

This release also fixes the reading of aliased atoms in MDL V2000 molfiles, thanx to another bug fix patch from John May (thanx!), and formally deprecates the nonotify implementation, which has already been removed from the master branch. The silent module should be used instead, which has the same functionality but cleaner code, and is faster.

However, one important change you should take notice of is an API change in the IIteratingChemObjectReader interface. The change is minor, but useful. The interface is now typed, and implementing classes implement IIteratingChemObjectReader<IChemModel> (IteratingPCSubstancesXMLReader) or IIteratingChemObjectReader<IAtomContainer> (IteratingMDLReader, IteratingPCCompoundASNReader, IteratingPCCompoundXMLReader, IteratingSMILESReader). This means that the iterator's next() method now returns an IChemModel or an IAtomContainer, and that casting in the calling code is no longer needed.
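The effect of typing the iterator can be sketched without the CDK on the classpath. In the snippet below, Molecule, IteratingReader and MoleculeReader are hypothetical stand-ins for IAtomContainer, IIteratingChemObjectReader and a reader like IteratingMDLReader; only the generics pattern is meant to carry over, not the actual CDK signatures.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class TypedReaderSketch {

    // Hypothetical stand-in for a chemistry object such as IAtomContainer.
    static final class Molecule {
        final String name;
        Molecule(String name) { this.name = name; }
    }

    // Stand-in for the typed reader interface: the element type is a parameter.
    interface IteratingReader<T> extends Iterator<T> { }

    // Stand-in for a concrete reader that always yields Molecule objects.
    static final class MoleculeReader implements IteratingReader<Molecule> {
        private final Iterator<Molecule> source;
        MoleculeReader(List<Molecule> molecules) { this.source = molecules.iterator(); }
        @Override public boolean hasNext() { return source.hasNext(); }
        @Override public Molecule next() { return source.next(); }
    }

    // Typed interface: next() already returns Molecule, so no cast is needed.
    static String firstName(IteratingReader<Molecule> reader) {
        return reader.hasNext() ? reader.next().name : null;
    }

    // Raw iterator, as before the change: the caller has to cast.
    @SuppressWarnings("rawtypes")
    static String firstNameUntyped(Iterator reader) {
        return reader.hasNext() ? ((Molecule) reader.next()).name : null;
    }

    public static void main(String[] args) {
        List<Molecule> input =
            Arrays.asList(new Molecule("methane"), new Molecule("ethane"));
        System.out.println(firstName(new MoleculeReader(input)));
        System.out.println(firstNameUntyped(new MoleculeReader(input)));
    }
}
```

With the typed version the compiler also catches a reader of the wrong element type, which the cast would only reveal at run time.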

The changes
  • Another hot fix: use @link with the fully qualified class name, and removed the import, to fix a dependency issue 0e71cba
  • Added a @deprecated tag on the nonotify data classes, pointing to the silent implementation d283686
  • Fixed dependencies 5ef20b1
  • Extend the abstract suite, so as to run the test for the null pointer exception 269c84c
  • Work with the interface 106e5ec
  • Check for a null input fb35047
  • Removed unneeded deps on CMLXOM for JNI-InChI (thanx to Dmitry Katsubo). 8524891
  • Added missing imports of IAtomContainer, needed by the last two patches, but which were not needed in master because we did all that IMolecule/IAtomContainer refactoring already 856f83c
  • Proper typing of the DefaultIteratingChemObjectReader, so that other classes can safely extend it (thanx to Nina) 6de90d3
  • Typed the iterator, removing the need for casting when used 44b7e76
  • Added John May as author 1142dc6
  • Also check that there are two such R1 atoms 962b7d2
  • Added modifications and unit test for alias atom naming patch bd4b094
  • Corrected alias atom naming in MDLV2000Reader and added test 23132a0
The authors

13  Egon Willighagen
  2  John May

The reviewers

6  Rajarshi Guha
2  Egon Willighagen
1  Nina Jeliazkova

CDK 1.4.6: the changes, the authors, and the reviewers

OK, I forgot to write those up again :(

Release 1.4.6, of about a month ago, fixes a few bugs, including broken JavaDoc, atom type perception when SMILES are parsed while keeping the lower case formalism as aromaticity indicator (I will not discuss the pros and cons of that here), and the Chi index descriptors for sulphurs. This release also introduces a new fingerprint, based on an extensive list of biologically relevant substructures identified by Klekota and Roth in 2008 (doi:10.1093/bioinformatics/btn479). This functionality was backported by Jonathan from the PaDEL software by Yap Chun Wei. The rest is a bunch of small code and dependency cleanups as well as new unit tests.

The changes
  • Added missing unit tests 9119aa2
  • Added get-methods for information needed for extensions 4525cbe
  • A few missing unit tests in the 'qsar' module 2356a10
  • Added further methods needed for CDK-JChemPaint 804a6f5
  • Added missing JavaDoc. e316210
  • No longer complain about missing testing for abstract classes 284ff84
  • Typo: there → their 640b6e6
  • Added unit testing 05216f0
  • Throw a descriptive exception when 2D coordinates are missing (fixes #3355921) cdc4cbd
  • Fixed the cheminf.bibx well-formedness (fixes #3435367) 7bc0772
  • Added missing @cdk.githash. fdd3d22
  • Updated chi index util to correctly evaluate deltav for sulphurs. Fixes bug 3434741. Added unit test 8225175
  • Use interfaces instead of implementations 434c9b1
  • Use interfaces instead of implementations 76dcdf7
  • Use interfaces instead of implementations 5bef796
  • Use interfaces instead of implementations b6ed6a7
  • Moved the pi-contact descriptor (atom-pair) to qsarmolecular, removing the dependency of qsar on reaction d902312
  • Added a missing dependency; it now finds PDBPolymer f03eb3c
  • Fixed test method names f6562cb
  • Added a missing test a853c91
  • Fixed TestClass annotation d37961d
  • Added tests for the isomorphism module to the proper suite 657c3a7
  • fixed dependency for fingerprint tests 135edeb
  • added test for getSubstructure d1eb951
  • lookup SMARTS at index in Substructurefingerprint 65602db
  • Wrote test for KlekotaRothFingerprinter 2b1288b
  • adapted to CDK 84aed70
  • Import from the source code of PaDEL-descriptor (doi:10.1002/jcc.21707) 291def4
  • Fix to use interfaces as argument instead of classes 5e828f5
  • Perceive atom types also when aromaticity from the SMILES is kept 2e76ff6
  • Added unit test to make sure atom types are also perceived when aromaticity from SMILES is kept b315ee6
The authors

25  Egon Willighagen
  5  Jonathan Alvarsson
  2  Rajarshi Guha
  1  Nina Jeliazkova
  1  Yap Chun Wei

The reviewers

13  Rajarshi Guha
  8  Egon Willighagen
  1  Nina Jeliazkova

Monday, December 19, 2011

A Google+ page for the CDK

A week or two ago I created a Google+ page for the CDK, which can be found here, and which looks like this:

I will use this page to post interesting stories around the CDK. It is not supposed to replace the Planet CDK, which aggregates blog posts from CDK developers and users, but is meant for other, perhaps shorter, posts. For example, as can be seen in the above screenshot, I have started to use it as a CDK Literature replacement, which originally was a series of articles in CDK News (the archives), and was later hosted in my blog (see CDK Literature #5 for links to the other four).

But, I will (re)share anything I find useful for the CDK community.

Friday, December 16, 2011

Open Source cancer research (must see video)

Four years ago I wrote up a passionate post about the importance of ODOSOS, and about a year ago I wrote about how research can be open sourced, and how the Open Source Chemistry Development Kit (CDK) fits in. I am proud that the CDK is enabling so many researchers to do novel research! Google Scholar reports over 200 documents and Web of Science gets to over 150 for just the first CDK paper (see also GS vs WoS). That's serious impact. We're far from being fully comparable with existing commercial tools, which have a 40-year head start. But with important functionality still missing (e.g. E/Z stereochemistry), and about 150 open bug reports which need looking into, we surely could use funding and help!

Anyway, this Monday we had the last of this year's Stockholm Open Science meetings, and Carl Bärstad joined, who is organizing TEDx talks here in Stockholm, and he pointed me to this must-see video on open source cancer research:

I can highly recommend watching it, as it is very insightful about cancer (it's the third mechanism I now know of by which cells remember state: after DNA methylation and microRNAs, now plain, boring protein; I wonder when metabolite-sized molecules show up as cell-division surviving state preservatives; they will).

But it also puts Open Science ideas to practice in drug discovery. Yeah, they even give the structure (and SMILES) of JQ1 (see PubChem):

Now, the cancer they started off with is pancreatic cancer, which is what my mother died of 3 years ago, which provides a third reason for me to love this video!

Update: I have uploaded JQ1 and the charged variant on ChemSpider to Ambit2 as dataset 976496, which means that you can look up toxicity predictions with ToxPredict for JQ1 on this page. This is what ToxPredict looks like, just before I hit Run all:

Wednesday, December 14, 2011

Google Scholar versus Web of Science

Web of Science (WoS) is the de facto standard for citation information. Its citation counts are used for many purposes, among which to decide whether I am a good scientist. Web of Science, however, is really expensive, and Joe the Plumber does not have access. No wonder he doesn't know which scientist to trust (...).

Recently, Google made their Scholar product open to all, allowing you to list your publications (about my list), which Google will augment with citation counts. If you search the web, you'll find much being said about the two, in particular compared with each other. One aspect is the accuracy of the citation counts, as people are afraid of gaming, and of random noise found on the web. Others would (counter)argue that Google captures a wider range of literature.

So, I was wondering how this would reflect on my impact. I know that WoS is not errorless either, and I have been making various support requests over the years (my WoS records still have errors). So, do they complement or overlap? Are citation counts comparable? In fact, they turn out to be:

I would be drooling if I got this kind of regression in my nanoQSAR studies! :) There is a very strong regression, indeed. One of the advantages of Google Scholar is that it does not select an elite group of journals (of course, WoS has to, because their data analysis process involves much more human curation), so that Scholar captures newer Open Access journals, like the J. Cheminformatics, too. While I may be a bit of a non-typical scientist (some even argue I am not even doing science...), the overall outcome is that Google Scholar is actually more accurate about my impact than Web of Science is right now.
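For those who want to repeat this comparison on their own publication list, here is a minimal sketch of how the strength of such a relation could be quantified with a Pearson correlation coefficient. The class name is hypothetical, and the counts in main() are made-up placeholders, not the citation data behind the plot above.

```java
public class CitationCorrelationSketch {

    // Pearson correlation coefficient between two equally long series.
    // Returns NaN if either series is constant (zero variance).
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException(
                "need two series of equal length, at least 2 values each");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Made-up placeholder counts per paper, NOT real citation data.
        double[] wos = {2, 5, 10, 20, 40};
        double[] scholar = {3, 7, 14, 27, 55};
        System.out.printf("r = %.3f%n", pearson(wos, scholar));
    }
}
```

A value close to 1 means the two sources rank and scale the papers in nearly the same way, even if Scholar's absolute counts are higher.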

Thursday, December 08, 2011

Open Science and Non-Commercial licenses (a personal reflection to the Oscar/RSC controversy)

Peter has started a new line of discussion in his blog, referring to a correspondence with representatives from RSC last year, about an annotated literature corpus to (re)train the Oscar3/4 text miner. There are very many sides, and after I reread this post for a second time, I was still not 100% happy about all words: I can only try to express the complexity of the matter and how it started, but do hope to be clear that non-commercial licenses are not useful in Open Science.

I took part in parts of the correspondence Peter refers to, and I would not have written things up the way Peter wrote up his impression of the outcome of that discussion. At some point I seem to no longer have been included in the email correspondence, as I at least did not know the final outcome (see below), so I cannot fully comment on the accuracy of Peter's coverage of that correspondence. But my impression of the outcome, as limited as it was, is not that far away from what Peter wrote up: Oscar4 needs training (doi:10.1186/1758-2946-3-41), and the RSC was unwilling to contribute the full text training corpus to the project without a non-commercial (NC) clause (and I explain below why I think this is bad). Oscar without a training corpus is useless; Oscar with an NC-licensed training corpus is not Open Source (see below). As detailed below, the corpus at sentence level is NC-free licensed, and a lot of training can be done that way. Sufficient?

Peter wrote:

"I pointed out very clearly that CC-NC would mean we couldn’t redistribute the corpus as a training resource (and that this was essential since others would wish to recalibrate OSCAR). Yes, they understood the implications. No they wouldn’t change. They realised the problems it would cause downstream. So we cannot redistribute the corpus with OSCAR3. The science of textmining suffers again."
I do not know if it is factually correct that the RSC would not change (below we read they attempted), or whether the organisation really understood the problems. But, it certainly is a fact that we cannot redistribute Oscar4 as an Open Science project with a NC-licensed clause.

And, I want to add and stress here, that blog posts sometimes are just like press releases: things have the highest impact if written down in a black-and-white fashion; and getting things factually wrong happens to all of us now and then.

One of the outcomes I learned about this week is that the RSC released the corpus in some form without the NC clause. The full text paper corpus retained the NC clause of the CC license, but there is also a version where all sentences are released, and this has a CC license without the NC clause. I think this is not optimal, but I still very much appreciate the gesture the RSC is making here, and would like to kindly thank them for that! And I do want to make this clear too (thanx to Cameron for phrasing it so well in his comment): it is in principle the RSC's freedom to decide what they want to do, and I fully respect that.

Well, with that out of the way, and I wanted to say something about it, having been involved in the discussion, and feeling a bit in between Peter and the RSC here, appreciating both their view points, and having a third one myself, let's focus on this non-commercial clause a bit more.

If we enlarge our scope a bit, away from written material, to Open Science, it is clear that the non-commercial clause is bad. In the Open Source world, organisations like the Debian project clearly state that non-commercial clauses violate basic freedoms. From an Open Standards point of view, this is pretty much the same. The reason is that, whether you like it or not, we live in a commercial world. Society expects us to be commercial, and any serious business is legally required to make making a profit a company goal. Now, this effectively means that any science made available as non-commercial is not Open: you are effectively not giving people the freedom they need to advance science.

In short, a CC license with the NC clause is in fact quite like "yes, we love to be Open, but we are too scared". Now really, I understand this scare. I am a scientist, post-hopping around Europe, not tenured, and not being an experimental scientist, unlikely to become one. Don't tell me about risk and scare of making things Open. Yet, I did, and it paid off (not enough yet; still looking for a fixed academic position, as I already indicated). But in the more than 15 years I have been working now in Open Science, I have yet to find a compelling (or any) argument to back up this fear: the perceived risk of the NC clause has so far not proved any different than a fear of ghosts.

On the other hand, if I had not been involved in Open Science, I would not have worked for the top European institutes I have been working at for the past ten years.

So, what are the arguments for using the NC clause? The fear I understand, but I do not see arguments that support that an NC clause is useful in an Open Science setting.

Saturday, December 03, 2011

CDK-JChemPaint #9: implicit hydrogens and isotopes

Next in this series (after #1, #2, #3, #4, #5, #6, #7, #8), I'll show how to add implicit hydrogens to a drawing. I actually think the BasicAtomGenerator should cover implicit hydrogens, and the ExtendedAtomGenerator anything that requires more CDK modules than just the interfaces, like isotopes. But I discovered too late that implicit hydrogens currently also require the ExtendedAtomGenerator. In fact, there are other things I would like to see changed, but I do not have the resources for that right now. So, you will need the CDK-JChemPaint jar (which is not the JChemPaint code!).

In fact, besides these points, it basically just comes down to replacing the BasicAtomGenerator with the ExtendedAtomGenerator. Except for a bug I found: right now, the extended atom generator requires the AtomNumberGenerator to be loaded as well, and thus we must also turn atom numbering off. I'll fix that in the next release. Therefore, we basically get this code snippet (here's the full code):

// generators make the image elements
List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new AtomNumberGenerator());
generators.add(new ExtendedAtomGenerator());

// the renderer needs to have a toolkit-specific font manager
AtomContainerRenderer renderer =
  new AtomContainerRenderer(generators, new AWTFontManager());

// disable atom number rendering
RendererModel model = renderer.getRenderer2DModel();
model.set(WillDrawAtomNumbers.class, Boolean.FALSE);

As said, this code will be simpler in the next CDK-JChemPaint release. The result looks like:
As you can see by the amount of whitespace around the carbon, the scaling issue has not been resolved yet :(

Drawing isotope information works pretty much in the same way. In fact, we do not even have to change the rendering code, as the ExtendedAtomGenerator automatically adds the isotope information (and no, indeed, not in the expected superscript fashion; so, another thing to fix):

But alas, there are always things to fix. I'm personally not aesthetically pleased with the kerning of just CH4 either.

Thursday, December 01, 2011

CDK-JChemPaint #8: rendering of aromatic rings

CDK can render aromatic rings in two ways: with localized double bonds and with a circle reflecting the delocalized nature of the π electrons. Or, graphically:

The following two code snippets are part of full scripts available from my Groovy-JChemPaint repository, and these two drawings are created with CDK 1.4.6.

To draw aromatic rings with localized double bonds, use this code:

List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new BasicBondGenerator());
generators.add(new BasicAtomGenerator());

However, if you like the right aromatic ring style more, replace the BasicBondGenerator with the RingGenerator, and use this set of IGenerators:

List<IGenerator> generators = new ArrayList<IGenerator>();
generators.add(new BasicSceneGenerator());
generators.add(new RingGenerator());
generators.add(new BasicAtomGenerator());

That's it. Here's the full script.