Thursday, February 24, 2011

Cleaner CDK Code #9: using the LoggingTool

A trick I picked up from Miguel Howard (of Jmol fame) is to not concatenate Strings until really needed. String concatenation allocates new Strings, and thus burdens the garbage collector (yeah, you lucky Fortran users who think twice before allocating memory :). Some time ago, he suggested this trick for the logger. I think it was in Jmol, but the CDK inherited the idea.

Don't concatenate Strings when calling the LoggingTool
Logging in the CDK can be turned off, and in production use, I often do so. Therefore, we can speed things up by concatenating debug message Strings only when debugging is turned on. The CDK LoggingTool can take care of this, so you should pass the Strings as individual parameters to the logger call. Instead of:

    logger.debug("the current atom is " + atom);

you should write:

    logger.debug("the current atom is ", atom);

And this works for any arbitrary number of Strings and objects, because the logging tool debug method has this signature: public void debug(Object object, Object... objects). So, you can also just do:

    logger.debug("The ", i, "th object is connected to the ", j, "th");

Actually, looking at the debug() method's signature, I am not sure why we still have the first Object separately... mmm.
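
To illustrate the idea, here is a minimal sketch of how a varargs logger can defer the concatenation. The LazyLogger, CountingAtom, and LazyLoggerDemo classes are made up for this example and are not the CDK's LoggingTool; only the debug(Object, Object...) shape is taken from the signature quoted above.

```java
import java.util.ArrayList;
import java.util.List;

/** Small demo; all classes in this block are hypothetical. */
class LazyLoggerDemo {
    public static void main(String[] args) {
        CountingAtom atom = new CountingAtom();

        LazyLogger off = new LazyLogger(false);
        off.debug("the current atom is ", atom);
        System.out.println("toString() calls with debugging off: " + atom.toStringCalls);

        LazyLogger on = new LazyLogger(true);
        on.debug("the current atom is ", atom);
        System.out.println("toString() calls with debugging on: " + atom.toStringCalls);
        System.out.println(on.lines());
    }
}

/** Hypothetical stand-in for a logger; NOT the CDK LoggingTool. */
class LazyLogger {
    private final boolean debugEnabled;
    private final List<String> lines = new ArrayList<>();

    LazyLogger(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    /** Same shape as the quoted signature: one object, then varargs. */
    void debug(Object object, Object... objects) {
        if (!debugEnabled) {
            return; // no String is built and no toString() is called
        }
        StringBuilder message = new StringBuilder();
        message.append(object);
        for (Object part : objects) {
            message.append(part);
        }
        lines.add(message.toString());
    }

    List<String> lines() {
        return lines;
    }
}

/** Counts toString() calls so the saving becomes visible. */
class CountingAtom {
    int toStringCalls = 0;

    @Override
    public String toString() {
        toStringCalls++;
        return "C";
    }
}
```

Keep in mind that the varargs call itself still allocates a small array and boxes primitives like i and j; what is saved is the toString() calls and the String building when debugging is off.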

Adding a CDK atom type

A very quick, short post on CDK atom types. These atom types are used by the CDK to decide how many missing hydrogens an atom has, or how many lone pairs, and if the atom can be part of an aromatic ring system. The CDK code basically consists of three parts: the atom type ontology, the atom type perception code, and the rest of the code that uses information from the ontology.

Most of you have already developed a love/hate relationship with the CDKAtomTypeMatcher. This is the class that complains about atom types not being recognized. And let me stress this one more time. When you get such a message, it means that the CDK cannot add hydrogens, it cannot calculate QSAR descriptors, etc., etc. I have had the comment that CDK 1.0 never complained about that. True. It just ignored the fact that it did not know what to do with that atom, and calculated (wrong) properties anyway.

The solution to such a missing atom type warning can be two-fold. First, the input data was wrong, for example, because formal charges were lost at some point. A typical example is a neutral four-coordinated nitrogen. Second, the input data is correct and the CDK is wrong. This doesn't happen too often, but it does.

This last case is the topic of this post. The CDK can be 'wrong' in two ways. Either it doesn't know the atom type (though CDK 1.2 and 1.4 have more atom types than CDK 1.0 ever had), or the perception algorithm makes a mistake. The latter is not uncommon, particularly for some elements, like nitrogen. The cause here is the input data, or better, the lack of input data: missing bond orders, missing explicit hydrogens. Now, cheminformatics can make educated guesses, and so the perception algorithm does. But it depends on its education.

Doing the needed education of the CDK basically looks like this commit. It:
  • adds a new atom type to the ontology,
  • adds code to the CDKAtomTypeMatcher for the perception, and
  • adds unit tests to make sure it does the right thing.
Mind you, the ontology needs to provide the following properties of an atom type:
  1. the element (duh),
  2. the formal charge,
  3. the number of bound neighbors,
  4. the number of double bond equivalents (piBondCount),
  5. the number of lone pairs, and
  6. the hybridization state (sp3, sp1, etc).
The current unit tests focus on the true positives, but I got a report about a false positive this week, which I will add shortly. Also check these posts on atom typing and atom types.
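
As a rough illustration of the six properties listed above, here is a sketch of how one atom type could be written down as a plain Java value class. This is not how the CDK stores its ontology; the AtomTypeSketch class, its field names, and the toy matches() method are assumptions made up for this example, and real perception looks at much more than three values.

```java
/** Small demo; all classes in this block are hypothetical. */
class AtomTypeSketchDemo {
    public static void main(String[] args) {
        // A positively charged, four-coordinate sp3 nitrogen.
        AtomTypeSketch nPlus = new AtomTypeSketch("N", 1, 4, 0, 0, "sp3");

        // The charged nitrogen is recognized ...
        System.out.println(nPlus.matches("N", 1, 4));
        // ... but a neutral four-coordinate nitrogen (lost charge) is not.
        System.out.println(nPlus.matches("N", 0, 4));
    }
}

/** Hypothetical record of the six properties; NOT a CDK class. */
final class AtomTypeSketch {
    final String element;       // 1. the element
    final int formalCharge;     // 2. the formal charge
    final int neighborCount;    // 3. the number of bound neighbors
    final int piBondCount;      // 4. the number of double bond equivalents
    final int lonePairCount;    // 5. the number of lone pairs
    final String hybridization; // 6. the hybridization state

    AtomTypeSketch(String element, int formalCharge, int neighborCount,
                   int piBondCount, int lonePairCount, String hybridization) {
        this.element = element;
        this.formalCharge = formalCharge;
        this.neighborCount = neighborCount;
        this.piBondCount = piBondCount;
        this.lonePairCount = lonePairCount;
        this.hybridization = hybridization;
    }

    /** Toy check on element, charge, and neighbor count only. */
    boolean matches(String element, int formalCharge, int neighborCount) {
        return this.element.equals(element)
            && this.formalCharge == formalCharge
            && this.neighborCount == neighborCount;
    }
}
```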

Monday, February 21, 2011

CKAN and RDF: a nice example of why ontologies matter

For a while now I have been a so-called invited expert to the Linking Open Drug Data (LODD) task force of the W3C's Health Care and Life Sciences Interest Group (HCLSIG). I also participate in the open-science group of the Open Knowledge Foundation (OKF). This would not really be worth blogging about if the two were not being mashed up. Members from both sides are interested in learning how Open (think Is It Open Data?) the open data from the LODD network really is.

In fact, the Open Data definition as outlined in the Panton Principles does not allow for a non-commercial clause, which several LODD data sets are labeled with. Clear copyright and license statements are really important. This applies to source code, but most certainly to data too.

So, in early March there will be a virtual, online hack session to make a start on clarifying the licensing and copyrights of the various LODD data sets. CKAN, the OKF's registry of data sets, sounds like a suitable place to organize things. And they support RDF, making it an even nicer match. Or...

Records, data sets, packages, lists, ...
Well, I am running into a number of problems that need to be solved. The first one is, in fact, rather fundamental. A record in the CKAN catalog is typed as a CatalogRecord. Nothing more, nothing less. However, looking at the database content, each record is like a data set. The GUI confirms that with links like 'Add a dataset'. Then again, following the link, it is described as a 'data package' as well as a 'dataset'. This is confusing.

Yes, it is! Really! Just look at how people are using it.

Take for example the record for LODD (well, there is a second LODD record and I haven't found a way to delete records). But the LODD data is not a single data set, but an aggregation (package?) of data sets. This is basically one of the standing issues with Linked Open Data: license incompatibilities, just like with mixing GPL v2 and v3 in source code. As such, these records, which are basically lists of datasets, typically do not have a license listed, as multiple licenses apply.

There seem to be two solutions: the use of groups and of tags. I opted for a new W3C HCLSIG LODD task force group.

Small issues
As said, I could not find a way yet to remove datasets, so cleaning up the catalog is a bit difficult. You can add comments, but I am not sure if these get read. Earlier, I noted that the GNU FDL license was missing, but that is available from the list now, so I have updated the NMRShiftDB record. The combination of Attribution and Share-Alike for the Creative Commons license is missing from the list though, affecting the ChEMBL record.

Another issue that must be addressed in CKAN is how to deal with redistributions. For example, the above cited ChEMBL data is also available from a SPARQL end point that I run, and this end point currently has a different record. Should those records be merged? I guess not, because there is only room for one maintainer. So, perhaps a CKAN catalog record is not even a dataset, but a data set provider?

Well, this does make a really nice example of what can go wrong if terms are not well-defined, e.g. using an ontology. People do not know how to fill the database, leading to noisy content, limiting the usefulness of the data. I do hope we get to see definitions for what groups and datasets are in the catalog before our hack session in March.

Thursday, February 17, 2011

OPSIN used for a Bioclipse wizard

Sometime in January I added a new New Wizard to Bioclipse for OPSIN, but forgot to blog about it earlier. It will be available in Bioclipse 2.6 later this year (or just use the Hudson build service).

Posted via email from Egon's posterous

Science 3.0 needs a facelift before ...

Science 3.0 needs a facelift before I can switch away from FriendFeed:

It comes down to the fact that the website has way too much information, and uses way too much space for things that do not matter. The interesting content only starts halfway down my screen. By then, I have usually given up on finding the interesting bits.

But even within each thread there is room for improvement. There too, there is abundant whitespace, though that might prove functional when the layout becomes more compact. Also, I would like the photos of the people I follow to be removed (something that is done with a userscript only on FriendFeed too!). Instead, it should be clear where the thread comes from. (Regarding that, I am not sure if any RSS feed you can now list will actually show up in this activity stream yet.)

"Fill in the #Wikipedia Survey and Help Our Community"

Fulfilling a request by Daniel, following Antony's title, and adding a random link to a Wikipedia molecule just to see it show up in Chemical blogspace now that I fixed the InChI detection, I hereby invite every scientist to fill out this questionnaire, surveying how scientists use and do not use Wikipedia to spread knowledge.

As a nice spin-off it gives you a star rating for how well you are into the social science networks:

I got eleven stars, but that should not surprise you.

Monday, February 14, 2011

Things are looking blue... (for #bioclipse)

Last Friday a virus kicked in. I don't know whether it is flu or a cold, but it had already ruined my 5-6 February weekend too. I need a high fever to stay in bed, otherwise I just can't. Worse, when I have an elevated temperature, I cannot think clearly anymore, and I haven't been able to think clearly since about last Wednesday, so I should have seen it coming. Nor can I think clearly today, leading to an incomprehensible question on BioStar, and a general lack of progress in writing and other important things, like planning.

Instead, I procrastinate. This is not necessarily bad, because my favorite procrastination is open source cheminformatics. But, it's not very efficient for getting tenure. I wish I could procrastinate in writing review papers for Nature Foo.

However, it does contribute to the CDK and Bioclipse. And building on the seminal work by Arvid++ on setting up Hudson for Bioclipse (building on := stealing his scripts and configuration files), I applied the genetic algorithm-like approach well known in IT: change everything one by one, and keep the best solutions. And, to my excitement, things are now looking blue for Bioclipse :)


Sunday, February 13, 2011

Bioclipse upgraded to CDK 1.3.8 and CDK-JChemPaint 17: builds broken

Yesterday I did some boring plumbing: I upgraded our GC/MS machines with new hardware. Well, the cheminformatics equivalent of it anyway. I upgraded the Bioclipse bundles in org.openscience.cdk for CDK 1.3.8 and CDK-JChemPaint 17. This is typically a painful process, and now even more so because a lot is changing in how Bioclipse is built, which is with Buckminster on Bioclipse's Hudson server. I am happy that Arvid is doing the digging into Buckminster, as automatic building of Eclipse-RCP tools is anything but straightforward. Each part is described by, seemingly, at least five configuration files. For each plugin we have plugin dependencies, plugins are wrapped into features, where you can get dependencies wrong too, and update sites, which need to be in shape too. And on the Buckminster side there are .rmap and .cquery files pointing to the right git repositories, each of which you can get wrong too.

Well, and I got them wrong. A few things had to happen. First of all, I had to revert the JNI-InChI library to a plugin, as I have no experience with making OSGi bundles out of random jars (there is this bnd tool which exists for that task, but I have never used it before). But the upgrade brought in version 0.7, which should solve the InChI library loading issue, as well as bring InChI to several more platforms. So, the JNI-InChI jar was pulled from the target platform, and the rebuild worked out of the box.

The second thing I did was upgrade the org.openscience.cdk bundles. This repository had three branches, one for pure CDK, now updated to CDK 1.3.8, a CDK-JChemPaint branch updated to version 17, and a bioclipse2.4 branch for Bioclipse 2.x (we now and then patch things to make it work, so that we do not have to wait for them to be applied upstream, with the CDK itself). I thought it would be good to create a separate bioclipse2.6 branch for Bioclipse master, but that was a mistake. This branch information is scattered over the .rmap and .cquery files, and they are not all in one single git repository, and those that are, are not synchronized with the versions on Hudson either. It's normal in prototyping, but not helping me. After a couple of hours fiddling, I got this one compiling too.

Right now, I'm stuck with getting bioclipse.cheminformatics compiling again, and looking for ERRORs in an enormous log file. If only ERROR messages were a bit more descriptive, something like "Hej, you said to look for org.openscience.cdk.sinchi in file, but I cannot find it anywhere.", then I'd have some clue where to look.

That makes me wonder if anyone actually ever wrote up best practices for error messages...

BTW, I also still have another, independent dependency issue to solve: the one for Google's Guava, so that everyone can play with the Bioclipse-Google interaction.

Wednesday, February 09, 2011

Chemical data curation: yes, it is that bad.

The readers of Antony's blog know enough about the problem. And many in the QSAR community know it too (and many others do not). Chemical structure data is noisy. I haven't recently created a new local data set for analysis, so I have not taken time to blog about it much, but the ambiguity in chemical databases is enormous. Just yesterday, Antony and I had a good discussion about tautomers and in particular how things are linked together.

If we are in the field of property prediction, knowing what tautomer to calculate descriptors for is crucial. Not that we actually have easy access to experimental data showing what the important tautomer is for our end-point (predicted property), but at least we can track what tautomer we modeled with. Has anyone ever asked you to add units to experimental values? Like "the temperature was 279 degrees; Celsius or Kelvin??" Well, this is the exact same thing. If your QSAR model training report does not include that information, you are doing it wrong. (/me ducks)

So, why does it in fact matter? It matters simply because calculated properties are different. Backing up to the ChemSpider example in my question about InChIs with the fixed-hydrogen layer, I noted that (like in many other databases) the synonyms seem to include IUPAC names for at least two tautomers. However, while the ChemSpider entry is, in fact, for the tautomer-independent structure (using the InChI mobile hydrogen layer; and keep in mind that the InChI uses only a limited number of heuristic rules for identifying tautomers, so that it does not detect all 40 tautomers of warfarin), the 2D diagram, the 3D model, and the calculated properties reflect only one tautomer.

And calculated properties are exactly the input in QSAR's statistical modeling. It is interesting to realize that the differences in calculated molecular descriptors can range from minimal, or nonexistent, to drastic. Very drastic, in fact. The recent paper by Porter (doi:10.1007/s10822-010-9335-7) shows the 40 warfarin tautomers, and discusses a few properties, such as the pKa. The experimental pKa of warfarin is around 5. Now, the paper reports calculated pKa values for a variety of software products (AMBIT is unfortunately missing). First of all, it shows that the various tools differ, which is to be expected. But that variance is negligible compared to the effect of picking the wrong tautomer. I was impressed by the range of predicted values for the various tautomers. It ranged from about 5 to 12, throughout all tools. That means warfarin is predicted to be anything from mildly acidic (some tools predict pKa's down to 2.5) to very basic! No way your statistical modeling will understand that!

And this is why Open Data is so important in chemistry. So, the next time Joe (Organic) Chemist bitches about computers and cheminformatics, tell him it is his own fault: he should have released his data out in the Open.

Anyway. Tautomerism was a curation issue in the first(!!!) entry I was curating. The sixth had the more well-known problem, I think. I may be blind, but I would say this drug has a stereocenter:

But none of the databases I checked so far (including ChemSpider) defines the stereochemistry! I thought we settled that some decades ago? Stereochemistry of drugs matters. What is going on here? I guess I have to browse some primary literature and access some experimental data today then. If I can afford it.

Porter, W. (2010). Warfarin: history, tautomerism and activity. Journal of Computer-Aided Molecular Design, 24 (6-7), 553-573. DOI: 10.1007/s10822-010-9335-7

Sunday, February 06, 2011

Groovy Cheminformatics...

Update: the fourth edition is out.

Some projects are never finished. Neither is this one, but it is never too late to change how things work, so, taking advantage of publishing-on-demand, here I introduce the release-soon, release-often equivalent of cheminformatics books, my Groovy Cheminformatics with the Chemistry Development Kit book:

With a serious discount for just being the first edition (1.3.8-0), but still counting 72 pages with 75 code examples, this edition marks a personal milestone (and probably not much more than that). There remains much to do, but I promised a release by tomorrow, so here it is. Next releases will contain more code examples, more functionality descriptions, and more literature reviewing where such code is used in science. The plan is to make new editions with each new CDK release, as well as new editions whenever I add a new chapter, section, or just paragraph. But there will not be a Nightly build service anytime soon.

The current table of contents is as follows:

Now, the book content is not open content. However, it contains nothing that is not available by other means. It's just the compilation that makes this book interesting, as well as that I put effort into ensuring the code examples remain working. For that, I ask a minor financial contribution.

Wednesday, February 02, 2011

Accessing Google Spreadsheets from Bioclipse

I am almost done refactoring two-year-old code for the Open Notebook Science Solubility project, which converted its data from the Google Spreadsheet into RDF, for the Beautiful Data book chapter.


Open Chemical Data #2: Dryad

Nico asked recently about the availability of chemical data as RDF. There is not so much, really. Finding large amounts of Open Data in chemistry is in general a problem. Things are slowly changing, however, though it is not very apparent.

About 1.5 years ago I started a FriendFeed room, Open Chemical Data, where RSS feeds of new chemical data are aggregated. It started with the RSS feed from the NMRShiftDB, and later ChemPedia Substances was added (which is expected to go EOL). I was informed, however, about Dryad. It is a website that allows deposition of data published in journals, as CC0. That latter is amazing! And when they announced a new Twitter feed, they also noted they had an RSS feed of new data sets. I asked about feeds specific to chemistry, which prompted Ryan to set up this Yahoo Pipe.

So, Dryad is now part of the Open Chemical Data room (second item):

Integrating CiteULike in LaTeX-based paper writing

Just a quick note. I started using CiteULike as a replacement for JabRef (I hope that it will abstract the backend interface, allowing CiteULike to act as an online store), and picked up an idea from someone to use groups for individual papers I am working on. And I extended my Makefile (Marcus, yeah, I know I should use cmake :) to download the references in the group in BibTeX format:
        @echo "" > book.bib
        @wget -O - >> book.bib

BTW, does anyone have experience with the new biblatex?

Tuesday, February 01, 2011

DIY: Running R from within Bioclipse

Important: this blog post is about a development version of Bioclipse, not about the stable 2.4 series. See also Gilleain's comment for this post.

Arvid has done an excellent job in setting up Hudson for Bioclipse, which is not as trivial as for Maven projects. With the R functionality I blogged about earlier today moved to the bioclipse.statistics repository, things are automatically built when new patches hit the repository. And Hudson doesn't just compile the code, it also exports binaries. The main Bioclipse product builds can be downloaded here (Windows, Linux, OS/X; if you need more, please let us know). And all the additional features get automatically uploaded to individual update sites, very much like Maven projects getting pushed to a Maven repository.

So, here's a Bioclipse-R install guide for the adventurous (i.e. Christian, Rajarshi and Rebecca, perhaps Juuso :). I will describe it from a Debian GNU/Linux perspective.

First, make sure you have R and rJava installed. I strongly recommend the versions from sid/unstable (2.12 and 0.8-8), which are more stable here than those from testing:
$ apt-get install r-base-core/sid r-cran-rjava/sid
Then you download the nightly build of Bioclipse 2.5.x from Hudson, e.g. for my 32bit Linux:
$ cd /tmp
$ wget
$ unzip
$ cd Bioclipse.2.5.0.linux.gtk.x86/
But before we boot Bioclipse, we first need to tell it where to find the R instance, and we need to set two variables for that: R_HOME and java.library.path. The latter is set via the bioclipse.ini file, by adding this line to the end:
The other is an environment variable and can be set with:
$ export R_HOME=/usr/lib/R
Then Bioclipse can be started:
$ ./bioclipse
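
If R is not found, it can help to double-check the two settings described above. The following is a minimal, standalone sketch; the RSettingsCheck class and its describe() helper are made up for this example and are not part of Bioclipse. Note that it reports what the JVM running this little program sees, so java.library.path here reflects your own java invocation and not what bioclipse.ini passes to Bioclipse.

```java
/** Hypothetical helper; NOT part of Bioclipse. */
class RSettingsCheck {

    /** Formats one setting, marking null or empty values as not set. */
    static String describe(String name, String value) {
        boolean missing = (value == null || value.isEmpty());
        return name + " = " + (missing ? "(not set)" : value);
    }

    public static void main(String[] args) {
        // R_HOME is an environment variable, exported in the shell.
        System.out.println(describe("R_HOME", System.getenv("R_HOME")));
        // java.library.path is a JVM system property.
        System.out.println(describe("java.library.path",
            System.getProperty("java.library.path")));
    }
}
```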
The first thing to do is install the R Feature. To do this open the Install New Software... dialog from the Help menu, and add (with the Add button) the Bioclipse Statistics update site on Hudson at this Location (copy/paste the Location URL into the below dialog):

When back in the Available Software dialog, select the new update site, and mark the Bioclipse-R Integration feature to be installed:

After the install is completed, which will require a reboot of Bioclipse (why?!?!), you can open the R console by opening a dialog from the Window -> Show View -> Other... menu:

The R Console will then show up as view, as depicted in my earlier post.

Loading of libraries, like running library("rcdk") or library("pls"), crashes the whole thing if you use R and rJava from Debian testing, and silently fails with those from unstable. It's on my agenda.

Running R from within Bioclipse

One little-known feature of Bioclipse is that you can use it to run R code. This functionality has been around for a while as an experimental feature, and was recently updated by Carl to use rJava. In the next months we will improve passing around data between Bioclipse and the R session, providing a richer chemometrics experience. And, with a bit of luck, it will be part of the upcoming 2.6 release.
