Wednesday, November 23, 2011

Keeping it cool... tracking CPU temperatures on Debian GNU/Linux with a 3.1 kernel

My laptop is getting old (almost 12 months now), and is starting to show the first symptoms of age. It is probably just dust piling up, but I have been experiencing CPU overheating. A week or two ago I found a nice one-liner to keep an eye on the CPU temperature, but a kernel upgrade to 3.1 broke that. Here is a script that works on my Linux laptop:

( cd /sys/class/thermal && while :; do line="$(date):$(cat */temp | cut -c1-2 | awk '{ printf(" %03d", $1) }')"; echo "$line"; sleep 5; done ) | tee LOG
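For readability, the same idea can be sketched as a small shell function. This is only a sketch, not the script I actually run: the function name thermal_line is made up for illustration, and it assumes every */temp file holds an integer in millidegrees Celsius (for example 45000). It divides by 1000 instead of cutting the first two characters, so readings below 10 or at 100 degrees and above should come out right as well.

```shell
#!/bin/sh
# Print one log line: a timestamp followed by each zone's temperature
# in whole degrees Celsius, zero-padded to three digits.
# Assumption: each */temp file contains an integer in millidegrees Celsius.
thermal_line() {
    dir="${1:-/sys/class/thermal}"      # optional argument: directory to scan
    line="$(date):"
    for f in "$dir"/*/temp; do
        [ -r "$f" ] || continue                       # skip unreadable or missing files
        milli="$(cat "$f" 2>/dev/null)" || continue   # some zones refuse to be read
        case "$milli" in
            ''|*[!0-9-]*) continue ;;                 # skip anything that is not an integer
        esac
        line="$line $(printf '%03d' "$((milli / 1000))")"
    done
    echo "$line"
}
```

To log every five seconds as in the one-liner, run: while :; do thermal_line; sleep 5; done | tee LOG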

Wednesday, November 09, 2011

The simplest way to make CDK commits

Every now and then I meet people who show interest in working on the CDK. I reply to them with what is involved, and I rarely hear back from them. I know this is common for most open source projects (see also Community development), and for the CDK this is likely caused by the cumbersome process of getting a full development environment set up. Over the next months, I will make an effort to extend my Groovy Cheminformatics book to include detail after detail on how to do this. But what would also be welcome is a VM (OVF) image that has everything set up well.

Anyway, the road to CDK commit fame nowadays does not require a full-fledged development environment. Instead, we have GitHub. Their web interface makes a lot of things easy, including source code peer review.

But in this post I would like to show how easy it is to fix small things in the CDK, by using the GitHub GUI. Of course, this post can be used for any project hosted on GitHub.

Step 1
Get a free GitHub account. (And log in.)

Step 2
Find a problem in the CDK. Start with something dead easy, like JavaDoc errors. For example, check the Nightly report for OpenJavaDocCheck errors here. These pages will return a lot of errors about missing documentation, but skip those. Pick something really simple, such as a report like this one:

There is no period to end the first sentence: 'Sums up the columns in a 2D int matrix'

JavaDoc gives the first sentence of any JavaDoc comment a special purpose: it serves as a summary. To detect the first sentence, it must properly end with a period.

That patch cannot get any easier. It just requires a missing period to be added.
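To get a feel for what such a report is looking for, here is a rough shell sketch. It is not how OpenJavaDocCheck works; it is just a heuristic with a made-up function name (missing_period) that assumes each comment opens with /** on a line of its own and only inspects the line right after it, so a first sentence that spans several lines will be flagged too.

```shell
#!/bin/sh
# Heuristic: list JavaDoc comments whose first text line does not end in a period.
# Assumption: the comment opener "/**" sits on its own line.
missing_period() {
    awk '
        /^[ \t]*\/\*\*[ \t]*$/ { first = 1; next }   # comment opener: inspect the next line
        first {
            first = 0
            text = $0
            sub(/^[ \t]*\*[ \t]?/, "", text)         # strip the leading " * "
            sub(/[ \t]+$/, "", text)                 # strip trailing whitespace
            if (text != "" && text !~ /\.$/)
                printf "%s:%d: %s\n", FILENAME, FNR, text
        }
    ' "$@"
}
```

Called as missing_period SomeClass.java (a placeholder file name), it prints the file name, line number, and offending first line for each hit.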

Step 3
Identify the source file that contains the error. This has the added value that you automatically learn your way around the directory/folder hierarchy of the CDK project source. The above error refers to this class:


Now, all functional CDK code (that is, everything but the unit test suite) can be found in the source distribution under src/main, but we need the GitHub URL for that, and that is here (note that the linked OpenJavaDocCheck report is for the stable cdk-1.4.x branch, so our GitHub page for the PathTools source is too):

Check this URL carefully, and note where it keeps the branch name, the src/main folder, and the path to the source. That makes finding other source code pages later easier. This particular page looks like:

Step 4
Now, this source code page has (when logged in) an 'Edit this file' icon to the right of the file name line. Click this icon, and GitHub will present you with a basic, in-browser editor:

I already scrolled down a bit, to the line with the missing period from this example. Make the modification, and scroll down to the lower part of the page, and read step 5.

Step 5
With the small fix done, it is time to make the actual commit. Below the editor there is a text field to enter a commit message (important: describe what you did, even if this takes more time than the fix itself! Reason: when browsing commits in changelogs, you only see those messages!):

If you have multiple JavaDoc fixes, put them in one commit. But preferably do not mix them with other fixes, so as to keep both the commit message and the peer review simple. That speeds up the reviewing process, and makes it easier for me and Rajarshi to apply the commit to the main source tree, but more about that in the next steps.

Of course, this online editing can also be used for fixing PMD warnings, as reported by this Nightly report. However, keep in mind that you cannot recompile the code this way, and for code changes, this online approach is discouraged.

When done, press 'Propose File Change' (Rajarshi and I see a 'Commit Changes' button instead). After a new page is opened, the commit has been created, and it is time to inform us of your commit. This is done via a so-called 'pull request', as outlined in the next step.

Step 6
The last step in the process is to send out a pull request. A page to do this is normally the immediate result from hitting that 'Propose File Change' button, and should look something like the following (note that I could not make a screenshot based on the running CDK example, because I have commit rights, and the patch goes directly into the repository; I discovered that in this patch :):

So, while this screenshot is from another GitHub project (Total-Impact is worth checking out), the page should look similar. The top grey bar shows the project name and 'Send a pull request', confirming that this page does what we are expecting. In the blue box a comment is given on where your commit is stored, which is in your own fork of the CDK under your own GitHub account, in a branch called patch-x.

Below that blue box, reference is made to your newly-made commit, and a bit further below are two text fields: a single-line text box for a message 'subject', prefilled with the commit message, and a text box where you can leave a message to accompany the pull request. This message is used to put the pull request in perspective, and can be used to introduce yourself briefly, refer to a set of patches, or whatever. This message will not end up in the git repository. The more requests you make, the smaller this message will get. "Yeah, another JavaDoc fix."

Hit the green 'Send pull request' button, and you're done.

Saturday, November 05, 2011

Online SEURAT workshop: Omics data analysis for Toxicology

Another online meeting announcement (previously on Open Data and LarKC). BTW, DAWG is short for Data Analysis Working Group:

Omics data analysis for Toxicology
Tuesday November 15th 2011, 15:00 CEST, 6:00 PDT

(organized by ToxBank & SEURAT-1 DAWG)

Storing data in the ToxBank data warehouse and sharing it within SEURAT-1 is not the only goal of an omics data integration effort. Combining omics data sets available from the data warehouse will provide knowledge not visible from a single data set. The ToxBank platform aims at making these kinds of omics analyses possible.

We invited Prof. Roland Grafström to give a seminar on his group's recent omics data analysis work in cancer genomics. Prof. Grafström is a partner in the ToxBank project and is associated with the Karolinska Institutet medical university in Sweden, and the governmental research institute VTT in Finland.

The presentation will highlight the interpretation of gene expression data using a combination of bioinformatics tools, including the Ingenuity Pathway Analysis software and the Gene Ontology. Basic concepts and terminology will be covered, including integration of omics data, in vitro to in vivo extrapolations, as well as retrieval and validation of biomarker genes in large data sets. Work aimed at tumor biomarker discovery in head and neck cancer will be presented, but the results will be discussed in the context of the work planned in the SEURAT cluster.

The meeting will be held online and will be organized via GoToMeeting. There are only 25 seats, so registration is important, as we will fill seats on a first-come basis. However, one ‘seat’ can host multiple scientists behind a single computer, if needed.

Please send an email to Egon Willighagen listing your name, SEURAT-1 project, and email address. Details on how to log in will be sent to that address shortly before the meeting.

Wednesday, November 02, 2011

Going to Maastricht to work on Open PHACTS

Some two months ago we decided to go back to the Netherlands, after having lived for more than three years here in Sweden. We have had a great time in our three houses, but feel a need to settle down, closer to family.

A week later Chris Evelo contacted me about a position to work on Open PHACTS. Chris and I only met for the first time in March, when I visited the group during a conference in Maastricht. His bioinformatics group is very much into Open Science and has a good track record in metabolism and transcriptomics analyses (something we do here at KI too), and Open PHACTS is an interesting EU project on the application of semantic web technologies to the life sciences, something I have worked a lot on in the past two years.

In the next two months here at KI, I'll be working hard on finishing my work for the other great EU project, ToxBank, on which I am working now. I personally see clearly how these projects complement each other, but I have no clue whether this can be given shape at an EU level, where there are intentions and consortium agreements :)

And, of course, I got a bit of funding awarded here at KI, which I will use next year for two or three visits back to Stockholm, because there is some low-hanging fruit that remains to be picked.

All in all, I am very much looking forward to my next post-doc position, which is definitely my last. What comes after that, I won't have to care about for at least the next two years :)

Tuesday, November 01, 2011

CDK 1.4.5: the changes, the authors, and the reviewers

CDK 1.4.5 just got uploaded to SourceForge, about a month after 1.4.4, though mere minutes after the 1.4.4 release notes. CDK 1.4.5 is the fifth bug fix release of the 1.4 series and brings another few bug fixes.

The changes include fixes to the JavaDoc generation, now outputting proper citations of PhD theses and books, a fix in the SDFWriter to inherit the IO options from the underlying (used) MDLV2000Writer, restored atom type perception in SMILES parsing if aromaticity is not actively perceived (in line with earlier 1.4.x behavior that unfortunately got broken due to another fix), a fix in the MDLV2000Reader to deal with pseudo atoms with numbers greater than 9 (thanx to John May!), a fix in the sorting of IAtomContainers, and a fix for the elusive bug in the AWTRenderer causing thin bonds (e.g. due to zooming) to become grey.

The changes
  • Use a minimal stroke width in the AWT output (fixes #3295256) 1387b7b
  • Changed how data files are copied: copy those specified in the src/META-INF/*.datafiles (fixes #3430342) 68536e1
  • Perceive atom types also when aromaticity from the SMILES is kept 2413197
  • Added unit test to make sure atom types are also perceived when aromaticity from SMILES is kept d6b8c8e
  • Removed broken link and fixed syntax of @cdk.cites. 22c59fc
  • The SDFWriter now accepts all MDLV2000Writer's IOSettings too (fixes #3392485) aa6ca5f
  • Unit test for bug #3392485: SDFWriter not accepting MDLV2000Writer IO settings ab54ed5
  • Fixed @TestClass annotation to point to the correct class a7cbe48
  • Updated unit test to be independent of atom index and just match the coordinates 8d3eed0
  • Fixed bugged when reading MDL V2000 files. If the atom number of a pseudo atom was greater then 9 it would not be read correctly fd90ed5
  • Depend on standard to, to have access to AtomContainerComparator 76db769
  • Create IMolecule's instead of IAtomContainer's because IMoleculeSet can only contain the former; fixed the exception currently thrown by e.g. MoleculeSetTest 1aae7fa
  • Fixed sorting with null IAtomContainers (based on a suggestion by Mark in the bug report #3093241) 1bf194b
  • Missing unit test for AtomContainerSet.sort(Comparator) 0585bcd
  • Added a unit test to check that molecular descriptors do not throw Exceptions when disconnected structures are passed 853ca50
  • Added support for book and phdthesis reference types 277dd03
The authors
17  Egon Willighagen
 1  John May
The reviewers
10  Rajarshi Guha 

CDK 1.4.4: the changes, the authors, and the reviewers

CDK 1.4.5 just got uploaded to SourceForge, and when I looked up the link to the notes for 1.4.4 I noted I had not released those notes yet. So, here are those first.

CDK 1.4.4 is the fourth bug fix release of the 1.4 series and brings another few bug fixes. The changes include new cobalt atom types, a fix to ensure that all molecular descriptors are properly recognized by the build system, and a fix for the Log4J-based LoggingTool, to properly handle nulls. So, not so many changes, which probably explains why I forgot to blog about it earlier.

The changes
  • Co final abf6b33
  • Fixed adding all descriptors to the qsar-descriptors.set file by using the correct number of chars to skip (@cdk.set length is 8 not 11). a75eece
  • Provide debug info on the classpath in which will be searched 88070ac
  • Check for a null input 5cd07d0
  • Unit test for a NullPointerException in the LoggingTool caused by a null message in an Exception. The error only shows up when debugging is turned on. 1e07adc
  • Split up to test the LoggingTool also with debugging turned on; so far, only with debugging turned of was tested 58bb243
The authors
 7  Egon Willighagen
 1  Gilleain Torrance
The reviewers

3  Rajarshi Guha 
1  Egon Willighagen

Oscar4 paper: text mining in Bioclipse (and everywhere else, of course)

The Oscar4 paper (CC-BY, just like the screenshots of the paper below) has been out for some days already, and now the formatting has finished too:

I spotted a rogue 'http://' in the code example b) in Appendix B:

I'll see what I can do about that, but the API might evolve a bit anyway.

That leaves me to mention that Bioclipse has an Oscar extension (Bioclipse has a lot of functionality nowadays, in fact), and that I blogged several times on Oscar4 when I was working with the other authors on the refactoring last year.