Wednesday, July 28, 2010

CDK 1.2.6: the changes and the authors

Like all releases in the 1.2 series after CDK 1.2.0, release 1.2.6 is a bug fix release. Anyone running a CDK 1.2 version is advised to upgrade. New in this release is the availability of a torrent for the cdk-1.2.6.jar (see BitTorrents for Science). Please find below the changes and the authors that contributed to this release.

The changes
  • Updated the DebugBond unit test too now: new DebugBond() has zero atoms 6ef1fb1
  • Backport patch, to make the patches compile with cdk-1.2.x ce9b1bd
  • Additional patch to reduce atom count on setAtom(null, int) and unit tests for the setAtom(IAtom, int) behavior. 6cf95db
  • Also fix the new NNBond() == 0 atoms for the nonotify module 23d9685
  • Fixed Bond() constructor to create a bond with zero atoms. Also fixed setAtom(IAtom, int) to increase the atom count if a null entry is filled with a non-null IAtom. 916ab96
  • Updated test to assume new Bond() creates a bond with zero atoms 4ac7111
  • Exceptions when clone atomless ISingleElectron and ILonePair too c5d4cd3
  • Unit test for ArrayIndexOutOfBoundsException occurring when trying to clone an IAtomContainer with an IBond with no IAtoms d71c31c
  • Added unit tests for SMILES with failing atom typing, from email on the cdk-devel mailing list June 11 2010 890d0f5
  • Added the N.oxide atom type, for structures like (CH3)N=O bb431e3
  • Fixed reading of SD properties: keep the first line too 3133a18
  • Added missing dependency 14003dc
  • Fixed unit test: surely there is no atom with symbol 0... how long has this been failing?? c888e4f
  • Added a test class to aromaticity of three compounds: the last incorrectly fails a454ab8
  • Also accept N.amide as part of an aromatic ring 3357113
  • Added a test class to repeat atom type perception and test consistency 0bd5b42
  • Unit test fix: the molecules *is* aromatic, as we should assume it is. Fixes a big goof up cd83236
  • Replace special chars where spaces are supposed to occur, fixing the fail of the unit tests every now and then 483c856
  • Improved javadoc generation using a link tag, so that references to java library classes are resolved properly 1f8bb2d
  • Removed use of the proprietary DocCheck utility 1523f66
  • Use the new tests in more situations e8b13b2
  • Introducing PMD test for CDK specific issues: 406930b
  • Added copyright and license header d6b6c65
  • Replaced outdated URL with entry in WikiPedia (fixes #3002741) ad2bd3e
  • Fixed a ClassCastException in a unit test; I messed up (mea culpa) b5fa3dc
  • Fixed NullPointerExceptions for LonePair's and SingleElectron's constructed with the no-argument constructors 5f34897
  • Added missing cloning of single electrons 2d4c122
  • Do not try to clone the atom if it does not exist 9672df0
  • converted uses of indexOf to startsWith/contains 7b9d84e
  • Updated HIN reader to fix bug 2984581 f95c632
  • Added unit test to see if arrays are properly cloned, and that array entries of the original are not overwritten 38d5f8d
  • Unit test that the IAtom[] array is properly cloned, and overwriting entries in the clone does not overwrite entries on the original 3c1b07e
  • Removed duplication of cloning. 216c160
  • Apparently the super.clone() does not clone the pointer to the IAtomContainer[], causing a clone() followed by changing containers in the clone to overwrite the original IAtomContainer[]. Fixed by creating a new array. 4e5d6a1
  • Moved test from the specific class to the abstract tests, as the behavior should be the same for NNMoleculeSet and DebugMoleculeSet too 068fb3b
  • Two more tests for the issue: atom typing works fine; aromaticity detection fails: one ring is detected as aromatic (that with two nitrogens), so that it does not consider the double ring, marking the other ring as non-aromatic 3be2367
  • Fixed taking into account larger ring systems when one ring is in itself already aromatic (fixes #2976054) 891049f
  • Fixed cloning of properties with null values by always using HashMap (fixes #2975800) 2f722f0
  • Added four and six coordinate neutral platinum atom types. 407d793
  • Shortened the SMILES to only contain the aromatic atoms, allowing a foreach loop: replaced for-loop by a foreach-loop, solving also the not testing all atoms in the testAromaticty() test. 6fcc3d0
  • Added InChI, and link to existing pyrrole test, using a different SMILES f088cd6
  • Added tests for two cases of aromatic rings c26ae95
  • Added @cdk.bug annotation, and restricted testing to the bug 3b08f1a
  • Removed try/catch to retain the stacktrace of where the NPE occurs ce11b52
  • Test checking for NPE when cloning with property with null as value 0aa632a
  • Improved JavaDoc: 19a976c
  • Loosened the perception of N.planar3 atom types: the Hueckel system consist of more than one ring, so looking just at the ring to which the atom belongs does not make sense d651c25
  • Added unit tests for atom type perception of more N.planar3 atom types 83423a7
  • Removed unused import e1c03fb
  • Removed last bits of implementation details from the API: now uses List<> instead of ArrayList<> 7727b72
  • Removed output to STDOUT 14e1d12
  • Fixed some spelling errors and added JavaDoc links 677b3f6
  • Synchronized behavior with the MDLV2000Reader (addressing bug #2942196) 2ceef95
  • Ant has a release 1.8 that should be accepted in build.xml 4398cc4
  • The IMapping interface had a class comment which probably was a copy&paste artefact. Changed this. 05c857c
  • Fixed license info .meta file for JavaCC d9e15bb
  • Bumped version to differ from the 1.2.5 release 17a6f08

The authors
The below numbers are based on the number of commits, but keep in mind that some developers, like myself, need more commits for the same number of changed lines.
49  Egon Willighagen
 3  Rajarshi Guha
 2  Stefan Kuhn
 1  Arvid Berg
 1  Mark Rynbeek
 1  maclean

The reviewers
The below list is based on who signed off the patches. Anyone who reviews patches in the patch tracker can basically do this. Ask on cdk-devel how to do this.
41  Rajarshi Guha 
 3  Egon Willighagen 

Saturday, July 24, 2010

An Ubuntu Blue Obelisk meta package

There was some talk recently about Blue Obelisk software available as Ubuntu / Debian packages. This morning I had trouble waking up, so I hacked up a metapackage, so that you can now do:
  sudo add-apt-repository ppa:egonw
  sudo aptitude install blueobelisk
Currently, it installs BODR, Kalzium, Gnome Chemistry Utils, Chemical-MIME, OpenBabel, and the CDK. Ideas, feature requests, patches, etc. are welcome via GitHub.

Tuesday, July 20, 2010

My worst cited paper: supervised self-organizing maps

Amazing! It actually is the paper (doi:10.1021/cg060872y) with the best graphics! Is it really true, then, that scientists do not care about looks?! Or is this just the curse of Closed Access publishing?

Ping me if you have a nice data set with multiple dependent variables and need some statistics advice!

Monday, July 19, 2010

Script logs as HTML+RDFa: mix free text reporting with CSV

Richard (Talis) wrote up a three-step tutorial on how to publish your data. I think I would be more than happy if scientists reached step 1. Related, Ola asked me a while ago if I was interested in using the computing facilities of UPPMAX, and I was. But until this weekend I did not have the time or energy to give it a spin. If you are puzzled how the heck I see those two items as related, read on :)

Two days later, today, I ran my first analysis. Still a test run, but using the CDK to perceive atom types on the first 2.5 GB of PubChem data. The full data set is now 80 GB, and I will start doing this analysis today. You might remember that I already did this two years ago (see Wicked chemistry and unit testing) for a small subset, but only now do I have the power to analyze all compounds. The UPPMAX system I work on has 348 nodes, each with 8 cores. Each core has 3 GB of memory, but I am using the IteratingPCCompoundXMLReader class anyway. Analyzing the 2.5 GB of data was done using 50 nodes, and finished in about a minute. Nice :)

Now, this first run dumped the results as a plain text file, looking like:

CID 200234: Ti 1
CID 200235: Ti 1
CID 200237: Sb 1 Sb 2
CID 200365: S 1
CID 200761: Hg 1
CID 201374: Ce 1 Ce 2
CID 201395: As 1 

Simple and effective.

Or? And this is where the two items outlined in the first paragraph meet. No, this is not useful. Since the output is from an analysis of PubChem, I'm sure you already figured out that the first two columns indicate the compound being analyzed. You might also work out that then the elements are given for which the atom type perception failed. You may even figure out that the number is likely to be the index in the connection table representation of the molecule. Right?

But what about machine readability? I could, of course, write the output as CSV, but then I would lose my ability to write the report in human readable format. And moreover, the list of failing atom types does not have a fixed length, as you can see in the example lines given earlier.
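To make the variable-length problem concrete, here is a minimal sketch of what parsing the plain text lines involves. It is written in Python purely for illustration; the names `Problem` and `parse_line` are made up for this example and are not part of the CDK or of the script used for the run.

```python
import re
from typing import NamedTuple


class Problem(NamedTuple):
    """One atom for which atom type perception failed (hypothetical name)."""
    element: str
    index: int


# A log line looks like "CID 200237: Sb 1 Sb 2"
LINE = re.compile(r"^CID (\d+):(.*)$")


def parse_line(line):
    """Return (cid, [Problem, ...]) for one line of the plain text log."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a log line: {line!r}")
    cid = int(match.group(1))
    # The rest is a variable number of "element index" pairs
    tokens = match.group(2).split()
    if len(tokens) % 2 != 0:
        raise ValueError(f"dangling token in: {line!r}")
    problems = [
        Problem(tokens[i], int(tokens[i + 1]))
        for i in range(0, len(tokens), 2)
    ]
    return cid, problems
```

Every consumer of the log would need to know these conventions (pairs of element and index, separated by whitespace), which is exactly the knowledge that is not written down anywhere in the file itself.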

Now, this is where RDF comes in. If I create my output as HTML+RDFa, I can do fancy stuff. My results page could link directly to PubChem, so that I can inspect the actual compound, though I could do that with plain HTML too. But with RDFa, I can actually make my free text log output machine readable. I can accurately annotate what bits are informative:

<div about="#200234" typeof="um:Compound">CID
  <span property="um:cid" datatype="xsd:integer">200234</span>:
  <span rel='um:hasProblem'>
  <span about='#error0' typeof='um:Problem'>
    <span property='um:hasElement'>Ti</span>
    <span property='um:hasIndex' datatype='xsd:integer'>1</span>
  </span>
  </span>
</div>

The file is not backed up by an OWL ontology, but where possible one would do that. Reuse of ontologies is a good thing (e.g. use a service like Schemapedia).
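Writing such markup from a script is not much more work than printing plain text. The following is a rough sketch in Python, assuming the results are available as a compound identifier plus a list of (element, index) pairs. The function name `compound_to_rdfa` and the running `#errorN` numbering are assumptions made for this example; only the `um:` property names are taken from the fragment above.

```python
from html import escape


def compound_to_rdfa(cid, problems, first_error_id=0):
    """Render one compound and its failing atoms as an HTML+RDFa fragment.

    problems is a list of (element, index) pairs; first_error_id is the
    number used for the first '#errorN' subject (hypothetical scheme).
    """
    cid = int(cid)
    parts = [
        f'<div about="#{cid}" typeof="um:Compound">CID',
        f'  <span property="um:cid" datatype="xsd:integer">{cid}</span>:',
    ]
    for offset, (element, index) in enumerate(problems):
        error_id = first_error_id + offset
        parts.append('  <span rel="um:hasProblem">')
        parts.append(f'    <span about="#error{error_id}" typeof="um:Problem">')
        parts.append(
            f'      <span property="um:hasElement">{escape(element)}</span>'
        )
        parts.append(
            '      <span property="um:hasIndex" datatype="xsd:integer">'
            f'{int(index)}</span>'
        )
        parts.append('    </span>')
        parts.append('  </span>')
    parts.append('</div>')
    return "\n".join(parts)
```

Calling `compound_to_rdfa(200237, [("Sb", 1), ("Sb", 2)])` gives one `div` with two nested problem descriptions, so the variable number of failing atoms per compound is no longer an issue. The surrounding page still needs to declare the `um:` and `xsd:` prefixes.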

Now, I can easily open up this file in a web browser (follow this link) and get the same view as above. But I can also import the file directly into Bioclipse (see Semantic Web features in Bioclipse 2.2), or in any other tool that supports RDFa. I can then use SPARQL to do some first analysis, for example, with:

PREFIX um: <>

SELECT ?elem (count(*) AS ?count) WHERE {
  ?compound um:cid ?cid;
     um:hasProblem ?problem .
  ?problem um:hasElement ?elem .
} GROUP BY ?elem ORDER BY ?elem
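
As a cross-check on what this query computes, the same per-element tally can be sketched in a few lines of Python over already-parsed records. The record layout (a compound identifier plus a list of (element, index) pairs) and the function name are assumptions made for this illustration.

```python
from collections import Counter


def count_problems_by_element(records):
    """Count failing atoms per element, sorted by element symbol.

    records is an iterable of (cid, [(element, index), ...]) tuples.
    """
    counts = Counter()
    for _cid, problems in records:
        for element, _index in problems:
            counts[element] += 1
    return sorted(counts.items())
```

The difference is that the SPARQL version needs no custom parsing code at all: it runs directly against the annotated log.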

Combine that with the RDFaDev tool I wrote about last week (see RDFaDev: HTML+RDFa development with FireFox). Now you should get some feeling for the advantages of using Open Standards: you can do some initial analysis of the results right there in the web browser you have open anyway.

Therefore, next time you ask your data analyst to perform some calculation, insist that he sends you HTML+RDFa log files with results. Better, ask him to put it online, and you immediately reach Step 3 in the analysis by David.

Sunday, July 18, 2010

Amazon, the Kindle edition is more expensive than the paperback??

I am writing some more educational material on cheminformatics, and wanted to link to some of the handbooks already around. I need the book details in BibTeX format, so CiteULike is my primary tool to create such content. One of those books is An Introduction to Chemoinformatics by Leach and Gillet. I looked up the ISBN number on Amazon, and then I noted something weird:

So, the electronic copy is actually more expensive than the paperback?! Is this an artifact or a pattern? No way you can get the investment for the Kindle itself back then... :(

Saturday, July 17, 2010

Setting up a local Semantic MediaWiki with RDFIO and SPARQL support

My now former student Samuel is doing a really cool Google Summer of Code project on import and export of RDF from a Semantic MediaWiki server. His screencast of today shows very nicely where things are going!

So, time to get something installed on my Ubuntu system, based on Samuel's description here:
  1. sudo aptitude install mediawiki-extensions php-apc imagemagick texlive-latex-base gs-gpl cjk-latex
  2. configure MediaWiki
    1. zcat /usr/share/doc/mediawiki/README.Debian.gz | more
    2. follow instructions on
    3. create a MySQL account as described here
    4. do config stuff at http://localhost/mediawiki/config/index.php
  3. set up a safe place for local checkouts for MediaWiki extensions
    1. sudo mkdir -p /usr/local/lib/mediawiki/extensions
    2. sudo chown egonw:egonw -R /usr/local/lib/mediawiki
  4. configure Semantic MediaWiki
    1. cd /usr/local/lib/mediawiki/extensions
    2. svn checkout
    3. cd /var/lib/mediawiki/extensions
    4. sudo ln -s /usr/local/lib/mediawiki/extensions/SemanticMediaWiki
    5. sudo nano /var/lib/mediawiki/LocalSettings.php
      1. add: include_once("/var/lib/mediawiki/extensions/SemanticMediaWiki/SemanticMediaWiki.php");
      2. add: enableSemantics('localhost');
      3. add: $smwgShowFactbox= SMW_FACTBOX_NONEMPTY;
    6. visit http://localhost/mediawiki/index.php/Special:SMWAdmin
The next step is to install Samuel's extension and its requirements:
  1. set up the required extensions SMWWriter and POM:
    1. cd /usr/local/lib/mediawiki/extensions
    2. svn checkout
    3. svn checkout
    4. cd /var/lib/mediawiki/extensions
    5. sudo ln -s /usr/local/lib/mediawiki/extensions/SMWWriter
    6. sudo ln -s /usr/local/lib/mediawiki/extensions/PageObjectModel/
  2. apply the patch (where is it filed upstream?)
    1. cd SMWWriter
    2. wget
    3. patch -p0 < smwwriter-fixesfor-rdfio-20100716-r2.patch
  3. set up the RDFIO work from Samuel
    1. cd /usr/local/lib/mediawiki/extensions
    2. mkdir RDFIO
    3. cd RDFIO
    4. svn checkout .
    5. sudo nano /var/lib/mediawiki/LocalSettings.php
      1. and add the RDFIO blob given on Samuel's Install page
  4. install ARC2
    1. cd /usr/local/lib/mediawiki/extensions/SemanticMediaWiki/libs
    2. wget
    3. tar zxvf arc.tar.gz
  5. install the RDFIO pages onto the main page for easy access, add:
    1. [[Special:ARC2Admin|ARC2Admin]]
    2. [[Special:RDFImport|RDFImport]]
    3. [[Special:SPARQLEndpoint|SPARQLEndpoint]]
That's it. Not quite the 5 minutes that Samuel promised me, but I'm happy to have this available for my conference tour next month! I installed the default RDF, and got the nice default wiki page to which I can now start adding manual annotation:

If you are wondering about the use case, this RDF import is ideal for building up knowledge bases, as detailed in my Critical mass for Open Notebook Science wikis by prepopulation with RDF data post last month. Just aggregate the info on the web you can find (yes, that's another story), put it in your wiki and complement it with your local knowledge, import into Bioclipse, and run your analyses to verify your hypotheses!

Thanx to Samuel for this really great work!

Friday, July 16, 2010

RDFaDev: HTML+RDFa development with FireFox

Celso informed me in this old post about an alternative to Operator for RDFa handling in browsers, or Firefox in this case: the RDFaDev add-on. It works quite well, extracts the RDFa, reports common problems, and even allows running SPARQL directly on the web page, all from within a browser pop up window:

A new CDK default fingerprinter?

The current default fingerprinter in the CDK depends on aromaticity, but that concept is algorithmically difficult to define, and even experimentally there are multiple dimensions to this concept. Moreover, calculating aromaticity is not cheap, as it requires detecting ring systems. The reason aromaticity is included at all is this: people expect an ethenol moiety to match phenol.

Now, an alternative is to not use aromaticity, but hybridization information instead: an aromatic bond is basically just a bond between two sp2-hybridized atoms. This removes some algorithmic complexity and speeds up the calculation.
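
To illustrate the idea only, here is a toy sketch in Python of how a path through a molecule could be written down when the bond symbol depends on hybridization rather than on perceived aromaticity. This is not the CDK code: the function names, the ":" symbol for sp2-sp2 bonds, and the hand-assigned hybridization values (including sp3 for the hydroxyl oxygen) are all assumptions made for this example.

```python
# Symbols for bonds that are not between two sp2 atoms, keyed by bond order
ORDER_SYMBOLS = {1: "-", 2: "=", 3: "#"}


def bond_label(hyb_a, hyb_b, order):
    """Return ':' for any bond between two sp2 atoms, else the order symbol."""
    if hyb_a == "sp2" and hyb_b == "sp2":
        return ":"
    return ORDER_SYMBOLS[order]


def path_string(atoms, orders):
    """Write a linear path as text, e.g. 'C:C-O'.

    atoms is a list of (symbol, hybridization) pairs; orders[i] is the
    bond order between atoms[i] and atoms[i + 1].
    """
    if len(orders) != len(atoms) - 1:
        raise ValueError("need exactly one bond between each pair of atoms")
    text = atoms[0][0]
    for i, order in enumerate(orders):
        label = bond_label(atoms[i][1], atoms[i + 1][1], order)
        text += label + atoms[i + 1][0]
    return text


# The C=C-O path of ethenol
ethenol = path_string([("C", "sp2"), ("C", "sp2"), ("O", "sp3")], [2, 1])
# A ring C-C-O path of phenol in a Kekule form with the single bond here
phenol = path_string([("C", "sp2"), ("C", "sp2"), ("O", "sp3")], [1, 1])
```

Both paths come out as the same string, so no ring detection is needed for the two to match. The price is that any single bond between two sp2 atoms, such as the central bond in butadiene, gets the same label as a ring bond in benzene.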

The definition of the fingerprint has changed, and a bond between two sp2-hybridized atoms may not be aromatic. We can therefore expect that the fingerprint will give more false positives with substructure search. I'm hoping that Rajarshi can find some time to include this new fingerprint in the excellent analysis he did some time ago.

The source code can be found in my GitHub repository, with the new class HybridOnlyFingerprinter.

Thursday, July 15, 2010

Cb: New Blogs #13

The Cb software is still holding... I jettisoned the old post cache, which sped up the processing of blogs considerably, but the system just doesn't scale right. Yet, Euan has done a great job, and the Cb site has now been online for some three years! Here are some new blogs included in the aggregation and analysis:
Happy reading!

BTW, some WordPress feeds are weird, causing the blog post titles to not show up properly in Cb. I'll investigate this soon.