Wednesday, November 25, 2009

SWAT4LS: wrapping up #1

It's already been five days since the SWAT4LS meeting (matching blog), and I finally got around to writing up my personal summary. I very much enjoyed the Blue Obelisk dinner on Thursday evening with Nico, Duncan and Miguel (the CDK one).

The SWAT4LS was fun, interesting, perhaps too short, but very much appreciated! Thanx to all organizers! During the day various people tweeted the meeting, using the #swat4ls2009 hashtag (forwarded to a FriendFeed room), while Nico covered things in various blog posts which I'll link to below where appropriate. Summaries I have seen so far are from Nico and Duncan (again :), and the organizers.

The day kicked off with a presentation by Alan Ruttenberg (Nico's coverage). It nicely demonstrated where the semantic web for life sciences is going to. Particularly interesting was the integration of SPARQL with Jmol in ImmPort/JmolViz: it uses Jmol to visualize a PDB entry, while using SPARQL to retrieve atomic and residue annotation, using Jmol script (we have to thank another Miguel (the Jmol one) for taking the scripting and visualization capabilities to the next level in 2002). It always makes me proud to see one of the projects I have worked on hit a prominent place in keynote talks at conferences :)

Alan also clarified that CC0 is not a license, but a statement about the public domain nature of data; there is nothing to accept, nothing to live up to. The important thing, and I am sure most of my readers are well aware of that, is that it formalizes the public domain concept by wrapping it in a full CC0 statement. My recommendation to all who want to make (chemical) data available as public domain: use the CC0, just because the CC0 works in any country, and it will make a lot of your users very happy. If you cannot claim CC0 because you are not really the owner (as I have seen done), then do not claim the data to be public domain either (which was done)!

There was also note of the Amino Acid Ontology, which comes closer to our group's proteochemometrics work, but I have yet to look if this can be used for or linked to protein descriptors. Also interesting is the idea behind RDFHerd, a project aiming to distribute RDF data sets as installable packages. If I understood correctly, only Virtuoso is supported so far, but this thing can fly, particularly if these packages are easily converted into Debian packages.

More wrapping up will follow, but I have other business to attend to first.

Friday, November 20, 2009

Linking two Virtuoso instances to one Apache server

Virtuoso comes with its own web front end, but I did not want to make that public. Additionally, I actually have two instances running, one for the GNU FDL licensed NMRShiftDB data, and one for the CC0 ChemPedia and Solubility data sets.

So, I used Apache's proxy module, linking to two Virtuoso instances. These two are set up by just duplicating the database folder, so that each copy has its own virtuoso.ini config file. Modify one of the two config files so that the instances run on different ports in the Parameters section, for example 1198 and 1199:
ServerPort                      = 1199
And assign different server ports in the HTTPServer section, such as 2290 and 2291:
ServerPort                      = 2291
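The two port changes can also be scripted. What follows is only a minimal sketch, assuming the virtuoso.ini file parses cleanly with Python's standard configparser module; the helper name set_virtuoso_ports and the two-section sample are made up for illustration, while the section names, key name and port numbers are those mentioned above. Note that configparser drops comments when writing the file back, so this is best applied to a copy.

```python
import configparser
import io


def set_virtuoso_ports(ini_text, sql_port, http_port):
    """Return the ini text with both ServerPort values replaced.

    Hypothetical helper: it only touches the ServerPort key in the
    Parameters and HTTPServer sections and leaves other keys alone.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the capitalization of the keys
    parser.read_string(ini_text)
    parser["Parameters"]["ServerPort"] = str(sql_port)
    parser["HTTPServer"]["ServerPort"] = str(http_port)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Stand-in for the config of the first instance (not a full virtuoso.ini).
SAMPLE_INI = """\
[Parameters]
ServerPort = 1198

[HTTPServer]
ServerPort = 2290
"""

second_instance_ini = set_virtuoso_ports(SAMPLE_INI, 1199, 2291)
print(second_instance_ini)
```

Writing the returned text into the duplicated folder gives the second instance its own pair of ports.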
Then modify the /etc/apache2/mods-enabled/proxy.conf (or whatever equivalent on your system) to have two sections creating two URL rewrites proxying the request to the virtuoso server:
<Proxy /nmrshiftdb/sparql>
  RewriteEngine On
  Allow from all
  ProxyPass        http://localhost:2290/sparql
  ProxyPassReverse http://localhost:2290/sparql
</Proxy>

<Proxy /cc0/sparql>
  RewriteEngine On
  Allow from all
  ProxyPass        http://localhost:2291/sparql
  ProxyPassReverse http://localhost:2291/sparql
</Proxy>

Thursday, November 19, 2009

ChemPedia RDF #1: the SPARQL end point

Well, you might spot a pattern here; yes, another chemical SPARQL end point (actually, it shares the end point with the Solubility data). This time around Rich's ChemPedia. Taking advantage of the CC0-licensed downloads, I have created a small Groovy script (using this JSON library) to convert the ChemPedia JSON into Notation3:
import net.sf.json.groovy.JsonSlurper;

input = new File("substances.json")
json = new JsonSlurper().parse(input);

println "@prefix dc: <> .";
println "@prefix cp: <> .";
json.each { it ->
  println "<" + it.uri + "> dc:identifier \"" + it.gsid + "\";";
  println " <> <" + it.inchi + ">;";
  println "  <> \"" + it.inchi + "\".";
  if (it.namings.size() > 0) {
    for (int i = 0; i<it.namings.size(); i++) {
      naming = it.namings.get(i);
      namingURI = it.uri + "/naming" + i;
      println "<" + it.uri + "> cp:hasNaming " +
        "<" + namingURI + ">.";
      println "<" + namingURI + "> a cp:Naming;";
      println "  cp:hasName \"" + + "\";";
      println "  cp:hasStatus \"" + naming.status + "\";";
      println "  cp:hasScore \"" + naming.score + "\".";
    }
  }
}
After uploading it into Virtuoso (now using DB.DBA.TTLP instead of DB.DBA.RDF_LOAD_RDFXML_MT), we can have our regular SPARQL fun with the data from ChemPedia. For example, list the 10 names with the most votes:
prefix dc: <>
prefix cp: <>

select distinct ?name ?score where {
  ?s a cp:Naming ;
     cp:hasName ?name ;
     cp:hasScore ?score .
} ORDER BY DESC(?score) LIMIT 10 

Open Notebook Science Solubility: the SPARQL end point

The Open Notebook Science Solubility challenge is a project crowd sourcing the solubility of organic compounds in non-aqueous solvents. I have been working on RDF-ing this data, and this resulted in a joint chapter in the nice Beautiful Data book.

What I had not done so far, is set up a SPARQL end point for this data, like I did for the NMRShiftDB data.

Now, however, a Virtuoso-powered SPARQL end point is available, and I hope this will soon get picked up by the other nodes on the ONS Solubility project. It is not an auto-synchronized link, though.

Possible advantages include that the client can perform any query and get these results in various formats, including JSON. For example, follow this link to get all solutes in JSON format.

The matching SPARQL looks like:
prefix dc: <>
prefix ons: <>

select distinct ?s ?title where {
  ?s a ons:Solute ;
     dc:title ?title .
}
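
A client does not need a browser to get at the JSON. Below is a minimal sketch, assuming the end point accepts the standard query parameter plus a format parameter carrying the result MIME type, as Virtuoso end points do. The end point URL and the ons prefix URI are placeholders made up for this example, and the dc prefix is assumed to be the Dublin Core elements namespace; the function names are hypothetical too.

```python
import json
from urllib.parse import urlencode

# Placeholder address; replace with the real SPARQL end point.
ENDPOINT = "http://example.org/cc0/sparql"

# The prefix URIs are assumptions: dc is taken to be Dublin Core,
# and the ons namespace is a made-up stand-in.
QUERY = """
prefix dc: <http://purl.org/dc/elements/1.1/>
prefix ons: <http://example.org/ons#>

select distinct ?s ?title where {
  ?s a ons:Solute ;
     dc:title ?title .
}
"""


def sparql_json_url(endpoint, query):
    """Build a GET URL that asks the end point for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def titles_from_results(json_text):
    """Pull the ?title values out of a SPARQL JSON results document."""
    data = json.loads(json_text)
    return [row["title"]["value"] for row in data["results"]["bindings"]]


print(sparql_json_url(ENDPOINT, QUERY))
```

The printed URL can be fetched with any HTTP client, after which titles_from_results() turns the response body into a plain list of solute names.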

Wednesday, November 18, 2009

CDK 1.2.4: the authors

The CDK 1.2.4 changelog I posted earlier was directly created from git output. Git has many features which make such things simple. Here's a list of authors of the 1.2.4 change set:
56 Egon Willighagen
9 Rajarshi  Guha
5 Stefan Kuhn
2 mark_rynbeek
1 Uli Köhler
1 Rajarshi Guha
1 Peter Odéus
1 Paul Turner
1 Miguel Rojas Cherto
1 Arvid Berg
This is just the number of commits, and many of mine are logistic in nature. You can also notice that Rajarshi has changed his name (removed the extraneous space :). Thanx to all authors for contributing to this release! I am happy to see a few new names in this list, which seems to indicate that people are settling in on the whole move from Subversion to Git.

This list was created with this command adapted from this StackOverflow question:
git log --pretty=format:%an cdk-1.2.3..cdk-1.2.4 | awk -- '{ ++c[$0]; } END { for(cc in c) printf "%5d %s\n",c[cc],cc; }' | sort -n -r
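
For those who find awk hard to read, the counting step can be sketched in Python as well. This is just an illustration of the same idea, not the command used for the list above; the function name count_authors and the sample input are made up, and it expects one author name per line, as printed by the git log part of the pipeline.

```python
from collections import Counter


def count_authors(log_lines):
    """Count how often each author name occurs, most commits first."""
    counts = Counter(line.strip() for line in log_lines if line.strip())
    # Sort by descending commit count, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up sample input: one author name per commit.
sample = ["Egon Willighagen", "Stefan Kuhn", "Egon Willighagen", "Arvid Berg"]

for name, commits in count_authors(sample):
    print("%5d %s" % (commits, name))
```

Like the awk version, it compares whole lines, so two spellings of one name are counted separately. Reading the lines from sys.stdin instead of the sample list would allow piping the git log output straight into it.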

CDK 1.2.4: the changes

Here is the changelog of CDK 1.2.4 which I am about to upload to SourceForge:
  • Fixed param name 743bad3
  • Updated the makefp3d target to work with the current build system bbb78ee
  • Set up a branch for the 1.2.4 release 4801d79
  • Fixes bug 2898399. Updates to the SMARTS parser to handle proper matching for explicit hydrogens (including H, 1H, 2H and 3H). SMARTSQueryVisitor updated to take into account different isotopes of H. Also updated unit tests to take into account proper H matching. Added a unit test to further check H matching. b67d76a
  • Added tests to match hydrogens 45a7f54
  • Reworked the tests for bug 2898032. Updated Javadocs for smiles generator 7f68b07
  • Added unit test to confirm and check for bug 2898032 924b563
  • Updated UIT to handle single atom queries and added a unit test for bug 2888845. Also updated Javadocs to specifically note behavior of single atom queries dfb2805
  • Added generation of java source jars e33fba2
  • Fixed matchers to allow XML without new lines (closes #2832835) f9a0552
  • Added unit tests for detection of PubChem XML files. 571f434
  • Overwrite unit tests, because there are no change events passed around at all for the NoNotification interface implementations 36f295b
  • Added missing unit tests for IChemModel event propagation for the ICrystal field 2993e0c
  • Fixed propagation of change events to IChemModel when modifications are made in child IChemObjects 0c8a88f
  • Fixed unit tests: the IChemModel.setFoo(null) should actually give a change event on the listener of the IChemModel, and not after unregistering of the Foo object. b833176
  • Added unit test to the function of the new IO setting to force 2D coordinate output. 4e2b2bf
  • Added writer IO option to force writing of 2D coordinates if 3D coordinates are present too, which now are preferably outputted. 0e6aa2c
  • Added unit test to verify that if 2D and 3D coordinates are available, the 3D coordinates are outputted. 56852f8
  • Fixed Taglets: only return HTML if the Tag is really given; the toString() method is given for all cases, not just when the tag is found 1107fb2
  • Fixeda bug which was causing various parts of the DescriptorEngine to fail - it was trying to instantiate a non-descriptor class which happens to reside in the descriptor package directory. This fix is a bit kludgy - ideally only descriptors should be in that directory 0242d9a
  • Fixes ClassCastException when not IMolecule 6f3e848
  • Upgraded to PMD 2.4.5 with many bug fixes, giving more accurate error reports f29a66b
  • Added missing dependency on cdk-diff, being used in one of the unit tests 0e287dd
  • Fixed methods names to match those in the test class 789a314
  • Fixed test method name to match the expected patters, fixing a coverage test fail ac13619
  • Removed duplicate code: MolecularFormulaTest now extends AbstractMolecularFormulaTest b8651c7
  • Fixed test method annotation to point to the right method bb7d341
  • Added missing @TestMethod annotation f6f759b
  • Added modules that were missing from the PMD testing 073e5ec
  • Added modules that were missing from the doccheck testing 10dc19c
  • Patch for bug 2843445. Aims to fix generation of NaN coordinates by SDG d1397fe
  • Fix the unit test to not give a 'input must support mark' exception on some platforms, by wrapping the InputStream in a BufferedInputStream. 6f6f41e
  • Added missing dependencies 8759481
  • Added ioformats to modules to test 56289e2
  • Use StringBuilder to aggregate the field data, which gives an huge performance boost for SD file where multiline field data is found. df35f02
  • Use StringBuilder to aggregate the field data, which gives an huge performance boost for SD file where very much field data, like the ChEBI_complete.sdf eac8266
  • Factored out steps in reading the SD file data block 678e7ca
  • Bumped version, to make it clear this is not the 1.2.3 release 8c8166a
  • Fixed registering on the cdk.threadnonsage tag (closes #2796362) d451576
  • Removed obsolete pattern from old svnrev tag c8f5a72
  • Fixed JavaDoc to remove traces of the old svnrev Tag 1a70488
  • Synchronized exception message with implementation (fixes #2844333) c70b79c
  • The Pauling Electronegativity is copied in configure as well. I can't see why not copy everything we have. 3fd2b17
  • Added bug annotation 38d0235
  • test case for bug #2846213 f84c53b
  • Fixed perception of N.planar3 where N.sp2 was detected, by now taking into account the given hydrogen count. 1714de2
  • Fixed perception of benzene with all single bond, but hydrogen count 1 and bonds flagged aromatic. In this case, the type is C.sp2 not C.sp3. 05e0be3
  • Added assertions to unit test for values being not null 863b0a5
  • Added two unit tests for the same problem: carbon atom types are not correctly perceived if bond order info is SINGLE only, and hydrogen count and aromaticity flag is set. f19a451
  • Moved class into a org.openscience.cdk package, which seems to work now. I'm puzzled why it did not before. Solved several unit test fails. b055c6b
  • Merge branch 'cdk-1.2.x' of ssh:// into cdk-1.2.x f77db9c
  • Unsealed the XOM jar to allow having the CustomSerializer 3b82340
  • Fixed Javadocs error e0304bf
  • Fixed a wrong javadoc tag. Also removed svn tag in the SMARTS parser JJT file, replaced with git tag c888773
  • Added support for 'public enum's 4bf822d
  • corrected bug in bondtools.isStereo(IAtomContainer container, IAtom stereoAtom). A comparision of atom symbols in a nested loop was using the counter of the outer loop twice. Note it worked before, because there is a sort of fallback to Morgan numbers. fallback to morgan (fixes #2830287) 025fb47
  • added a new test for bondtools 13f72bd
  • Fixed inconsistency between accepts() and write: also support writing of IAtomContainerSet and IAtomContainer as accepts() indicates (fixes #2827745) 6380578
  • General test for testing consistency between write() and accepts(), testing that all accepted IChemObject's can also be written f0678eb
  • Added unit test for bug #2826961: inconsistent atom typing for two SMILES. Unit test does not show a fail, ruling out a CDK bug 42e45ef
  • Remove erroneous throws statement f8cfea8
  • Bug found calculating the exact mass given a molecular formula when it is negative charged. 3d1de45
  • Fixed reading of the cdk/dict/data/elements.owl database which is now in OWL 73225a0
  • Fixed issue 2458210: use assertNotNull(foo) etc instead of assertTrue(foo != null). 182afe6
  • Added minimum equivalents for BondManipulator.getMaximumBondOrder() methods 6e12696
  • Fixes asserts: after removal *no* change should be recorded 3b9fa30
  • Added IO option to disable generator of XML declaration statements in the output CML. 74451b8
  • Added generics, and consistified code by always returning a List of the same '?'. (And some 80 chars fixes in the JavaDocs.) d6337cd
  • Added unit tests to test that when a [Molecule|Reaction|Ring]Set has been removed from a ChemModel, the ChemModel should unregister as listener. 63e6c01
  • Added unit tests for event propagation from [Molecule|Reaction|Ring]Sets to ChemModel. e011035
  • More testing of flags. abb5384
  • Fix for junior job id: [ 1837692 ] Test methods should throw only one Exception. 8c38536
  • Fixed missing imports and wrapped to 80 chars fd2d2df
  • Better excpetion handling in builder3d: bc5837d
  • Fixed serialization of IAtom's with null formal charge to not cause NullPointerExceptions acc8012
  • Added unit test for serialization of null formal charges into the MDL molfile format (which currently fails) df57aea
  • Updated Javadocs for SMARTS query tool to indicate unsupported features e1da4c0
  • Cleaned up source file to remove spurious line endings 3d7adae

This overview was created with this Linux one-liner:
git log --oneline cdk-1.2.3.. | sed 's/\([a-f0-9]*\)\s\(.*\).*/<li>\2 <a href="http:\/\/\/git\/gitweb.cgi?p=cdk\/cdk;a=commit;h=\1">\1<\/a><\/li>/'
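
The sed expression is rather dense, so here is the same transformation as a minimal Python sketch. It assumes input lines in the git log --oneline format (abbreviated hash, whitespace, subject); the function name changelog_items and the commit URL prefix are placeholders made up for this example. Unlike the one-liner, it also escapes HTML special characters in the commit subject.

```python
import html
import re

# An abbreviated commit hash, whitespace, then the commit subject.
ONELINE = re.compile(r"^([0-9a-f]+)\s+(.*)$")


def changelog_items(log_lines, commit_url_prefix):
    """Turn 'git log --oneline' lines into HTML list items."""
    items = []
    for line in log_lines:
        match = ONELINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a oneline entry
        sha, subject = match.groups()
        items.append('<li>%s <a href="%s%s">%s</a></li>' % (
            html.escape(subject, quote=False),
            html.escape(commit_url_prefix),
            sha,
            sha,
        ))
    return items


# Hypothetical gitweb address, used only for this example.
PREFIX = "http://example.org/git/gitweb.cgi?p=cdk/cdk;a=commit;h="

for item in changelog_items(["743bad3 Fixed param name"], PREFIX):
    print(item)
```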

Wednesday, November 11, 2009

BlueObelisk StackExchange (.com)

Oh no, not another communication channel?! We already have Google Wave! (BTW, I have quite some new invites...)

Well, you are right. But I could not resist:

No, it is not using an Open platform, but there are plenty of Windows and Mac users among us... and the data is CC0.

Update: any question about Open Data, Open Source, or Open Standards (ODOSOS) is welcome. As well as any question on if and how some chemical question could be answered with ODOSOS tools. It is not restricted to the Blue Obelisk or the projects under the wings of the Blue Obelisk. All Open Data, Open Source, and Open Standards in chemistry is worth asking about.

Saturday, November 07, 2009

Call for Collaboration: JavaDoc validation with OpenJavaDocCheck

I reported recently about my efforts to write an Open Source DocCheck replacement. I received the first patches (from Rajarshi), and brought it online in a CDK branch (see this Nightly page).

This list shows a mix of tests that are now implemented in OpenJavaDocCheck itself, but the third line is actually a test that is plugged in and specific for the CDK. This is an important feature, I think, and allows users of OpenJavaDocCheck to add functionality that is not interesting to the general public, but very interesting for the JavaDoc being analyzed. Well, at least, it is to our CDK project :)

The current list of tests is still quite small, and consists of these tests:
  • test if each class and method has JavaDoc
  • test for missing @return tags
  • test for missing @param tags
  • test for @returns instead of @return
  • test @param template code, such as added by IDEs like Eclipse
  • test @exception template code, such as added by IDEs like Eclipse
  • test for redundant @version tags
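
To give an idea of what such tests boil down to, here is a minimal sketch of three of them in Python. It is not how OpenJavaDocCheck implements them; it simply searches the text of one JavaDoc comment with regular expressions, and the function name check_javadoc and its arguments are made up for this illustration.

```python
import re

PARAM_TAG = re.compile(r"@param\s+(\w+)")


def check_javadoc(comment, param_names, returns_value):
    """Return a list of problems found in one JavaDoc comment.

    comment       -- the raw text of the comment
    param_names   -- parameter names of the documented method
    returns_value -- True if the method does not return void
    """
    problems = []
    if re.search(r"@returns\b", comment):
        problems.append("uses @returns instead of @return")
    # \b keeps '@return' from matching the misspelled '@returns'.
    if returns_value and re.search(r"@return\b", comment) is None:
        problems.append("missing @return tag")
    documented = set(PARAM_TAG.findall(comment))
    for name in param_names:
        if name not in documented:
            problems.append("missing @param tag for " + name)
    return problems


example = """/**
 * Counts the atoms.
 *
 * @param container the molecule
 * @returns the atom count
 */"""

for problem in check_javadoc(example, ["container", "flag"], True):
    print(problem)
```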
I am now seeking feedback on the current code base, and potentially collaboration on writing more JavaDoc validation tests. There is enough to do, and I have been thinking about tests for:
  • spell checking JavaDoc
  • checking for 404s of web pages linked with <a href> in the JavaDoc
  • well-formedness of the HTML in the webpages
And about:
  • a PMD-like system to allow people to choose which testing they want or not
  • an Eclipse plugin

Wednesday, November 04, 2009

New Bioclipse Features: Kabsch Alignment, RMSD Distance and Tanimoto Similarity Matrices

We recently submitted a second paper on Bioclipse, and have worked hard in the past two weeks on addressing the reviewers' questions (and we love these feature requests! See also these two blogs). One reviewer seemed very interested in seeing docking available in Bioclipse. While we do not have a full docking feature set up for Bioclipse, we do have functionality to deal with 3D structures, though our research urged us to focus on the 2D side of cheminformatics so far.

To strengthen our intentions towards the 3D cheminformatics world, we have implemented a few new features, using CDK functionality. For example, we added Kabsch alignment and the related RMSD between molecular structures, implemented both as popup menus and as manager methods. The manager method you can see in action in MyExperiment workflow 937, which you can download directly into Bioclipse with one simple command (see Bioclipse Manager for
var smileses = new Array("CC(C)C", "CCCN", "CCC=O");

var unaligned = cdk.createMoleculeList();
for (i=0; i<smileses.length; i++) {
  mol = cdk.fromSMILES(smileses[i]);
  mol = cdk.generate3dCoordinates(mol)

var aligned = cdk.kabsch(unaligned)

for (i=1; i<aligned.size(); i++) {
Now, we do have to update the use of Jmol in Bioclipse, and a big overhaul is scheduled for the 2.4 release in February next year. But you get the idea.

As said, there are two stories to adding this new functionality. The first is the manager method, because we want all GUI interaction the user performs to be recordable (Scientist 1: What did you do to get those nice results? Scientist 2: I pushed that button in that long menu. Scientist 1: What button is that? Scientist 2: Wait, I'll send you the BSL script with a Google Wave).

The managers that allow this recording are Bioclipse-specific, and also the reason why it would not be trivial to make a general Bioclipse plugin for Eclipse... some Spring magic is used to inject the managers into the JavaScript language. Anyway, the second thing is to add a GUI element, like popup menus. This is a particular area where Eclipse excels. I did have to ask for the details, as I am not using this daily (I'm doing science, not IT), but Ola was kind enough to give me the pointers for it.

The below configuration snippet links the pop up action to Bioclipse Navigator content (you know, where your MDL SD, CML, script and other files show up in Bioclipse). But only if I have selected 3 or more files! And, only if those files are actually some molecular content with 3D coordinates! And Bioclipse inherits this functionality by using the Eclipse platform.
    label="Perform Kabsch Alignment"
      <with variable="selection">
        <count value="(2-"/>
        <iterate operator="and" ifEmpty="false">
          <adapt type="org.eclipse.core.resources.IResource">
              <test property="org.eclipse.core.resources.contentTypeId"
              <test property="org.eclipse.core.resources.contentTypeId"
              <test property="org.eclipse.core.resources.contentTypeId"
When Bioclipse is run, this looks like:

And the alignment results will nicely show up in a Jmol viewer (while it is implemented as an Eclipse editor, it is not yet):

The first screenshot also shows the new pop-up menus for calculating two matrices for 3 or more molecules. One is based on the RMSD of the 3D atomic coordinates of the atoms in the MCSS (BTW, Asad's SMSD work is making its way into the CDK library, and will be available in a later Bioclipse version too) and will create a distance matrix. The second new pop-up menu uses the Tanimoto similarity measure based on CDK fingerprints of the selected chemical graphs. If the Bioclipse Statistics feature is installed, the created CSV files will open up in a matrix editor:
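
For readers not familiar with the Tanimoto measure: for two fingerprints it is the number of bits set in both, divided by the number of bits set in at least one of them. The following is a minimal sketch of such a similarity matrix in Python, not the Bioclipse or CDK code. Fingerprints are represented as sets of on-bit positions, the three example fingerprints are made up, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
import csv
import io


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention chosen for this sketch
    return len(a & b) / union


def similarity_matrix(fingerprints):
    """Square matrix with the pairwise Tanimoto similarities."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]


def matrix_to_csv(matrix):
    """Serialize the matrix as CSV text, one row per molecule."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in matrix:
        writer.writerow("%.3f" % value for value in row)
    return out.getvalue()


# Made-up fingerprints for three molecules.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7}]
print(matrix_to_csv(similarity_matrix(fingerprints)))
```

The matrix is symmetric with ones on the diagonal, so only half of it actually needs to be computed for larger selections.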

Kabsch alignment of protein backbones is planned for a later Bioclipse release, but is an important feature for our group's proteochemometrics work.


While I am still looking around for an assistant/associate professor position, there are two milestones around my scientific work I want to briefly mention here. This blog is the 500th blog on chem-bla-ics, and the two CDK papers have combined reached 100+ citations as counted by Web-of-Science, as can be seen on my ResearcherID profile.

Bioclipse Manager for

Some time ago I wrote about using Bioclipse to query to SPARQL end point. I think I had not mentioned that I have also written a manager to download MyExperiment Bioclipse Scripting Language (BSL) scripts (though there are no GUI elements yet):
[921, 928, 889]

The returned lists give the workflow numbers for matching BSL scripts, which you can then simply download with:
> var file = myexperiment.downloadWorkflow(937)