Monday, May 19, 2008

Development of the new JChemPaint

A quick screenshot, after some work on the JChemPaint code based on CDK trunk/. Nothing much to see, but a rather small code base, which is good. Today, I have set up cdk/cdk/trunk/ and cdk/jchempaint/trunk as Eclipse plugins, allowing the second to depend on the first. So, no more use of svn:externals. This is what it now looks like, and basically formalizes the end result of Niels' work of last year:

A possible spin-off is that Bioclipse2 can use these plugins too, instead of defining plugins itself.

To reproduce the above screenshot, just import cdk/cdk/trunk and cdk/jchempaint/trunk into Eclipse, and run the TestEditor from the JChemPaint plugin.

Friday, May 16, 2008

MetWare Status Report

Following many, many others, I finally got myself a SlideShare account and uploaded a recent presentation on MetWare, our metabolomics data warehouse project. Some spoilers: SQL, RDF/SKOS, JSF.

Saturday, May 10, 2008

John Wilbanks replies to the ChemSpider/OpenData discussion

Not long after I posted my view on things, John posted his reply on the ChemSpider/OpenData discussion. His comment was merely meant to illustrate internal advice to some organization, which was accidentally leaked. Anyway, a must-read, with two good links to further reading on open data licensing.

His blog mentions the concept of public domain, where data might be dumped, but I always understood that the US public domain concept is different from that of mainland EU, German law in particular. This second 'good link' points to a license which formalizes this 'public domain' idea. And reading it, I realize that I have read it before. But I had completely forgotten about it.

A quick reread of these two links tells me that it indeed is BSD-versus-GPL all over again, with the Science Commons license on the BSD side, and CC-BY-SA on the GPL side. The first surely makes life easier for aggregators who wish to combine licenses. Can't argue with that.

Then again... what's wrong with a bit of viral character in the license? What's wrong with the statement that 'you may use my data, if I may use your aggregated data with the same license'? That limits what you can practically do, but does not limit your freedoms.

Does ChemSpider really violate Open Data with CC SA?

ChemSpider is afraid they are doing something bad because they release their data as CC-BY-SA. Because, John Wilbanks says in Peter's blog:
    I would add to it that I'd like to see a meaningful discussion of the
    risks of Share Alike and Attribution on data integration. Chemspider's
    move to CC-BY-SA fits into this discussion nicely - it's a total
    violation of the open data protocol we laid out at SC, which says "Don't
    Use CC Licenses on Data" - but it does conform inside the broader OKD.
Now, let's take this into pieces.
  1. John notes that ChemSpider is in compliance with the OKD. This means that ChemSpider thinks about Open Data just like the Open Knowledge Foundation does. I've scanned through the OKD, and it indeed seems to support the BY and SA clauses of the CC. So, ChemSpider did not do a bad thing.
  2. Data integration is tricky: you have to keep track of license information on an entry-by-entry level. For each fact, you have to track the source, and associate the source with its original license. For example, the NMRShiftDB information in ChemSpider should be GNU FDL.
  3. OpenX licenses may be viral. This holds for the GNU GPL as well as for the CC-BY-SA. Nothing new there. It just requires that when you would like to incorporate the ChemSpider data into a larger database, that database has to be CC-BY-SA too, or likely at least CC-SA.
Summarizing, I think ChemSpider did a good thing, and that ChemSpider does not violate the OpenData idea, but instead, that the CC-BY-SA and the OKD violate John's requirements for integrating data resources (apparently based on a two year legal study). That has nothing to do with ChemSpider.
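
The entry-by-entry bookkeeping from point 2 can be illustrated with a small shell sketch. This is only a toy under stated assumptions: the two tab-separated tables and the entry names are made up, and only the two source/license pairs mentioned in this post are used.

```shell
# Toy tables (hypothetical): only the licenses named in this post are used.
tmp=$(mktemp -d)
# source<TAB>license
printf 'NMRShiftDB\tGNU FDL\nChemSpider\tCC-BY-SA\n' > "$tmp/sources.tsv"
# entry<TAB>source (the entry names are made up)
printf 'entry1\tNMRShiftDB\nentry2\tChemSpider\nentry3\tNMRShiftDB\n' > "$tmp/entries.tsv"

# Look up the license of every entry via the source it came from.
annotated=$(awk -F'\t' 'NR == FNR { license[$1] = $2; next }
                        { print $1 "\t" $2 "\t" license[$2] }' \
            "$tmp/sources.tsv" "$tmp/entries.tsv")
printf '%s\n' "$annotated"

# Count the entries per license, to see which licenses the aggregate must respect.
per_license=$(printf '%s\n' "$annotated" | cut -f3 | sort | uniq -c | awk '{ $1 = $1; print }')
printf '%s\n' "$per_license"
```

Keeping the license on the source rather than on each entry means a corrected license only has to be changed in one place.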

Now, people will always have different opinions on Openness. The original BSD license had a restrictive 'advertisement' clause, not Open enough for at least the Debian Free Software Guidelines (DFSG), while still open source. The clause was later removed from the BSD license.

Another Debian example is Firefox, which is named IceWeasel in Debian, because the 'license' on the Firefox name is not open enough.

Another problem with the definition of Openness is the viral aspect of some licenses (see earlier). For some, the GPL is not open enough, because it does not give people the freedom to license their software as they like, something the BSD and MIT licenses do allow. There is ongoing debate (and that should be ongoing) on how much freedom a license must provide to be called Open. The whole OpenAccess discussion is similar (see e.g. Peter's story on this), where the discussion on the minimal amount of freedom is even worse.

Should we worry about ChemSpider being 'only' CC-BY-SA? Maybe. Data is not software, but I disagree that a viral license would be OK for software, but NOT for data. That's just BSD-versus-GPL all over again. I am happy about OpenBabel being GPL, and I am happy about ChemSpider being CC-BY-SA too.

All that said, these discussions are important. And creating good definitions of what freedoms are required is crucial in deciding whether something is Open. The Blue Obelisk does not have/use such definitions yet, and we should start discussing this, and define Blue Obelisk ODOSOS Guidelines. Please no funny jokes about how we can boogy then :)

Now, looking forward to hearing what you think about these issues... Looking forward to the other blog items!

Thursday, May 08, 2008

Re: What should a Nature Chemistry paper look like?

Neil wondered "what a Nature Chemistry paper should look like", and asked the following questions. Below are my answers.

1. HTML vs PDF: does anyone read the HTML articles? Do you read the PDF on-screen or print it out?
I typically read the HTML to scan if a paper is interesting for me. But because electronic paper is still too expensive, I typically make a print of the PDF. I would love to print the HTML instead, if only it was not cluttered with advertisements, link menus, etc. Many websites have a 'Print View' with just the content: nicely laid out, but without the menus etc. NC should adopt this feature (or did I miss that option?).

2. Big vs little graphics: what does everyone else think about the tiny size of the graphics in ACS html articles?
I hate the small figures, because they make scanning the HTML more difficult.

3a. Tagging/’semantic web’: what do you think about the toys on the RSC’s Project Prospect?
I love tagging and semantic markup. Just browse my blog. I blogged a bit about Project Prospect in the past, and also about using RDFa for semantic markup of chemistry. I must also mention the nice semantic work by the Beilstein Journal. Check the HTML source for all the semantics and the link to the paper's RDF version. I discussed some of that work earlier.

3b. What kind of things would you like to see tagged/linked to other content in Nature Chemistry?
I'd really like to see Nature pick up social tagging. For example, Euan/Ian/etc can tell you how tags from blogs etc. can be used to find other relevant literature. Show Connotea tags for NC papers on the NC website. Show related literature based on tag matching. I also recommend taking advantage of Chemical blogspace to complement papers with user comments, or at least link to them (just like linking to F1000). Regarding domain knowledge: link to whatever open databases are present, and encourage authors to provide links to public databases, e.g. by providing InChIs for the molecules they describe, PDB identifiers, etc.

4. 3D molecular structures: do these help your understanding of a paper?
Absolutely! Henry Rzepa and Christopher Braddock recently showed how one can take advantage of Jmol to explain what is going on (doi:10.1021/np0705918), but the ACS forgot to make it part of the main text :) A brilliant recent use of Jmol in explaining chemistry is Proteopedia, which uses Jmol scripts to visualize statements in the textual description in the wiki.

5. How useful to you are InChIs and SMILES?
While there is an OpenSMILES project (part of the Blue Obelisk movement) to standardize SMILES, I'd go for InChI, and InChIKey if you mind the length of the InChI itself.

6. Forward linking: do you use it? Would you use an RSS feed that alerted you to new citations of a particular paper.
I am not sure what forward linking is, so cannot comment on that. However, I would use RSS feeds to alert me of new citations of a particular paper. Right now, I am relying on Web-of-Science to do this for me, but RSS feeds are an excellent alternative. BTW, I was not aware of such feeds yet, and they could use some advertisement!

7. Would you actually comment on papers if there was a comments box at the end?
No, I would rather comment in my blog instead. That would place the comments in some perspective. See also my comment on question 3b.

8. We really like the Biochemical Society’s HTML article style – do you?
No, please do not inherit that layout. The use of frames should be discouraged anyway. It seems to be used to easily add interactivity, but I am positive that Ajax/etc can be used to do all this inline.

Saturday, May 03, 2008

Wicked chemistry and unit testing

After a discussion on starting development releases for CDK on cdk-devel, the discussion continued on the state of the CDK atom typer. Dan and Rajarshi have done tests in the past against PubChem and its DTP/NCI subset. Rajarshi made his analysis part of CDK Nightly, and provides both a summary (which seems broken: zero fails) and a detailed list.

Dan, do I understand correctly that those Structure Evaluation:No Comparision - Unparameterized Atom - S. lines in the Depositor-Supplied Comments section on PubChem are based on CDK trunk? That would be a great honor! Anyways...

The number of atom types we use to describe the chemistry we observe is overwhelming (even without charged or radical atoms). And most atom type lists are quite limited in what they represent. However, having an explicit list allows the computer to decide if it can do reasonable calculations on a structure. Always filter your data to screen for unrecognized atom types, before heading off to, for example, QSAR calculations!

Now, many fails are because of the incomplete CDK atom type list (e.g. Au in SID:413374), or because the atom typer code has a bug (e.g. SID:403517). And these screenings against PubChem provide a nice priority list. However, others are either because the used SDF format cannot represent the chemistry (e.g. SID:420394), or the entry is plain wrong (e.g. SID:301178). I am annotating the latter two types of fails for others to comment on (just tag the same page, and I'll see the comments show up).

Unit testing
For the first two types of fails, basically three things need to be done:
  1. add the atom type to the ontology
  2. write a unit test for CDKAtomTypeMatcherTest
  3. add perception code to CDKAtomTypeMatcher
Because we cannot use SMILES or file readers for writing these tests (then we would be confounding error sources), we have to hard code the chemical structure, which may be a bit cumbersome.

Unless you use the CDKSourceCodeWriter! This IChemObjectWriter creates CDK source code, starting with an IMolecule. Now, because our bug reports are derived from fails against the PubChem screening, we can simply use this BeanShell code to download a structure from PubChem and convert it to CDK source code:

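import org.openscience.cdk.Molecule;
import org.openscience.cdk.io.CDKSourceCodeWriter;
import org.openscience.cdk.io.MDLV2000Reader;

// Require the PubChem identifier as the first argument.
if (bsh.args.length == 0 || bsh.args[0] == null) {
    System.out.println("Syntax: pubchem2unittest.bsh [CID]\n");
    System.exit(0);
}

String cid = bsh.args[0];
// Note: the PubChem download URL prefix is missing here; it has to precede the identifier.
String urlString = "" + cid;

URL url = new URL(urlString);

// Read the downloaded MDL record into a CDK molecule.
MDLV2000Reader reader = new MDLV2000Reader(url.openStream());
Molecule mol = (Molecule) reader.read(new Molecule());

// Convert the molecule into CDK source code and print it.
StringWriter stringWriter = new StringWriter();
CDKSourceCodeWriter writer = new CDKSourceCodeWriter(stringWriter);
writer.write(mol);
writer.close();
System.out.println(stringWriter.toString());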

For example, I am currently debugging a sulphur atom type perception problem, for which the simplest substructure looks like (sid=12279910, InChI=1/C2H7NS/c1-4(2)3/h3H,1-2H3):

I can convert this PubChem entry to CDK source code with:
$ bsh -classpath dist/jar/cdk-svn-20080221.jar tools/pubchem2unittest.bsh 12279910
Resulting in this output which I can copy/paste into my unit test:
IMolecule mol = new Molecule();
IAtom a1 = mol.getBuilder().newAtom("S");
a1.setPoint2d(new Point2d(2.866, 0.25)); mol.addAtom(a1);
IAtom a2 = mol.getBuilder().newAtom("N");
a2.setPoint2d(new Point2d(3.7321, 0.75)); mol.addAtom(a2);
IAtom a3 = mol.getBuilder().newAtom("C");
a3.setPoint2d(new Point2d(2.0, 0.75)); mol.addAtom(a3);
IAtom a4 = mol.getBuilder().newAtom("C");
a4.setPoint2d(new Point2d(2.866, -0.75)); mol.addAtom(a4);
IAtom a5 = mol.getBuilder().newAtom("H");
a5.setPoint2d(new Point2d(2.31, 1.2869)); mol.addAtom(a5);
IAtom a6 = mol.getBuilder().newAtom("H");
a6.setPoint2d(new Point2d(1.4631, 1.06)); mol.addAtom(a6);
IAtom a7 = mol.getBuilder().newAtom("H");
a7.setPoint2d(new Point2d(1.69, 0.2131)); mol.addAtom(a7);
IAtom a8 = mol.getBuilder().newAtom("H");
a8.setPoint2d(new Point2d(2.246, -0.75)); mol.addAtom(a8);
IAtom a9 = mol.getBuilder().newAtom("H");
a9.setPoint2d(new Point2d(2.866, -1.37)); mol.addAtom(a9);
IAtom a10 = mol.getBuilder().newAtom("H");
a10.setPoint2d(new Point2d(3.486, -0.75)); mol.addAtom(a10);
IAtom a11 = mol.getBuilder().newAtom("H");
a11.setPoint2d(new Point2d(4.269, 0.44)); mol.addAtom(a11);
IBond b1 = mol.getBuilder().newBond(a1, a2, DOUBLE);
IBond b2 = mol.getBuilder().newBond(a1, a3, SINGLE);
IBond b3 = mol.getBuilder().newBond(a1, a4, SINGLE);
IBond b4 = mol.getBuilder().newBond(a2, a11, SINGLE);
IBond b5 = mol.getBuilder().newBond(a3, a5, SINGLE);
IBond b6 = mol.getBuilder().newBond(a3, a6, SINGLE);
IBond b7 = mol.getBuilder().newBond(a3, a7, SINGLE);
IBond b8 = mol.getBuilder().newBond(a4, a8, SINGLE);
IBond b9 = mol.getBuilder().newBond(a4, a9, SINGLE);
IBond b10 = mol.getBuilder().newBond(a4, a10, SINGLE);

Friday, May 02, 2008

Comparing JUnit test results between CDK trunk/ and a branch #2

I reported earlier on how to compare unit test results between CDK trunk and a branch. Later, I noted that the diff typically overestimates the fail count, when unit tests had been moved to a different module. Therefore, a sort has to be added. The code is also updated for the SVN directory restructuring:
$ cd cdk/
$ cd trunk/
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ../branches/miguelrojasch-CMLReact
$ ant -lib develjar/junit-4.3.1.jar -logfile ant.log test-all
$ cd ..
$ grep Testcase trunk/reports/*.txt | cut -d':' -f2,3 | sort > trunk.results
$ grep Testcase branches/miguelrojasch-CMLReact/reports/*.txt | cut -d':' -f2,3 | sort > branch.results
$ diff -u trunk.results branch.results
Obviously, you can still use wc for counting changes:
$ diff -u trunk.results branch.results | grep "^-Testcase" | wc -l
$ diff -u trunk.results branch.results | grep "^+Testcase" | wc -l
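The effect of the sort can be reproduced on toy data. The following is a minimal sketch, assuming report lines of the form "Testcase: name(class): FAILED"; the directories, file names and test names are made up for illustration, and the two helper functions (collect, count) are hypothetical wrappers around the commands above:

```shell
# Toy data only: hypothetical report files with made-up test names.
tmp=$(mktemp -d)
mkdir -p "$tmp/trunk/reports" "$tmp/branch/reports"

# trunk: testB fails in the first report file, testA in the second.
printf 'Testcase: testB(org.example.FooTest):\tFAILED\n' > "$tmp/trunk/reports/a.txt"
printf 'Testcase: testA(org.example.BarTest):\tFAILED\n' > "$tmp/trunk/reports/b.txt"

# branch: the same two fails, moved to other files, plus one new fail (testC).
printf 'Testcase: testA(org.example.BarTest):\tFAILED\n' > "$tmp/branch/reports/a.txt"
printf 'Testcase: testB(org.example.FooTest):\tFAILED\nTestcase: testC(org.example.FooTest):\tFAILED\n' > "$tmp/branch/reports/b.txt"

# Same extraction as above: keep only the "Testcase: name(class)" part.
collect() {
  grep Testcase "$1"/reports/*.txt | cut -d':' -f2,3
}

# Count the diff lines that start with the given sign ('-' or '+').
count() {
  diff -u "$2" "$3" | grep -c "^${1}Testcase" || true
}

collect "$tmp/trunk" > "$tmp/trunk.unsorted"
collect "$tmp/branch" > "$tmp/branch.unsorted"
sort "$tmp/trunk.unsorted" > "$tmp/trunk.results"
sort "$tmp/branch.unsorted" > "$tmp/branch.results"

unsorted_removed=$(count - "$tmp/trunk.unsorted" "$tmp/branch.unsorted")
unsorted_added=$(count + "$tmp/trunk.unsorted" "$tmp/branch.unsorted")
sorted_removed=$(count - "$tmp/trunk.results" "$tmp/branch.results")
sorted_added=$(count + "$tmp/trunk.results" "$tmp/branch.results")

echo "unsorted: -$unsorted_removed +$unsorted_added"
echo "sorted:   -$sorted_removed +$sorted_added"
```

On this toy data the unsorted comparison reports one removed and two added fails, because the moved tests show up in a different order, while the sorted comparison reports only the single new fail.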
A second improvement would be to take advantage of the ant.log files that are created anyway. Using the BeanShell tool tools/extractTestStats.bsh revision 10760 (see also this blog on bsh):
$ bsh trunk/tools/extractTestStats.bsh trunk/ant.log | grep run | grep -v total | grep -v antlogFile | cut -d' ' -f1-4 | sort > trunk.overview
$ bsh trunk/tools/extractTestStats.bsh branches/miguelrojasch-CMLReact/ant.log | grep run | grep -v total | grep -v antlogFile | cut -d' ' -f1-4 | sort > branch.overview
$ diff -u trunk.overview branch.overview