## Tuesday, July 31, 2007

### RDF-ing molecular space

RDF might be the solution we are looking for to get a grip on the huge amount of information we are facing. Microformats and RDFa are just solutions along the way, and Gleaning Resource Descriptions from Dialects of Languages (GRDDL) might be an important tool to get the web RDF-ied.

One important aspect of RDF is that any resource has a unique URI. These may look like a URL or even like urn:doi:10.1186/1471-2105-8-59. The recent blogs by Pierre (URL +1, LSID -1) and Roderic (Rethinking LSIDs versus HTTP URI) illustrate the pros and cons of the different alternatives.

bioGUID
As usual, the bioinformaticians are less conservative and ahead of chemists in trying new options, and several interesting websites have emerged. For example, bioGUID makes the bridge between a simple URI and a resolvable URL. And, importantly, it spits out RDF. This is the output for http://bioguid.info/doi:10.1109/MIS.2006.62:

    <?xml version="1.0" encoding="utf-8"?>
    <?xml-stylesheet type="text/xsl" href="http://bioguid.info/xsl/html.xsl"?>
    <rdf:RDF xmlns:bioguid="http://bioguid.info/schema/0.1/"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:rss="http://purl.org/rss/1.0/"
             xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
             xmlns:dcterms="http://purl.org/dc/terms/"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
        <rdf:type rdf:resource="http://bioguid.info/schema/0.1/Publication"/>
        <rdfs:comment>Generated by transforming XML returned by CrossRef's OpenURL service.</rdfs:comment>
        <dc:creator>Shadbolt</dc:creator>
        <dc:title>The Semantic Web Revisited</dc:title>
        <dcterms:issued>2006</dcterms:issued>
        <prism:publicationDate>2006</prism:publicationDate>
        <dc:identifier rdf:resource="doi:10.1109/MIS.2006.62"/>
        <rdfs:comment>info URI scheme</rdfs:comment>
        <dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
        <rdfs:comment>CrossRef resolver</rdfs:comment>
        <rss:link>http://dx.doi.org/10.1109/MIS.2006.62</rss:link>
        <prism:publicationName>IEEE Intelligent Systems</prism:publicationName>
        <prism:volume>21</prism:volume>
        <prism:number>3</prism:number>
        <prism:startingPage>96</prism:startingPage>
        <prism:issn>10947167</prism:issn>
      </rdf:Description>
    </rdf:RDF>

(BTW, interesting is the use of XSLT to create HTML; it's doing the opposite of GRDDL! And this is probably the right way. Cheers Roderic!)
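
To give an idea of how little is needed to consume such a record, here is a minimal sketch in Python using only the standard library. It works on a trimmed copy of the bioGUID output above, embedded as a string, so nothing is fetched over the network. A real RDF toolkit would be the better choice for anything serious; the plain XML parser is an assumption made here to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the bioGUID record shown above (XML declaration omitted).
RDF_XML = """<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bioguid.info/doi:10.1109/MIS.2006.62">
    <dc:creator>Shadbolt</dc:creator>
    <dc:title>The Semantic Web Revisited</dc:title>
    <dc:identifier rdf:resource="doi:10.1109/MIS.2006.62"/>
    <dc:identifier rdf:resource="info:doi/10.1109/MIS.2006.62"/>
    <prism:publicationName>IEEE Intelligent Systems</prism:publicationName>
  </rdf:Description>
</rdf:RDF>"""

# Prefix-to-namespace map used for the lookups below.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/1.2/basic/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

root = ET.fromstring(RDF_XML)
description = root.find("rdf:Description", NS)

# Attributes need the namespace in Clark notation: {namespace}localname.
about = description.get("{%s}about" % NS["rdf"])
title = description.findtext("dc:title", namespaces=NS)
identifiers = [
    element.get("{%s}resource" % NS["rdf"])
    for element in description.findall("dc:identifier", NS)
]

print(about)
print(title)
print(identifiers)
```
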

InChI
I wanted something similar for molecules. The unique identifier is the InChI, of course. The InChI itself is not a proper URI, so I set up a webpage to work around that (if only I had realized this some time ago, I would have urged IUPAC to use the prefix 'inchi:' instead of 'InChI='). The result currently looks like http://cb.openmolecules.net/rdf/rdf.php?InChI=1/CH4/h1H4. I do not use an XSLT yet, but will do so shortly. The RDF looks like:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:iupac="http://www.iupac.org/">
      <rdf:Description rdf:about="http://cb.openmolecules.net/rdf/?InChI=1/CH4/h1H4">
        <iupac:inchi>InChI=1/CH4/h1H4</iupac:inchi>
        <pubchem:cid xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">297</pubchem:cid>
        <pubchem:name xmlns:pubchem="http://pubchem.ncbi.nlm.nih.gov/#">methane</pubchem:name>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chemistrylabnotebook.blogspot.com/2007/04/space-final-frontier.html</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=299</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2006/12/smiles-cas-and-inchi-in-blogs.html</cb:discussedBy>
        <cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html</cb:discussedBy>
      </rdf:Description>
    </rdf:RDF>

The system uses PHP to create the output, and has a basic plugin system: a plugin basically spits out an RDF fragment for the given InChI. At this moment it only has a plugin for Cb, but I plan a few more. It needs some tuning and any and all feedback is most welcome. Note that the actual URI might change a bit.
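
The plugin idea can be sketched in a few lines. The real service is written in PHP; the Python below is only an illustration under assumptions: the function names (`inchi_to_uri`, `iupac_plugin`, `cb_plugin`, `render`) and the `BLOG_POSTS` lookup table are hypothetical stand-ins, and only the URI pattern and the element names come from the output shown above.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape, quoteattr

# URI pattern taken from the post; everything else here is illustrative.
BASE = "http://cb.openmolecules.net/rdf/rdf.php?InChI="


def inchi_to_uri(inchi):
    """Turn 'InChI=1/CH4/h1H4' into a resolvable URL."""
    prefix = "InChI="
    body = inchi[len(prefix):] if inchi.startswith(prefix) else inchi
    # Keep the layer separators readable, percent-encode anything unsafe.
    return BASE + quote(body, safe="/")


def iupac_plugin(inchi):
    """Fragment that simply states the InChI itself."""
    return "<iupac:inchi>%s</iupac:inchi>" % escape(inchi)


# Hypothetical lookup table standing in for the Cb database.
BLOG_POSTS = {
    "InChI=1/CH4/h1H4": [
        "http://chem-bla-ics.blogspot.com/2007/02/invisible-inchis.html",
    ],
}


def cb_plugin(inchi):
    """Fragment listing blog posts that discuss the molecule."""
    return "\n".join(
        '<cb:discussedBy xmlns:cb="http://cb.openmolecules.net/#">%s</cb:discussedBy>'
        % escape(url)
        for url in BLOG_POSTS.get(inchi, [])
    )


# Adding a data source means appending one more function to this list.
PLUGINS = [iupac_plugin, cb_plugin]


def render(inchi):
    """Wrap the fragments of all plugins in one rdf:Description."""
    fragments = [plugin(inchi) for plugin in PLUGINS]
    body = "\n".join(fragment for fragment in fragments if fragment)
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:iupac="http://www.iupac.org/">\n'
        "<rdf:Description rdf:about=%s>\n" % quoteattr(inchi_to_uri(inchi))
        + body
        + "\n</rdf:Description>\n</rdf:RDF>"
    )


print(render("InChI=1/CH4/h1H4"))
```

Each plugin only has to know how to describe one InChI, which is what keeps the core of the system small.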

Update: synchronized URI with implementation.

### Optical Chemical Structure Recognition

Days after the release of OSRA last week, I saw optical chemical structure recognition on the front page of my favorite Dutch /. equivalent, Tweakers.net: Duitsers leren computer chemische structuren herkennen (Germans teach computers to recognize chemical structures), written by René Gerritsen. The article discusses the Fraunhofer Institute's ChemoCR, which was, IIRC, presented as a poster at last year's German Conference on Chemoinformatics (to be held again this year). Meanwhile, the CCL.net mailing list had a discussion on the alternatives too; I think it is fair to say that the chemical community realizes the importance of these tools. Below is a short overview of the available tools, including some important information regarding integration into workflows.

ChemoCR
ChemoCR seems to be proprietary software, as I could not find any download, and InfoChem seems to be the party to sell licenses. The screenshot in the Tweakers.net article seems to show that it is written in Java, but that hardly matters if it is not open source. The project is said to have started three years ago.

CLiDE
CLiDE is another commercial (expensive) program to do the job. It was developed more than ten years ago, and the most recent scientific publication is from 1997 (as the webpage states).

OSRA
OSRA (see my previous blog) is open source and uses the GPL license. It is written in C++. It is not as feature complete as ChemoCR yet, but that will surely come. This is surely the youngest project.

Kekule
I have not picked up a copy of the paper Kekule: OCR-optical chemical (structure) recognition cited by Tony, so I cannot say much about that right now.

It is obvious that only OSRA lends itself to embedding in reproducible workflows. Debra Banville reviewed the two commercial programs CLiDE and ChemoCR last year, along with a few other text mining tools in chemoinformatics. I am curious about her opinion of the new open source tools in this arena.

## Thursday, July 26, 2007

### Further Bioclipse QSAR functionality development

I had some time to work some more on the QSAR functionality in Bioclipse. There is still much to do, but it is getting there. The calculation of a QSAR descriptor data matrix

This screenshot shows that multi-resource selection is now working, and that the calculation is now a Job. The resulting matrix looks like:

Things that remain to be done:
• work on a SDF resource
• a graph view for the matrix
• R functionality for the matrices
• JOELib support

## Friday, July 20, 2007

### OSRA: GPL-ed molecule drawing to SMILES convertor

Igor wrote a message to the CCL mailing list about OSRA:
We would like to announce a new addition to the set of chemoinformatics tools available from the Computer-Aided Drug Design Group at the NCI-Frederick. OSRA is a utility designed to convert graphical representations of chemical structures, such as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES.

OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick (GIF, JPEG, PNG, TIFF, PDF, PS etc.) and generate the SMILES representation of the molecular structure images encountered within that document.

The email does not give any information on the failure rate, but the demo they provide via the web interface does show some minor glitches (the bromine is not recognized):

The source reuses OpenBabel and uses the GPL license. Its value is equal to that of text mining tools like OSCAR3, and together they sound like the Jordan and Pippen of mining chemical literature.

### Screencasts for life science informatics

Deepak blogged about screencasting for bio topics, concentrated at bioscreencast.com of which he is co-owner. I guess it is like a YouTube for bioinformatics thingies. Jean-Claude picked this up very quickly (seen on Cb? At least I did.), and already uploaded a screencast, demoing JSpecView written by Robert. I wonder if he will upload the screencasts he made for Bioclipse too? (hint, hint ... :)

I have no idea if this site will be a success, but at least it has the right ingredients: tags, flash movies, clean UI, a blog to monitor technological changes and improvements, and a page to request screencasts (with voting). The only thing I miss is a single summary page for each screencast to which I can easily link, for example for my del.icio.us account.

## Monday, July 16, 2007

### The CDK data model #1

The Chemistry Development Kit has a rich set of data classes, each of which is defined by an interface. While the classes for atoms, bonds and a connectivity table are fairly straightforward, beyond that things are sometimes not entirely clear. I will now discuss all interfaces in a series of blog items. I'll start with the IChemFile. Christoph, please correct me if I move too far away from our Notre Dame board sketch.

IChemFile
The IChemFile is the class to hold a chemical document, e.g. an MDL molfile or a PDB file. The idea of this class is that it can hold anything we can expect from a chemical document. But nothing beyond that either; an XHTML document with embedded CML is outside the scope of an IChemFile. You might wonder why the IChemObjectReaders do not always just return an IChemFile. That would be a fair point, and many actually do, but sometimes it is handier to return an IMolecule. A reader for MDL molfiles would be expected to return an IMolecule.

However, a document may contain much more, and the approach taken by the CDK is that a file contains one or more models. An MDL molfile is an example document with one model, while an MDL SD file would be a document with more than one model.

IChemSequence
However, the IChemFile can hold more than one IChemSequence.
Now, I honestly cannot remember why that is; a single IChemSequence should be enough. And, I actually do not remember more than one IChemSequence being used. (Anyone?) As said, the IChemSequence contains IChemModels, and nothing more really. The interface therefore just contains the basic logic of a list. Let's move on.

IChemModel
The IChemModel is much more interesting. In the CDK a model is defined as anything that occurs in one actual volume of 3D (or 2D) space. A CIF file with a crystal structure is, therefore, one IChemModel. A supramolecular aggregation of lipids, e.g. a mono- or bilayer, would be an IChemModel too. This could be a time step in a molecular dynamics run. Additionally, the IChemModel may also be a chemical reaction, possibly a multistep reaction. It could be, for example, an enzyme reaction mechanism entry from the MACiE database. These three types of content are captured in the ICrystal, IMoleculeSet, and IReactionSet.

Some Examples
A CIF file would be read as an IChemFile containing an IChemSequence with one IChemModel containing an ICrystal. An MDL molfile would be read as an IChemFile containing an IChemSequence with one IChemModel containing an IMoleculeSet with one IMolecule. An MDL SD file, however, would be read as an IChemFile with an IChemSequence with as many IChemModels as there are molecules in the SD file; each IChemModel would contain an IMoleculeSet with only one IMolecule. This is counter-intuitive, because one may expect the SD file, which is a set of molecules, to be stored in a single IMoleculeSet.
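
The nesting is easier to see when written out. The sketch below mirrors the containment hierarchy with plain Python dataclasses; it is not the CDK API (which is Java), and the class names without the leading "I", the `read_molfile` and `read_sd_file` helpers, and the molecule names are all hypothetical and only serve to illustrate the structure described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Molecule:
    name: str


@dataclass
class MoleculeSet:
    molecules: List[Molecule] = field(default_factory=list)


@dataclass
class Crystal:
    name: str


@dataclass
class ChemModel:
    # One model holds whatever occupies one volume of space.
    molecule_set: Optional[MoleculeSet] = None
    crystal: Optional[Crystal] = None


@dataclass
class ChemSequence:
    # Nothing more than a list of models.
    models: List[ChemModel] = field(default_factory=list)


@dataclass
class ChemFile:
    sequences: List[ChemSequence] = field(default_factory=list)


def read_molfile(name):
    """A molfile: one sequence, one model, one molecule."""
    model = ChemModel(molecule_set=MoleculeSet([Molecule(name)]))
    return ChemFile([ChemSequence([model])])


def read_sd_file(names):
    """An SD file: one sequence, one model per molecule."""
    models = [
        ChemModel(molecule_set=MoleculeSet([Molecule(name)])) for name in names
    ]
    return ChemFile([ChemSequence(models)])


sd = read_sd_file(["methane", "ethane", "propane"])
for model in sd.sequences[0].models:
    print([molecule.name for molecule in model.molecule_set.molecules])
```

Note how the SD file ends up as three models with one molecule each, rather than one model with three molecules.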

Enough for tonight. More later. For the impatient, previously I wrote up a short blog about the update notification scheme in the CDK interfaces.

### The Open Science Notebook 10 years ago

So, with all these people blogging about the Open Science Notebook (yes, each word is one distinct blog) it is worth looking back in time. To make clear what I put under the OSN: a notebook in which experimental details and outcome are written down.
So, what did the OSN look like almost ten years ago?

It looked like the early open source chemoinformatics projects, such as CompChem and JMDraw set up by Christoph (the SourceForge projects have, unfortunately, been deleted; so I cannot link to the original project pages). JChemPaint and Jmol also originate from those years.

These projects were OSNs avant la lettre: an experiment in chemoinformatics is the definition of a new (or reformulation of an old) algorithm, writing down the experiment (source code in this case), uploading it into a repository (Open Science!) for everyone to comment on, possibly sending an announcement to a mailing list for discussion, and reporting the outcome (preferably in a peer-reviewed journal). While I am ranting^Wtalking about the issues, chemoinformatics is in the luxurious situation that reproducibility of a procedure is much easier, except for the missing data part.

Just wanted to say that OSN is really nothing new, not to chemistry anyway. Maybe for lab chemists. Jean-Claude has been very successful in promoting these open science ideas among lab chemists, and I congratulate him on the exposure in all those magazine interviews lately. Cheers!

Open Science versus Open Source
Oh, and let me make the distinction between open source in general and open science. Much of the current open source software in chemistry(/chemoinformatics) is not open science. Open science means that every step in the development process is open, whereas many chemoinformatics programs are dumped into the open source sphere at the end. That is not the way it should be.

For the lab chemists: ^W is a shortcut for 'delete the previous word'.

## Saturday, July 14, 2007

### CDK Literature #2

Second in a series of articles summarizing articles that cite one of the main CDK articles for CDK News. The first CDK Literature was already half a year ago, so it was about time.

Bioclipse

Nothing much I have to say about that. Just browse my blog and you'll see that it heavily uses CDK, JChemPaint and Jmol. See also the Bioclipse blog.
Ola Spjuth, Tobias Helmus, Egon Willighagen, Stefan Kuhn, Martin Eklund, Johannes Wagener, Peter Murray-Rust, Christoph Steinbeck, Jarl Wikberg, Bioclipse: an open source workbench for chemo- and bioinformatics, BMC Bioinformatics, 2007, 8(59), doi:10.1186/1471-2105-8-59

Proteomics in 2005/2006

Review article on proteomics which mentions the CDK and JChemPaint in the data analysis section, but it does not cite them. It does cite the Bioclipse article though.
Jeffrey Smith, Jean-Philippe Lambert, Fred Elisma, Daniel Figeys, Proteomics in 2005/2006: Developments, applications and challenges, Analytical Chemistry, 2007, 79(12):4325-4343, doi:10.1021/ac070741j

Combinatorial Enumeration

Article by Andreas on SmiLib (BSD-like license), which is a library for combinatorial enumeration using building blocks. The CDK is used for the addition of explicit hydrogens and the creation of MDL SD files. Andreas mentions in the article that the CDK's SMILES parser ignores stereochemistry.
Andreas Schüller, Volker Hänke, Gisbert Schneider, SmiLib v2.0: A Java-Based Tool for Rapid Combinatorial Library Enumeration, QSAR & Combinatorial Science, 2007, 26(3):407-410, doi:10.1002/qsar.200630101

Molecular Query Language

This article is also from the group of Gisbert. Ewgenij introduces an open standard SMARTS replacement, covered in CDK News in 2005. There is an interface to the CDK, but the license of the reference implementation makes it impossible to distribute it with the CDK itself. This is rather unfortunate, because had it been possible, a number of implementations in the CDK, such as atom type perception, could be based on MQL. See also Jörg's blog on MQL.
Ewgenij Proschak, Jörg Wegner, Andreas Schüller, Gisbert Schneider, Uli Fechner, J. Chem. Inf. Model., 2007, 47(2):295-301, doi:10.1021/ci600305h

Golden Rules in Mass Spectroscopy

Tobias Kind wrote about structure elucidation using mass spectra, and discusses MolGen and CDK's DeterministicStructureGenerator, and mentions problems with both generators. He has been in contact with the CDK and recently did extensive tests.
Tobias Kind and Oliver Fiehn, Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry, BMC Bioinformatics, 2007, 8:105, doi:10.1186/1471-2105-8-105

## Friday, July 13, 2007

### Inter- and Extrapolation: the NMR shift prediction debate

Chemical blogspace has seen a lengthy discussion on the quality of a few NMR shift prediction programs, and Ryan wanted to make a final statement. Further down in his blog item he had this quote from Jeff, discussing the use of the NMRShiftDB as external test set:

“Of course customers are really interested in how accurately a prediction program can predict THEIR molecules - not a collection of external data such as NMRShiftDB.”

I'm sure none of us knows what weird chemistry people are doing; we will never know what the overlap of the NMRShiftDB test set with the customer data set is. The quote suggests it is low, but we simply do not know.

Interpolation and Extrapolation
The accuracy of prediction models is very difficult to grasp, and one can only estimate it, using a test set. If little data is available, one may opt for using the training set as test set too, which gives an estimate of whether the modeling method is able to predict at all. However, the outcome of this exercise is the worst possible estimate you can make. So, when possible you use an independent test set, which does not contain any molecules that were present in the training set. (Actually, one could even suggest that this must happen on a shift level, but that gives problems with HOSE-code based prediction.)

Now, what Ryan stresses in his latest blog item is that if the prediction test results for the various available methods do not explicitly state the amount of overlap between the training and test set, one cannot draw any conclusions. Agreed. I would, however, like to tune this even a bit further, after reading the stupid quote (of course, taken out of context). What Jeff probably aimed at is that the prediction accuracy is only meaningful to a customer if there is considerable overlap between the customer's data set and the test set, which is what the model makers do not know.
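
Reporting that overlap is cheap once every molecule has a unique identifier. A minimal sketch, assuming molecules are compared by their InChI strings; the function name `overlap_fraction` and the toy data sets are made up for illustration:

```python
def overlap_fraction(training_ids, test_ids):
    """Fraction of the test set that also occurs in the training set."""
    test = set(test_ids)
    if not test:
        return 0.0
    return len(test & set(training_ids)) / len(test)


# Toy data sets, identified by InChI.
training = [
    "InChI=1/CH4/h1H4",                     # methane
    "InChI=1/C2H6/c1-2/h1-2H3",             # ethane
]
test = [
    "InChI=1/C2H6/c1-2/h1-2H3",             # ethane, also in the training set
    "InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H",    # benzene, new
]

print(overlap_fraction(training, test))
```

An overlap of zero corresponds to the fully independent test set; an overlap of one is the case where the training set doubles as test set.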

And the overlap actually goes beyond the overlap in terms of molecular identity. It is really the overlap in terms of molecular substructures that matters: a database with alkanes but no phenyl rings will more accurately predict other alkanes not present in the training set (interpolation), but will not accurately predict compounds with phenyl rings (extrapolation). What the customer needs is that his personal data set does not require extrapolation. That is what matters.
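
A crude way to make this concrete is to describe each molecule by the set of substructures it contains and to ask which substructures of a query molecule the training set has never seen. The sketch below does just that; the substructure labels are hand-assigned strings for illustration (real systems would use something like HOSE codes), and the function name is hypothetical.

```python
def unseen_substructures(training_set, query):
    """Substructures of the query that no training molecule contains."""
    seen = set().union(*training_set)
    return set(query) - seen


# Training set with alkanes only, each molecule as a set of substructure labels.
alkanes = [
    {"CH3", "CH2"},          # e.g. propane
    {"CH3", "CH2", "CH"},    # e.g. a branched alkane
]

hexane = {"CH3", "CH2"}
toluene = {"CH3", "phenyl"}

# An empty result means interpolation; anything left over needs extrapolation.
print(unseen_substructures(alkanes, hexane))
print(unseen_substructures(alkanes, toluene))
```

For the customer, the interesting number is how often the second case occurs in his own data set.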

It is interesting to realize, however, that the NMRShiftDB allows you to upload your molecules, or alternatively, you download the software (it's open source) and the data (it's open data) if you don't want to send your molecules over the internet, and the NMRShiftDB software will automatically take into account your own data set.

Thus, if you are working on a series of related molecules, you can extend the NMRShiftDB data set with already elucidated structures, reducing the prediction error for your related, as yet unknown derivatives. It is that easy to include prior/expert knowledge in the NMRShiftDB. I believe the ACD/Labs software allows this too, so the quote is really meaningless. Not correct, not wrong, it simply says nothing.

Open Data, Open Source, Open Standards
Now, the various releases of the ACD/Labs software show a simple, understandable trend: increasing the amount of data you use for the training set reduces the prediction error. That's because of various reasons I will not go into in this item. The ACD/Labs NMR databases are expensive, because they have to manually extract and validate the data from literature (see The Purgatory Database); so, during my PhD I only bought the CNMR and HNMR prediction packages. (Off topic: two weeks after I received my copies of the software, ACD/Labs released a new version, which they kindly sent me a copy of too. Common in opensource, but much appreciated at that time. Cheers, ACD/Labs!)

The ACD/Labs databases are likely expensive because of various reasons. And this is where the ODOSOS concept of the Blue Obelisk comes in. Open Data: if publishers would not copyright their data, NMR databases would be much cheaper to set up (see this thread in Peter's blog); assuming ACD/Labs has to pay publishers for actually setting up their database. Open Source: the various Blue Obelisk projects provide the tools to automatically create a purgatory NMR database; no humans needed for that any more. Open Standards: the data from the NMRShiftDB can be downloaded in various formats, among which CMLSpect. Being able to easily read the data, made it possible that we actually have this discussion. Sure, the open data part of the NMRShiftDB is crucial too! But the database could have used an obscure, binary, undocumented, with many software tweaks and special cases, .doc-like format, which no one could support.

Clearly, ODOSOS gives all, even proprietary, NMR prediction tools a boost, and I am very happy to see that happen. It is the point that we, the Blue Obelisk Movement, are trying to make for some time now.

## Sunday, July 08, 2007

### That big pile of paper...

Every one of us knows that big pile of paper on the desk that contains the things we want to read, scan or just browse. I even have an electronic equivalent. Another pile contains leaflets and glossy folders from conferences, like the ACS meeting in Chicago. OK, I am going to get rid of those last ones, and will shortly put the links here.

The first leaflet is from Chemistry Central, one of the open access publishers. Actually, not just open access as in free access, but open access as in freedom to reuse it. One thing I noticed is this text: Our submission system also allows authors to upload figures and reactions schemes in ChemDraw or ISIS/Draw file formats. What about CMLReact and CML itself? Those are formats I can author with my Blue Obelisk tools.

Then there is the proprietary Sarchitect in the area of QSAR/QSPR/ADMET. No idea about the scope or whatever. Oh, make sure to check out QSAR world, where Andreas has a column too. I also have some information on the RSC Virtual Library, which provides free access to the RSC journals for RSC members. But I am not one. Green Chemistry is nice for the environment, of course, but according to the EPA, it's about more: Cleaner, cheaper, smarter chemistry. Why, oh why, does this financial incentive have to be present all the time? Are we, humans, really that stupid?

I'm sure I had more advertorials, but these must have been the highlights.

## Friday, July 06, 2007

### Standing on the shoulders of ... the Blue Obelisk

The Seven stones wondered what to do with a petaflop in science, in response to Declan's The petaflop challenge in Nature. Declan discusses in this commentary the increase in computing power and the necessity of parallel programming to make use of it. Now, I do have some ideas (e.g. enumerating metabolomic space, mining the RDF graph of our collective biological and chemical knowledge base for the one hundred most supported contradictions), but that is not what I want to talk about. It is this fragment from Declan's piece:
"I'm amazed at what he can do just using open-source libraries," [Horst Simon] says. Although there are exceptions, such as high-energy physics and bioinformatics, many labs keep their software development close to their chests, for fear that their competitors will put it to better use and get the credit for the academic application of the program. There is little incentive to get the software out there, says Simon, and such attitudes plague development.

This is something that is very familiar to many of us: developing algorithms for scientific problems is not appreciated. The way the scientific community currently deals with algorithms and data worries me very much; it seems the community does not care about correctness or improvement at all, as long as the result illustrates what they think the (bio)chemical reality has to offer. At least, that is what effectively happens if they do not give proper credit to the scientific importance of software development.

Of course, scientific credibility of software depends on the open source nature of the software: "Given enough eyeballs, all bugs are shallow", The Cathedral and the Bazaar, E.S. Raymond. Or, in more traditional wording: science, and scientific software, must be reproducible and/or falsifiable. The Blue Obelisk Movement is trying to achieve this (DOI:10.1021/ci050400b).

The open source challenge
Therefore, I hereby challenge all experimental chemists and biologists to acknowledge the amount of scientific software they already use, and give credit where credit is due. I challenge them to stand up and say that chemo- and bioinformaticians provide the methods they rely on daily to achieve their goals. I challenge them to say that they stand on the shoulders of scientific software developers.

The article should not have been called The petaflop challenge, but The open source challenge.

## Sunday, July 01, 2007

### Atom typing in the CDK

Atom typing is one of the principal activities in chemoinformatics. Atom types provide additional information that cannot be derived from the connection table that is being used, or may define what force field terms should be used. This makes perception of atom types very important.

The CDK has a few places where atom types are perceived. The HydrogenAdder and ValencyChecker are two examples. Getting the perception wrong makes it impossible to correctly add hydrogens (of course, hydrogens should always be explicit!). For a long time, these perception algorithms have been embedded in the classes that used them, but efforts have been undertaken to refactor the algorithms into separate classes. These can be found in the package cdk/atomtype/.

Different applications, different scheme
Now, the CDK can be a bit confusing with respect to the HydrogenAdder and IValencyChecker. Originally, the CDK had only one atom type list, the StructGen Atom Types. This list was used by the deterministic structure generator (and still is), and only defined atom types for neutral atoms, and does not know anything about hybridization states.

The first bug reports dropped in when people applied the HydrogenAdder to charged molecules. As said, charged atoms were not defined, so the algorithm failed: not with an error, it just gave the wrong answer. Therefore, the Valency Atom Types list was set up, which does include charged atoms. Everyone happy again.
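
Why a neutral-only list gives wrong hydrogen counts for charged atoms is easy to show with a toy example. This is not the CDK algorithm; the two tables, their valence values, and the function names are simplified assumptions that only illustrate the idea of looking up how many bonds an atom type allows and filling the remainder with hydrogens.

```python
# (element, formal charge) -> total bond order the atom type allows.
# Illustrative values for a handful of common cases only.
NEUTRAL_TYPES = {("C", 0): 4, ("N", 0): 3, ("O", 0): 2}
CHARGED_TYPES = dict(NEUTRAL_TYPES)
CHARGED_TYPES.update({("N", 1): 4, ("O", -1): 1})


def hydrogens_neutral_only(element, bond_order_sum):
    """Charge-blind lookup, all that a list of neutral types can offer."""
    return NEUTRAL_TYPES[(element, 0)] - bond_order_sum


def hydrogens_charge_aware(element, charge, bond_order_sum):
    """Lookup in a list that also defines charged atom types."""
    return CHARGED_TYPES[(element, charge)] - bond_order_sum


# Ammonium nitrogen (N+, no bonds to heavy atoms): should get 4 hydrogens.
print(hydrogens_neutral_only("N", 0))        # treats it as ammonia
print(hydrogens_charge_aware("N", 1, 0))

# Alkoxide oxygen (O-, one single bond to carbon): should get no hydrogens.
print(hydrogens_neutral_only("O", 1))        # treats it as an alcohol
print(hydrogens_charge_aware("O", -1, 1))
```

Both charge-blind calls return a perfectly plausible number, which is exactly why such a failure goes unnoticed until someone checks the output.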

Later, bugs were reported about the SMILES parser, which comes with additional problems: bond orders are not explicit, and have to be deduced from the connectivity; atom type perception is the only way to decide how many bonds an atom should have, and with what bond order. However, SMILES defines hybridization states, and the CDK did not have an atom type list with hybridization information. So, while the Valency Atom Types list was extended from the StructGen Atom Type List, a new list was created extending from the Valency Atom Type list: the Hybridization Atom Types list.

Since then, applications asked for other atom type lists, such as the MM2, MMFF94, PDB, and Sybyl atom types. The first two are used for the force field code in the CDK, while the latter two are used for the respective IChemObjectReaders.

JUnit testing the perceivers
Not all applications actually already make use of the new atom type perception classes in cdk.atomtype. We want these to be well tested before they replace code in the classes that use those atom types. Therefore, Rajarshi and I have been working on JUnit test suites. The latest step in this process was that I transformed the test classes to extend a new JUnit4-based AbstractAtomTypeTest class. New in this class is that it reports which atom types in the atom type list have been tested, and the test will fail if not all atom types are tested. The StructGen Atom Types list is mostly covered now, but for all other lists tests still have to be written (monitor the progress on CDK Nightly).
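
The coverage idea itself is independent of JUnit. Here is a minimal Python sketch of it, not the actual AbstractAtomTypeTest: the class name `AtomTypeCoverage`, its methods, and the three atom type names are hypothetical and chosen only to show the bookkeeping.

```python
class AtomTypeCoverage:
    """Tracks which atom types of a list have been exercised by tests."""

    def __init__(self, atom_type_names):
        self.expected = set(atom_type_names)
        self.tested = set()

    def record(self, name):
        """Called by each test for the atom type it has just checked."""
        if name not in self.expected:
            raise KeyError("unknown atom type: %s" % name)
        self.tested.add(name)

    def untested(self):
        return sorted(self.expected - self.tested)

    def assert_complete(self):
        """Final check: fails as long as any atom type has no test."""
        missing = self.untested()
        if missing:
            raise AssertionError("untested atom types: %s" % missing)


# Hypothetical three-entry atom type list.
coverage = AtomTypeCoverage(["C4", "N3", "O2"])
coverage.record("C4")
coverage.record("N3")
print(coverage.untested())
```

Running the final check at the end of a test suite turns a forgotten atom type into a failing test instead of a silent gap.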

For the MOL2 atom type list, there is no Java implementation of the IAtomTypeMatcher, but we have Fortran code that can be ported (provided by Martin Ott). Anyone interested?