Tuesday, July 31, 2007

Optical Chemical Structure Recognition

Days after the release of OSRA last week, I saw optical chemical structure recognition on the front page of my favorite Dutch /. equivalent: Duitsers leren computer chemische structuren herkennen ("Germans teach a computer to recognize chemical structures"), written by René Gerritsen. The article discusses the Fraunhofer Institute's ChemoCR, which was, IIRC, presented as a poster at last year's German Conference on Chemoinformatics (to be held again this year). Meanwhile, the mailing list had a discussion on the alternatives too; I think it is fair to say that the chemical community realizes the importance of these tools. Below is a short overview of the available tools, including some important information regarding integration into workflows.

ChemoCR seems to be proprietary software, as I could not find any download, and InfoChem seems to be the party that sells licenses. The screenshot in the article seems to show that it is written in Java, but that hardly matters if it is not open source. The project is said to have started three years ago.

CLiDE is another commercial (expensive) program to do the job. It was developed more than ten years ago, and the most recent scientific publication is from 1997 (as the webpage states).

OSRA (see my previous blog) is open source and uses the GPL license. It is written in C++. It is not as feature complete as ChemoCR yet, but that will surely come. It is certainly the youngest of these projects.

I have not picked up a copy of the paper Kekule: OCR-optical chemical (structure) recognition cited by Tony, so I cannot say much about that right now.

It is obvious that only OSRA lends itself to embedding in reproducible workflows. Debra Banville reviewed the two commercial programs CLiDE and ChemoCR last year, along with a few other text mining tools in chemoinformatics. I am curious about her opinion of the new open source tools in this arena.
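To make the workflow point a bit more concrete, here is a minimal Python sketch of how a command line recognizer could be wrapped as one scripted step. It rests on assumptions I have not verified for any particular release: that an executable named `osra` is on the PATH, that it takes the image file as its only argument, and that it prints one SMILES string per line to standard output. The function names are mine and purely illustrative.

```python
import shutil
import subprocess
from typing import List


def parse_smiles_lines(output: str) -> List[str]:
    """Split tool output into stripped, non-empty lines.

    Assumption: each non-empty line holds one recognized structure as SMILES.
    """
    return [line.strip() for line in output.splitlines() if line.strip()]


def recognize_structures(image_path: str, executable: str = "osra") -> List[str]:
    """Run the recognizer on one image and return the structures it reports.

    Assumption: the executable accepts the image path as its sole argument
    and writes its results to standard output.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    result = subprocess.run(
        [executable, image_path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tool exits non-zero
    )
    return parse_smiles_lines(result.stdout)
```

Because the step is just a function call around a freely available program, anyone can rerun it on the same image and compare results, which is what makes the workflow reproducible.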


  1. Hi Egon,
    I am also happy that Igor from the NCI came up with a nice and free Optical Structure Recognition solution.

    What about if I don't want to have my structure recognized? Should I overlay it with a gray grid or a mesmerizing color mesh?

    I sometimes think: wouldn't it be better to pay some trained people and let them copy and draw all the structures from every chemistry-related publication? The problem is that you have to pay them on and on and on, because new publications come out every day.

    Or how trustworthy are old journals? I bet that organic chemists don't use preps from before 1920, except for some very basic chemicals. So the old stuff is not that important anyway, except maybe for historic reasons. BTW I would not throw away *any* old chemistry book, like some libraries did to get some free space.

    I mean if you think about it,
    why would you first obfuscate the structures by putting them on paper and then extract the information back in a very expensive and low quality way?

    I mean everybody with a little piece of brain should get the idea of Open Data and Open Standards.
    Oohhh the Blue Obelisk fell on me :-)


  2. Hi Tobias,

    I guess if you do not want your chemistry recognized, you do it "The Biologist Way" (tm), give it a name or chemical formula and never show or mention the molecular structure :)

    I agree that every article should really have an InChI for each synthesized structure, in the experimental section.

  3. Someone should make OSRA run on Android ... and then connect it to some nice online service (like Toxicity prediction) and so forth.