Thursday, January 21, 2010

Extracting RDF from Chem4Word documents

Joe has released the first Chem4Word demo file, and has written about how to extract the CML with Java and with C#.

I haven't actually gotten around to fiddling with the Java code yet, but with the Strigi-Chemistry plugins installed I ran Strigi against the file to extract RDF. This is part of the RDF that came out:
    "acetic acid",
    "(8R,9S,10R,13S,14S,17S)- 17-hydroxy-10,13-dimethyl- 1,2,6,7,8,9,11,12,14,15,16,17-dodecahydrocyclopenta[a] phenanthren-3-one",
I believe there is quite some room for improvement, but it's a start :) Thanks to Joe for posting the public domain test file, so that other projects can start playing with this exciting new technology. I should note, however, that I am running neither a Microsoft OS nor MS-Word, so the source of the saved document is the only way I have access to the CML right now.
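
For those who, like me, only have the saved document to work with, here is a rough sketch of how the CML could be located without Word. It is not Joe's Java or C# code and not what Strigi does internally. It assumes the demo file is an ordinary zipped Office Open XML package and that the chemistry is stored as elements in the CML namespace `http://www.xml-cml.org/schema`; the function name `find_cml_molecules` is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Assumption: the embedded chemistry uses the standard CML namespace.
CML_NS = "http://www.xml-cml.org/schema"


def find_cml_molecules(docx_path):
    """Return (part name, molecule id) pairs for each CML molecule in the package.

    docx_path may be a file name or a file-like object holding the zip archive.
    """
    found = []
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            # Only XML parts can contain CML; skip images, rels, etc.
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.fromstring(package.read(name))
            except ET.ParseError:
                continue  # ignore parts that are not well-formed XML
            # iter() walks the whole tree, including the root element itself.
            for molecule in root.iter("{%s}molecule" % CML_NS):
                found.append((name, molecule.get("id")))
    return found
```

Because it scans every XML part rather than a fixed path, the sketch does not depend on where exactly Chem4Word puts the CML inside the package, at the cost of parsing some parts that turn out to be irrelevant.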