Thursday, November 18, 2010

Oscar4 command line utilities

One goal of my three-month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make it easier to use, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin, and command line utilities. The Oscar4 Java API has evolved slightly in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.

Most people will mainly be interested in the full Oscar4 program, which extracts chemical entities. Oscar3 was also capable of extracting data (like NMR spectra), but that is not being ported yet. The OscarCLI program takes input, extracts chemicals, and where possible resolves them into connection tables (viz. InChI).

To extract chemicals from a line of text (e.g. "This is propane."), you do:
$ java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  This is propane.
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3
For larger chunks of text it is easier to route them via stdin, for which we can use the -stdin option:
$ echo "This is propane." | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3

That way, we can easily process large plain text files (output omitted):
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin
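
The plain text output shown earlier puts one "name: InChI" pair on each line, which makes it easy to post-process with standard tools. What follows is only a minimal sketch: it assumes every result line has exactly that layout (inferred from the propane example, not from any Oscar4 documentation), and split_oscar_output is a hypothetical helper name, not part of Oscar4.

```shell
# Hypothetical helper: turn "name: InChI=..." lines into tab-separated columns.
# Assumption: the name and the InChI are separated by the first ": InChI=" on the line.
# Lines that do not contain that separator are silently skipped.
split_oscar_output() {
  awk '{
    i = index($0, ": InChI=")            # position of the separator, 0 if absent
    if (i > 0) {
      name  = substr($0, 1, i - 1)       # everything before the colon
      inchi = substr($0, i + 2)          # skip the colon and the space
      printf "%s\t%s\n", name, inchi
    }
  }'
}

# Example with the output line from this post:
printf 'propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3\n' | split_oscar_output
```

The function reads from stdin, so it can be appended to any of the pipelines above with one more pipe.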

If you prefer RDF output, for further integration, use the -output text/turtle option:
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -output text/turtle

This returns RDF using the CHEMINF ontology like:
@prefix dc:  .
@prefix rdfs:  .
@prefix ex:  .
@prefix cheminf:  .
@prefix sio: .

  rdfs:subClassOf cheminf:CHEMINF_000000 ;
  dc:label "propane" ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    sio:SIO_000300 "InChI=1/C3H8/c1-3-2/h3H2,1-2H3" .
  ] .

We can also use Jericho to extract the text from HTML pages, which is made available with the -html option. Here we pull in a Beilstein Journal of Organic Chemistry paper with wget:
$ wget -qO- | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -html

This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.
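
Counts like these can be derived from the plain text output with standard tools. The sketch below is only an illustration under assumptions: it takes each "name: InChI=..." line to be one recognized entity and each distinct InChI to be one unique structure, and it ignores lines without an InChI. The function names are hypothetical, and this is not necessarily how the numbers above were obtained.

```shell
# Hypothetical helpers; both read "name: InChI=..." lines from stdin.

# Number of lines that carry a resolved structure.
count_entities() {
  grep -c ': InChI='
}

# Number of distinct InChI strings among those lines.
count_unique_structures() {
  awk '{ i = index($0, ": InChI="); if (i > 0) print substr($0, i + 2) }' |
    sort -u | wc -l | tr -d ' '
}

# Small made-up sample: three entities, two unique structures.
sample='propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3
methane: InChI=1/CH4/h1H4
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3'

printf '%s\n' "$sample" | count_entities            # prints 3
printf '%s\n' "$sample" | count_unique_structures   # prints 2
```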


  1. Hi Egon, some great examples, beautifully explained - a great way of showing the power of the tools and how to start using them. Regards ~ B.

  2. Thanx, B! I'm going to try to do the same for other parts of Oscar, like the Tokeniser, and the MEMM learning algorithm.

  3. Interesting stuff. Do you have any Windows examples / best practices? (I'm not so familiar with Java)

  4. It would be great to have this excellent feature as a Firefox or Google Chrome add-on, to highlight and copy-paste it directly from the browser. Just dreaming...

  5. @Will... Java should work on Windows too, but I do not yet have a nice GUI, if that is what you are after. I am working on a Taverna plugin...

    @Vladimir, indeed! Running the full Oscar as an add-on is not feasible, but if someone would run a webservice, this should not be too difficult...

    Merging everything back into the HTML might be another... both will be outside the reach of my three months on Oscar, though.