Thursday, November 18, 2010

Oscar4 command line utilities

One goal of my three month project is to take Oscar4 to the community. We want to get it used more, and we need a larger development community. Oscar4 and the related technologies do a good, sometimes excellent, job, but have to be maintained, just like any other piece of code. To make using it easier, we are developing new APIs, as well as two user-oriented applications: a Taverna 2 plugin, and command line utilities. The Oscar4 Java API has slightly evolved in the last three weeks, removing some complexity. In this post, I will introduce the command line utilities.

Most people will be mostly interested into the full Oscar4 program, to extract chemical entities. Oscar3 was also capable of extracting data (like NMR spectra), but that is not yet being ported. The OscarCLI program takes input, extracts chemicals, and where possible resolves them into connection tables (viz. InChI).

To extract chemicals from a line of text (e.g. "This is propane.", you do:
$ java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  This is propane.
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3
For larger chunks of texts it is easier to route it via stdin, for which we can use the -stdin option:
$ echo "This is propane." | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
propane: InChI=1/C3H8/c1-3-2/h3H2,1-2H3

That way, we can easily process large plain text files (output omitted):
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \

If you prefer RDF output, for further integration, use the -output text/turtle:
$ cat largeFile.txt | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -output text/turtle

This returns RDF using the CHEMINF ontology like:
@prefix dc:  .
@prefix rdfs:  .
@prefix ex:  .
@prefix cheminf:  .
@prefix sio: .

  rdfs:subClassOf cheminf:CHEMINF_000000 ;
  dc:label "propane" ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    sio:SIO_000300 "InChI=1/C3H8/c1-3-2/h3H2,1-2H3" .
  ] .

We can, however, also use Jericho to extract text from HTML pages, made available with the -html option, and pulling in a Beilstein Journal of Organic Chemistry paper with wget:
$ wget -qO- | \
  java -cp oscar4-cli-4.0-SNAPSHOT.jar \ \
  -stdin -html

This will return 271 chemical entities recognized in the text, matching 48 unique chemical structures.