Monday, November 29, 2010

Adding a new dictionary to Oscar

Say you have your own dictionary of chemical compounds, for example your company's list of yet-unpublished internal research codes. You want to index your local listserv, so that your employees can more easily search for the particular chemistry you are working on, and perhaps relate it to something done at other company sites. This is what Oscar is for.

But for that, it needs to understand things like UK-92,480. This is made possible by the Oscar4 refactorings we are currently working on: you only need to register a dedicated dictionary. Oscar4 has a default dictionary, which corresponds to the dictionary used by Oscar3, and a dictionary based on an old version of ChEBI (see this folder in the source code repository).

Adding a new dictionary is very straightforward: you just implement the IChemNameDict interface. This is, for example, what the OPSIN dictionary looks like:
public class OpsinDictionary
implements IChemNameDict, IInChIProvider {

  private URI uri;

  public OpsinDictionary() throws URISyntaxException {
    this.uri = new URI(...);
  }

  // the URI is somewhat like a namespace
  public URI getURI() {
    return uri;
  }

  // there are no stop words defined in this
  // dictionary
  public boolean hasStopWord(String queryWord) {
    return false;
  }

  // see hasStopWord()
  public Set<String> getStopWords() {
    return Collections.emptySet();
  }

  // it has the name in the dictionary if the name
  // can be converted into an InChI
  public boolean hasName(String queryName) {
    return getInChI(queryName).size() != 0;
  }

  // this dictionary can return InChIs for names
  // so, it implements the IInChIProvider interface
  public Set<String> getInChI(String queryName) {
    try {
      NameToStructure nameToStructure =
          NameToStructure.getInstance();
      OpsinResult result = nameToStructure
          .parseChemicalName(
          queryName, false
      );
      if (result.getStatus() ==
          OPSIN_RESULT_STATUS.SUCCESS) {
        Set<String> inchis = new HashSet<String>();
        String inchi = NameToInchi
            .convertResultToInChI(
            result, false
        );
        inchis.add(inchi);
        return inchis;
      }
    } catch (NameToStructureException e) {
      // the name could not be parsed: fall through
      // and report no InChI
    }
    return Collections.emptySet();
  }

  public String getInChIforShortestSMILES(
    String queryName) {
    Set<String> inchis = getInChI(queryName);
    if (inchis.size() == 0) return null;
    return inchis.iterator().next();
  }

  // since names are converted on the fly, we do
  // not enumerate them
  public Set<String> getNames(String inchi) {
    return Collections.emptySet();
  }
  public Set<String> getNames() {
    return Collections.emptySet();
  }
  public Set<String> getOrphanNames() {
    return Collections.emptySet();
  }
  public Set getChemRecords() {
    return Collections.emptySet();
  }
  public boolean hasOntologyIdentifier(
    String identifier) {
    // this dictionary does not use ontology
    // identifiers
    return false;
  }
}

Now, you can implement the interface in various ways. You can even have the implementation hook into a SQL database with JDBC, or use something else fancy. The dictionary will be used at various steps of the Oscar4 text analysis workflow.
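As a minimal sketch of one such alternative, a static dictionary of internal research codes could simply be backed by an in-memory map. Mind that the `NameDict` interface below is a hypothetical, trimmed-down stand-in for IChemNameDict (three methods only, so that the example compiles without the Oscar4 code base), and that the research code `ACME-0001` and its mapping to the InChI of methane are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a small subset of IChemNameDict,
// so that this sketch compiles without the Oscar4 code base.
interface NameDict {
  boolean hasName(String queryName);
  Set<String> getInChI(String queryName);
  Set<String> getNames();
}

// A static dictionary: every name is known up front.
class ResearchCodeDictionary implements NameDict {

  private final Map<String, String> inchiByName =
      new HashMap<String, String>();

  public ResearchCodeDictionary() {
    // made-up research code, mapped to the InChI of methane
    inchiByName.put("ACME-0001", "InChI=1S/CH4/h1H4");
  }

  public boolean hasName(String queryName) {
    return inchiByName.containsKey(queryName);
  }

  public Set<String> getInChI(String queryName) {
    String inchi = inchiByName.get(queryName);
    if (inchi == null) return Collections.emptySet();
    return Collections.singleton(inchi);
  }

  // unlike the OPSIN dictionary, this one can enumerate
  // its names
  public Set<String> getNames() {
    return Collections.unmodifiableSet(inchiByName.keySet());
  }
}
```

A JDBC-backed variant would presumably keep the same method bodies but replace the map lookups with queries.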

Mind you, the refactoring is not over yet, and the details may change here and there.

Your comments are most welcome!


  1. Bit confused here. Normally I would think of a dictionary being a static mapping of names to structures, like the existing chemnamedict.xml, and quite unlike OPSIN, which is dynamic.

    Is the story here that you've merged the two?

    Or is this a generic name-to-structures treatment that doesn't distinguish the two cases?

  2. There is now a central registry, to which anyone can contribute dictionaries. As such, the ChEBI dictionary is separate from the OPSIN one, but both can be added to the same registry.

    It's indeed important to realize the consequences of this dynamic behavior: since a dynamic dictionary like the OPSIN one returns nothing from getNames(), it does not contribute to NGram building. As a result, variations of the names it covers will not be recognized.

    Right now, ChEBIDictionary.getNames() returns a list of names, while OpsinDictionary.getNames() does not. But that should be OK, as OPSIN should recognize most valid IUPAC names anyway.

    Does that clarify things a bit?

    (Also, I have not been deep into the Oscar code for very long, and though talking to Sam, David, and Lezan really helps a lot, I may have missed an important aspect of the full Oscar analysis code...)
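
The difference between the two kinds of dictionaries can be sketched roughly as follows. This is not the Oscar4 registry code: `SimpleDict`, `StaticDict`, `DynamicDict`, and `DictRegistry` are hypothetical names, and the "ends with ane" rule is only a toy replacement for a real name parser like OPSIN. The point is that both dictionaries answer hasName(), but only the static one contributes to the set of names that something like NGram building could use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for two methods of IChemNameDict.
interface SimpleDict {
  boolean hasName(String queryName);
  Set<String> getNames();
}

// Static dictionary: a fixed set of names that can be enumerated.
class StaticDict implements SimpleDict {
  private final Set<String> names = new HashSet<String>();

  StaticDict(String... knownNames) {
    Collections.addAll(names, knownNames);
  }

  public boolean hasName(String queryName) {
    return names.contains(queryName);
  }

  public Set<String> getNames() {
    return Collections.unmodifiableSet(names);
  }
}

// Dynamic dictionary: decides per query and cannot enumerate names.
// The suffix check is a toy rule, not how OPSIN works.
class DynamicDict implements SimpleDict {
  public boolean hasName(String queryName) {
    return queryName.endsWith("ane");
  }

  public Set<String> getNames() {
    return Collections.emptySet();
  }
}

// Hypothetical registry to which any number of dictionaries can be added.
class DictRegistry {
  private final List<SimpleDict> dictionaries =
      new ArrayList<SimpleDict>();

  void register(SimpleDict dictionary) {
    dictionaries.add(dictionary);
  }

  // true if at least one registered dictionary knows the name
  boolean hasName(String queryName) {
    for (SimpleDict dictionary : dictionaries) {
      if (dictionary.hasName(queryName)) return true;
    }
    return false;
  }

  // union of all enumerable names; dynamic dictionaries add nothing
  Set<String> getAllNames() {
    Set<String> all = new HashSet<String>();
    for (SimpleDict dictionary : dictionaries) {
      all.addAll(dictionary.getNames());
    }
    return all;
  }
}
```

With a StaticDict holding "benzene" and "aspirin" plus a DynamicDict in the same registry, a lookup of "hexane" succeeds, yet "hexane" never shows up in getAllNames().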