Monday, January 21, 2013

CHEMINF example #1: encoding an InChI

For our Open PHACTS project I am converting commercial data into RDF. That process requires choices, for example which predicates to use, as outlined in our W3C note, this paper from the HCLS LODD group, or our Open PHACTS RDF Guidelines document.

One of the choices I made for ChEMBL-RDF is to use CHEMINF (doi:10.1371/journal.pone.0025513). Partly, obviously, because I co-authored the paper and think the ideas are good. In particular, the more complex encoding of the information gives us more expressiveness, including versioning of identifier schemes.

But, importantly, it defines a nice specification, which we hope will become a standard. I have been using it not just as a producer in ChEMBL-RDF, but also as a consumer of data, e.g. in Isbjørn. By adopting a specification as a standard, we reduce the amount of work needed to talk to each other: the fewer specifications, the fewer languages we need to support. So, to promote CHEMINF a bit, I will write a series of posts on how to encode particular bits of information, and increasingly show the power of CHEMINF. New examples will be added to this already existing Examples page.

I am promoting these examples among the commercial data providers. The VU team informed me that the increased triple complexity has positive effects on searching, by providing more hooks for indexing, I guess (Antonis, is that correct?). As a first example, this is how you add an InChI:

:Methane rdfs:subClassOf cheminf:CHEMINF_000000 ;
  rdfs:label "methane"@en ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    cheminf:SIO_000300 "InChI=1/CH4/h1H4"
  ] .

This example shows methane. The label and subclass info are not needed for adding the InChI; they are just here to give the example some body.

Yes, a common objection to CHEMINF is its cryptic predicates, so these need some explanation. CHEMINF_000200 is 'has attribute' and SIO_000300 is 'has value'. So, we link some attribute to the molecule, and give that attribute a value. That is the general pattern.

That leaves only the question of what the attribute is. Because the attribute is a resource itself, we can attach further information to it, for example what kind of attribute it is. In this case, it is a CHEMINF_000113, which is an 'InChI descriptor'.
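The attribute/value pattern can be sketched as a small Python helper that writes out the Turtle for any typed attribute. The function name `attribute_turtle` and the constant names are made up for this illustration; only the CHEMINF and SIO identifiers come from the example above, and the `cheminf:` prefix is assumed to be declared elsewhere.

```python
# Identifiers taken from the example in this post.
HAS_ATTRIBUTE = "cheminf:CHEMINF_000200"     # 'has attribute'
HAS_VALUE = "cheminf:SIO_000300"             # 'has value'
INCHI_DESCRIPTOR = "cheminf:CHEMINF_000113"  # 'InChI descriptor'


def attribute_turtle(subject: str, attribute_type: str, value: str) -> str:
    """Return a Turtle snippet that attaches a typed attribute with a value.

    Hypothetical helper for illustration; it only escapes backslashes and
    double quotes in the literal.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"{subject} {HAS_ATTRIBUTE} [\n"
        f"    a {attribute_type} ;\n"
        f'    {HAS_VALUE} "{escaped}"\n'
        f"  ] .\n"
    )


snippet = attribute_turtle(":Methane", INCHI_DESCRIPTOR, "InChI=1/CH4/h1H4")
print(snippet)
```

Swapping `INCHI_DESCRIPTOR` for another descriptor class is all it takes to attach a different kind of identifier with the same two predicates.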

Now, just to make the point about versioning: CHEMINF_000113 has a subclass, CHEMINF_000396, which is an InChI calculated with version 1.04 of the software. A fair question here would be which InChI layers were used(??) ... very well done indeed!
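For a consumer, the versioned subclass means that a check for "is this an InChI descriptor?" should also accept the more specific class. A minimal sketch in Python, assuming the subclass relation is kept in a hand-written dictionary (a real consumer would read it from the ontology); the names `SUBCLASS_OF` and `is_a` are hypothetical:

```python
# Subclass relation stated in this post; hard-coded here for illustration only.
SUBCLASS_OF = {
    "CHEMINF_000396": "CHEMINF_000113",  # InChI from software 1.04 -> InChI descriptor
}


def is_a(term: str, ancestor: str) -> bool:
    """Return True if term equals ancestor or is a (transitive) subclass of it."""
    current = term
    while current is not None:
        if current == ancestor:
            return True
        current = SUBCLASS_OF.get(current)  # None once we run out of parents
    return False


print(is_a("CHEMINF_000396", "CHEMINF_000113"))
print(is_a("CHEMINF_000113", "CHEMINF_000396"))
```

The first call walks up from the versioned class to the generic one; the second shows that the relation does not hold in the other direction.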


  1. Hi Egon,

    What's the value of this indirect modeling? Why not directly use the cheminf:SIO_000300 predicate off of the methane instance?


    1. Because you would not know what the value represented. Even for an InChI that may sound trivial, until you realize you do not know which layers have been calculated. For a Standard InChI it would be more straightforward, but this is exactly why they prefix the InChI with "InChI=1S/" so that you know what it is. SMILES, on the other hand, are notoriously hard to recognize: the error rate is really high in deciding if a particular field is a SMILES or not. Too many false positives.

      What you could do is define a new predicate, and axiomatize it to be equivalent to the above construct. That is, define that the predicate is a shortcut for an attribute value and the type of the value.

      In fact, it is not unlike encoding a measurement: you need a value and a type (unit for measurements).
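
The point about the "InChI=1S/" prefix in the reply above can be sketched in a few lines of Python. The function name `inchi_flavour` and its return labels are invented for this illustration; the check looks only at the prefix and says nothing about which layers are present.

```python
def inchi_flavour(value: str) -> str:
    """Classify a string by its InChI prefix only (hypothetical helper)."""
    if value.startswith("InChI=1S/"):
        return "standard"        # Standard InChI, version 1
    if value.startswith("InChI=1/"):
        return "non-standard"    # version 1, but not a Standard InChI
    return "unknown"             # no recognisable InChI prefix


print(inchi_flavour("InChI=1/CH4/h1H4"))
```

For the methane value used in the post this reports a non-standard InChI, which is exactly the case where the bare string does not tell you how it was calculated.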