Wednesday, January 30, 2013

CHEMINF example #2: database identifiers

As a second example (see this first example), I will give a few pointers on how you can semantically link database identifiers to molecules with CHEMINF. The CHEMINF example page lists a number of databases, reproduced here:

CHEMINF_000405  ChemSpider identifier
CHEMINF_000406  DrugBank identifier
CHEMINF_000407  ChEBI identifier
CHEMINF_000408  HMDB identifier
CHEMINF_000409  KEGG identifier
CHEMINF_000410  Wikipedia identifier
CHEMINF_000411  Reactome identifier
CHEMINF_000412  PubChem identifier

The matching CHEMINF encoding has the same structure as that for the InChI: just replace CHEMINF_000113 (the InChI class) with the class for the database identifier. For example, for HMDB and Wikipedia we get:

:Methane rdfs:subClassOf cheminf:CHEMINF_000000 ;
  rdfs:label "methane"@en ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000408 ;
    cheminf:SIO_000300 "HMDB02714"
  ] ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000410 ;
    cheminf:SIO_000300 "Methane"
  ] .

A full overview of database identifiers is, obviously, available from the ontology itself.
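Because the pattern is the same for every database, the Turtle can also be generated. The following is a minimal Python sketch, not part of CHEMINF itself: the function names `escape_literal` and `identifier_turtle` are made up for this illustration, the class identifiers are the ones listed above, and the prefix declarations are left out, just as in the example.

```python
# CHEMINF classes for database identifiers, copied from the list in this post.
DATABASE_IDENTIFIER_CLASSES = {
    "ChemSpider": "CHEMINF_000405",
    "DrugBank": "CHEMINF_000406",
    "ChEBI": "CHEMINF_000407",
    "HMDB": "CHEMINF_000408",
    "KEGG": "CHEMINF_000409",
    "Wikipedia": "CHEMINF_000410",
    "Reactome": "CHEMINF_000411",
    "PubChem": "CHEMINF_000412",
}


def escape_literal(value):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def identifier_turtle(subject, identifiers):
    """Return Turtle linking `subject` to one blank node per database identifier.

    `identifiers` maps a database name (a key of DATABASE_IDENTIFIER_CLASSES)
    to the identifier value in that database.
    """
    nodes = []
    for database, value in identifiers.items():
        cheminf_class = DATABASE_IDENTIFIER_CLASSES[database]
        nodes.append(
            "  cheminf:CHEMINF_000200 [\n"            # 'has attribute'
            f"    a cheminf:{cheminf_class} ;\n"       # kind of identifier
            f'    cheminf:SIO_000300 "{escape_literal(value)}"\n'  # 'has value'
            "  ]"
        )
    return f"{subject}\n" + " ;\n".join(nodes) + " .\n"


turtle = identifier_turtle(":Methane", {"HMDB": "HMDB02714", "Wikipedia": "Methane"})
print(turtle)
```

Adding another database then only means adding one more key to the dictionary passed in.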

Tuesday, January 22, 2013

ToxBank: the next generation toxicology

Update: the paper should now be freely downloadable.

Before I moved to my current position in Maastricht, I had the great pleasure to work with Prof. Roland Grafström (check his pathway bioinformatics done with his then PhD Rebecca) and Prof. Bengt Fadeel at the Karolinska Institutet. During that year I worked part-time on ToxBank and part-time on nano-QSAR, covering semantics, predictive toxicology, and Open Data. This blog post is about the ToxBank work.

I promised fireworks, and the first rockets are heading upwards: a key ToxBank paper has now been published in Molecular Informatics. Pekka Kohonen wrote up a nice overview of the ToxBank project, the current platform (based on RDF, REST, ISATab, and OpenTox (my archives)), and the test compounds that the SEURAT-1 cluster identified. Various bioinformatics approaches were used to visualize the diversity of the selected compounds. The idea is that all EU FP7 projects in the SEURAT-1 cluster (consisting of six consortia) will test at least these compounds, creating a rich set of toxicology-related data for them.

As a temporary, quick solution I proposed the Semantic MediaWiki to create a semantic knowledge base, which was extensively and very productively continued by David from Leadscope. This way, we could easily list all compounds, by doing a search, rather than manually adding them:

Each compound has extensive information on the mode of action, physicochemical properties and more (such as here for acetaminophen):

All this information is available as semantic data. For example, check this link. Network and Gene Ontology analyses on these compounds have been performed and are presented in the paper, further confirming the diversity of the compound set. This opens up possible integration of their work with WikiPathways and PathVisio, and I will do my best to get the right people talking to each other.

The ToxBank project further develops Open Source software for an online data warehouse for hosting experimental data on these compounds. The warehouse builds on a mix of approaches, including OpenTox (RDF and REST(-like)-based), ISATab, and various ontologies.

In designing their software, they use an approach that is pretty unique for EU projects, based on formal requirements analysis protocols and resulting in a user-oriented platform. Now, there is much to say about who the user is. In fact, there are multiple user types, called personas, and ToxBank takes that idea into account.

Therefore, in many ways, ToxBank is, in my humble but somewhat biased opinion, a project that leads the (predictive) toxicology community into a new era. Congratulations to the full ToxBank consortium! It was great being part of it!

Kohonen, P., Benfenati, E., Bower, D., Ceder, R., Crump, M., Cross, K., Grafström, R., Healy, L., Helma, C., Jeliazkova, N., Jeliazkov, V., Maggioni, S., Miller, S., Myatt, G., Rautenberg, M., Stacey, G., Willighagen, E., Wiseman, J., & Hardy, B. (2013). The ToxBank Data Warehouse: Supporting the Replacement of In Vivo Repeated Dose Systemic Toxicity Testing. Molecular Informatics. DOI: 10.1002/minf.201200114

Monday, January 21, 2013

CHEMINF example #1: encoding an InChI

For our Open PHACTS project I am converting commercial data into RDF. That process requires choices to be made. One of them is which predicates to use, as outlined in our W3C note, this paper from the HCLS LODD group, or our Open PHACTS RDF Guidelines document.

One of the choices I made for ChEMBL-RDF is to use CHEMINF (doi:10.1371/journal.pone.0025513). Partly, obviously, because I co-authored the paper and think the ideas are good. In particular, the more complex encoding of the information gives us more expressiveness, including versioning of identifier schemes, etc.

But, importantly, it defines a nice specification, which I hope will become a standard. I have been using it not just as a producer in ChEMBL-RDF, but also as a consumer of data, e.g. in Isbjørn. By adopting a specification as a standard, we reduce the amount of work needed to talk to each other: the fewer specifications, the fewer languages we need to support. So, to promote CHEMINF a bit, I will post a series of posts on how to encode particular bits of information and, increasingly, show the power of CHEMINF. New examples will be added to this already existing Examples page.

As a first example, this is how you add an InChI, and I am promoting these examples among the commercial data providers. The VU team informed me that the increased triple complexity has positive effects on the searching, providing more hooks for indexing, I guess (Antonis, is that correct?):

:Methane rdfs:subClassOf cheminf:CHEMINF_000000 ;
  rdfs:label "methane"@en ;
  cheminf:CHEMINF_000200 [
    a cheminf:CHEMINF_000113 ;
    cheminf:SIO_000300 "InChI=1/CH4/h1H4"
  ] .

This example shows methane. The label and subclass info are not needed for adding the InChI, and are just here to give the example some body.

Yes, a common objection against CHEMINF is the cryptic predicates, so they need some explanation. CHEMINF_000200 is 'has attribute' and SIO_000300 is 'has value'. So, we are linking some attribute to the molecule, and giving that attribute a value. That is the general pattern.

That leaves us only with the question of what that attribute is. By having the attribute as a resource itself, we can associate further information with it, for example what kind of attribute it is. In this case, it is a CHEMINF_000113, which is an 'InChI descriptor'.
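To make the pattern concrete, here is a small Python sketch that stores the methane example as plain (subject, predicate, object) tuples and follows 'has attribute', then the attribute type, then 'has value'. It is an illustration only: the blank node label `_:b0` and the function name `attribute_values` are made up here, and a real application would use an RDF library and SPARQL instead.

```python
# Predicates from the post, using the same prefixed names as the Turtle example.
HAS_ATTRIBUTE = "cheminf:CHEMINF_000200"  # 'has attribute'
HAS_VALUE = "cheminf:SIO_000300"          # 'has value'
RDF_TYPE = "rdf:type"                     # written as 'a' in Turtle

# The methane example as triples; "_:b0" is a hypothetical blank node label.
triples = [
    (":Methane", HAS_ATTRIBUTE, "_:b0"),
    ("_:b0", RDF_TYPE, "cheminf:CHEMINF_000113"),
    ("_:b0", HAS_VALUE, "InChI=1/CH4/h1H4"),
]


def attribute_values(triples, molecule, attribute_type):
    """Return the values of all attributes of `molecule` typed as `attribute_type`."""
    # Step 1: all attribute nodes attached to the molecule.
    attributes = {o for s, p, o in triples if s == molecule and p == HAS_ATTRIBUTE}
    # Step 2: keep only the attribute nodes of the requested kind.
    typed = {
        s for s, p, o in triples
        if s in attributes and p == RDF_TYPE and o == attribute_type
    }
    # Step 3: read the value of each remaining attribute node.
    return [o for s, p, o in triples if s in typed and p == HAS_VALUE]


inchis = attribute_values(triples, ":Methane", "cheminf:CHEMINF_000113")
print(inchis)
```

The extra hop through the attribute node is what makes room for the type, and for anything else we want to say about the attribute.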

Now, just to make the point about versioning: there is a subclass of CHEMINF_000113, CHEMINF_000396, which is an InChI calculated with version 1.04 of the software. A fair question here would be which InChI layers were used(??) ... very well done indeed!
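The practical consequence of the subclass is that a consumer asking for 'any InChI' should also find the version-specific ones. A minimal Python sketch of that idea follows, under stated assumptions: the `SUBCLASS_OF` table only holds the one subclass statement mentioned above, the attribute records are hypothetical, and pairing the standard InChI of methane with CHEMINF_000396 is an assumption made for the illustration.

```python
# Subclass statement from the post: CHEMINF_000396 is a subclass of CHEMINF_000113.
SUBCLASS_OF = {"CHEMINF_000396": "CHEMINF_000113"}


def is_a(cls, ancestor):
    """Return True if `cls` equals `ancestor` or reaches it by following SUBCLASS_OF."""
    seen = set()
    while cls is not None and cls not in seen:
        if cls == ancestor:
            return True
        seen.add(cls)  # guard against accidental cycles in the table
        cls = SUBCLASS_OF.get(cls)
    return False


# Hypothetical attribute records for methane: (attribute class, value).
attributes = [
    ("CHEMINF_000396", "InChI=1S/CH4/h1H4"),  # assumed to be a 1.04 InChI
    ("CHEMINF_000408", "HMDB02714"),           # HMDB identifier, not an InChI
]

# Ask for the general class and get the versioned InChI too.
inchi_values = [value for cls, value in attributes if is_a(cls, "CHEMINF_000113")]
print(inchi_values)
```

In an RDF store the same effect would come from subclass reasoning or a SPARQL property path rather than a hand-written loop.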