Friday, July 31, 2015

WikiPathways and two estrone-x,y-quinones added to Wikidata

WikiPathways does a lot of curation, with a team growing in size. A number of regular jobs are performed weekly by one of a group of some 15-20 curators. On top of that, some curators, e.g. Kristina Haspers, do much more than this weekly task. Since I joined the BiGCaT team of Chris Evelo in Maastricht, I have been looking into the metabolites and other small molecules, and did quite a bit of work to make that information machine readable. See, for example, these open notebook science posts.

This curation is partly supported by tools, e.g. bots and tests. Tests are, among others, run nightly on a Jenkins instance (in various configurations). One of the bots creates this report, which Martina Kutmon recently reminded me of. Starting at the end of that report, I browsed it for unrecognized metabolites (unrecognized for various reasons). My eyes fell on two compounds in the estrogen metabolism pathway, originally created by Pieter Giesbertz: estrone-2,3-quinone and estrone-3,4-quinone (in green):


The website was not showing mappings to other databases for the cross-references from PubChem. A quick check confirmed that HMDB, KEGG and ChEBI did not have these compounds. HMDB does have an entry for one of the compounds, going by the name, but its chemical graph has undefined stereochemistry. That certainly explains why it did not map to the PubChem compound ID. And, indeed, PubChem does have the HMDB entry as a substance, but not linked to a compound. So, I added them to Wikidata: Q20739847 and Q20742851.


Then, when I make the next metabolite ID mapping database for BridgeDb, it will have mappings from the cross-references in WikiPathways for these two compounds to, at the time of writing, ChemSpider, and to the CAS registry number of one of the two. Please also note that Wikidata allowed me to store the information source.
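
As a rough illustration of how such mappings could be pulled out of Wikidata, here is a minimal Python sketch that only builds a SPARQL query for the two items above. It is not how the BridgeDb mapping database is actually generated. The function name and variable names are made up for this example; the property IDs (P662 for PubChem CID, P661 for ChemSpider ID, P231 for CAS Registry Number) and the wd:/wdt: prefixes are the ones the Wikidata Query Service uses.

```python
# Wikidata property IDs for the identifiers mentioned in this post.
IDENTIFIER_PROPERTIES = {
    "pubchem_cid": "P662",
    "chemspider_id": "P661",
    "cas_number": "P231",
}


def build_mapping_query(item_ids):
    """Return a SPARQL query listing identifier mappings for the given items.

    Hypothetical helper for illustration; it only builds the query text
    and does not contact any server.
    """
    values = " ".join(f"wd:{qid}" for qid in item_ids)
    # Each identifier is OPTIONAL, because not every item has every mapping.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?compound wdt:{pid} ?{name} . }}"
        for name, pid in IDENTIFIER_PROPERTIES.items()
    )
    variables = " ".join(f"?{name}" for name in IDENTIFIER_PROPERTIES)
    return (
        f"SELECT ?compound {variables} WHERE {{\n"
        f"  VALUES ?compound {{ {values} }}\n"
        f"{optionals}\n"
        f"}}"
    )


query = build_mapping_query(["Q20739847", "Q20742851"])
print(query)
```

The printed query can be pasted into the Wikidata Query Service; which columns come back filled depends on what has been added to the two items at that moment.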

Thus, for me, Wikidata is the place to add new mappings, and I applaud the work by Andra Waagmeester, Andrew Su, and others to use Wikidata for this kind of purpose. If you agree, you can add your support here.

Wednesday, July 15, 2015

PubChemRDF: semantic web access to PubChem data

Gang Fu and Evan Bolton have blogged about it previously, but their PubChemRDF paper is out now (doi:10.1186/s13321-015-0084-4). It very likely defines the largest collection of RDF triples using the CHEMINF ontology and I congratulate the authors on an increasingly powerful PubChem database.

With this major provider of Linked Open Data for chemistry now published, I should soon see where my Isbjørn stands. The release of this publication is also very timely with respect to the CHEMINF ontology, as last week I finished a transition from Google to GitHub, by moving the important wiki pages, including one about "Where is the CHEMINF ontology used?". I already added Gang's paper. A big thanks and congratulations to the PubChem team, and my sincere thanks for having been able to contribute to this paper.

Sunday, July 12, 2015

CDK Literature #9

[Figure: Visualization of functional groups. Public domain, from Wikipedia.]
In the past 50 years we have been trying to understand why certain classes of compounds show the same behavior. Quantum chemical calculations are still getting cheaper and easier (though I cannot point you to a review of recent advances), but they have not replaced other approaches, as is visible in the number of QSAR/descriptor applications of the CDK.

Functional Group Ontology
Sankar et al. have developed an ontology for functional groups (doi:10.1016/j.jmgm.2013.04.003). One popular thought is that subgroups of atoms are more important than the molecule as a whole. Much of our cheminformatics is based on this idea. And it matches what we experimentally observe. If we add a hydroxyl or an acid group, the molecule becomes more hydrophilic. Semantically encoding this clearly relevant information seems important, though intuitively I would have left this to the cheminformatics tools. This paper and a few cited papers, however, show how far you can take this. It organizes more than 200 functional groups, but I am not sure where the ontology can be downloaded.

Sankar, P., Krief, A., Vijayasarathi, D., Jun. 2013. A conceptual basis to encode and detect organic functional groups in XML. Journal of Molecular Graphics and Modelling 43, 1-10. URL http://dx.doi.org/10.1016/j.jmgm.2013.04.003

Linking biological to chemical similarities
If we step aside from our concept of "functional group", we can also just look at whatever is similar between molecules. Michael Kuhn et al. (of STITCH and SIDER) looked into the role of individual proteins in side effects (doi:10.1038/msb.2013.10). They find that many drug side effects are mediated by a selection of individual proteins. The study uses a drug-target interaction data set, and to reduce the chance of bias due to some compound classes being more extensively studied (more data), they removed compounds that were too similar from the data set, using the CDK's Tanimoto stack.

Kuhn, M., Al Banchaabouchi, M., Campillos, M., Jensen, L. J., Gross, C., Gavin, A. C., Bork, P., Apr. 2013. Systematic identification of proteins that elicit drug side effects. Molecular Systems Biology 9 (1), 663. URL http://dx.doi.org/10.1038/msb.2013.10
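
To make the "remove too similar compounds" step a bit more concrete, here is a minimal Python sketch of a Tanimoto-based filter. It is not the CDK code, nor the exact procedure from the paper: fingerprints are assumed to be plain sets of on-bit indices, the greedy keep-the-first strategy is just one possible choice, and the compound names and the 0.5 threshold are invented for the example.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / union


def remove_near_duplicates(fingerprints, threshold):
    """Greedily keep compounds whose similarity to every kept one is below threshold.

    Compounds are visited in dictionary order, so the first member of a
    cluster of similar compounds is the one that survives.
    """
    kept = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[other]) < threshold for other in kept):
            kept.append(name)
    return kept


# Toy data set (hypothetical names and bits).
toy = {
    "drug_a": {1, 2, 3, 4},
    "drug_b": {1, 2, 3, 5},  # shares 3 of the 5 distinct bits with drug_a
    "drug_c": {7, 8, 9},     # shares nothing with the others
}
print(remove_near_duplicates(toy, threshold=0.5))
```

With these toy fingerprints, drug_b has a Tanimoto coefficient of 3/5 to drug_a and is dropped, while drug_c is kept.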

Drug-induced liver injury
These approaches can also be used to study if there are structural reasons why drug-induced liver injury (DILI) occurs. This was studied in this paper by Zhu et al., where the CDK is used to calculate topological descriptors (doi:10.1002/jat.2879). They compared explanatory models that correlate descriptors with the measured endpoint and a combination with hepatocyte imaging assay technology (HIAT) descriptors. These descriptors capture phenotypes such as nuclei count, nuclei area, and the intensities of reactive oxygen species, tetramethyl rhodamine methyl ester, lipid, and glutathione. The paper doesn't cite any of the CDK papers, so I left a comment with PubMed Commons.

Zhu, X.-W., Sedykh, A., Liu, S.-S., Mar. 2014. Hybrid in silico models for drug-induced liver injury using chemical descriptors and in vitro cell-imaging information. Journal of Applied Toxicology 34 (3), 281-288. URL http://dx.doi.org/10.1002/jat.2879

PubMed Commons: comments, pointers, questions, etc

I could have sworn I had blogged about this already, but cannot find it in my blog archives. If you do not know PubMed Commons yet, check it out! As the banner on the right shows, they're in Pilot mode (yeah, why stick to alpha/beta release tagging), and it has already found several uses, as explained in this blog post. Journal clubs are one of them, which they introduced at the end of last year. The pilot started out with giving access to PubMed authors, but since many of us are, that was never really a reason not to give it a try. Comments on PubMed Commons automatically get picked up by other platforms, like PubPeer, and commentators get a profile page; this is mine.

Like the use cases other people have adopted (see the blog post linked above), I have found a number of use cases of my own:

  1. additional information:
    1. missing citations (1)
    2. where data can be downloaded (1)
  2. where data from that paper was deposited:
    1. paper figures available in WikiPathways (1,2,3,4)
    2. authors uploaded data/figures to FigShare but the paper does not link it (1)
    3. authors uploaded data/figures to DataDryad but the paper does not link it (1)
  3. me too:
    1. CDK can help (1)
  4. commenting (1) and questions (2)
  5. a closed paper was made gold Open Access (1)
  6. the source code behind that paper moved:
    1. from Google Code to GitHub (1)
So, get your account today, and start updating your papers that changed locations. Because we all know the bit rot in website locations in papers. Show PubMed how you would like to improve scientific communication via the publishing platform!

Saturday, July 11, 2015

CDK Literature #8

Tool validation
The first paper this week is a QSAR paper. In fact, it does some interesting benchmarking of a few tools with a data set of about 6000 compounds. It includes looking into the applicability domain, and studies the error of prediction for compounds inside and outside the chemical space defined by the training set. The paper indirectly uses the CDK descriptor calculation corner, by using EPA's T.E.S.T. toolkit (at least one author, Todd Martin, contributed to the CDK).

Callahan, A., Cruz-Toledo, J., Dumontier, M., Apr. 2013. Ontology-Based querying with Bio2RDF's linked open data. Journal of Biomedical Semantics 4 (Suppl 1), S1+. URL http://dx.doi.org/10.1186/2041-1480-4-s1-s1

Tetranortriterpenoid
Arvind et al. study tetranortriterpenoids using a QSAR approach involving CoMFA and the CPSA descriptor (green OA PDF). The latter, a CDK descriptor, is calculated using Bioclipse. The study finds that using compound classes can improve the regression.

Arvind, K., Anand Solomon, K., Rajan, S. S., Apr. 2013. QSAR studies of tetranortriterpenoid: An analysis through CoMFA and CPSA parameters. Letters in Drug Design & Discovery 10 (5), 427-436. URL http://dx.doi.org/10.2174/1570180811310050010

Accurate monoisotopic masses
Another useful application of the CDK is the Java wrapping of the isotope data in the Blue Obelisk Data Repository (BODR). Mareile Niesser et al. use Rajarshi's rcdk package for R to calculate the differences in accurate monoisotopic masses. They do not cite the CDK directly, but do mention it by name in the text.

Niesser, M., Harder, U., Koletzko, B., Peissner, W., Jun. 2013. Quantification of urinary folate catabolites using liquid chromatography–tandem mass spectrometry. Journal of Chromatography B 929, 116-124. URL http://dx.doi.org/10.1016/j.jchromb.2013.04.008
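
For readers who want to see what such a calculation boils down to, here is a minimal Python sketch of an accurate monoisotopic mass calculation and a mass difference. It does not use rcdk, the CDK, or the BODR files: the four isotope masses are hard-coded, the formula parser is a hypothetical helper that only handles simple formulas without brackets or charges, and the example formulas are not taken from the paper.

```python
import re

# Masses (in u) of the most abundant isotope of each element; only the
# elements needed for this example are included.
MONOISOTOPIC_MASS = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.9949146196,
}


def monoisotopic_mass(formula):
    """Sum the monoisotopic masses for a simple formula such as 'C6H12O6'.

    Raises KeyError for elements that are not in the table above.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom, as in the O of H2O.
        total += MONOISOTOPIC_MASS[symbol] * (int(count) if count else 1)
    return total


# Difference in accurate mass between two example formulas.
difference = monoisotopic_mass("C6H12O6") - monoisotopic_mass("H2O")
print(f"{difference:.5f}")
```

A toolkit has the advantage of shipping curated isotope data for all elements, which is what the BODR wrapping mentioned above provides.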