Monday, May 20, 2013

Can I haz thiz PDF, pleaz? #icanhazpdf

Readers of my blog know that I am a promoter of Open Science and have been so for a very long time. One of the main reasons is that Open is simply more efficient: no worrying about formal approval to access something, faster exchange of knowledge, and people just doing their science.

Situation #1
You published your paper and it is appealing to many people. They all start emailing you for a reprint and you end up spending an hour or two every week answering those emails and attaching the PDF to your replies.

Situation #2
Not that you really have time for it, but don't you just wish you had time to print that really interesting paper and put it on the pile of papers to read on your desk? Well, no luck, your library had to cancel the subscription. And IBL (interlibrary loan) is just too expensive an approach (and sometimes slow).

Now, because the majority of my papers are Open Access, I never have Situation #1. In fact, I don't think anyone gets such emails a lot. (Please leave a comment if you get more than one such request in half a year.) Situation #2 is much more common, though with access to two university libraries, the journals must be really obscure for me not to have access. But there are additional options to solve Situation #2 before you make an author run into Situation #1.

The upcoming solution to this (well, one of them, there are others) is using #icanhazpdf. Just use this hashtag on social media to let people know you would really like to read that paper. The screenshot of a Twitter search shows some examples. This is definitely a faster mechanism to get access to reprints.

Now, is this legal? No (unless one of the original authors replies by sending a reprint). Do I recommend it? No. See e.g. this post. Is #icanhazpdf part of Open Science? No. It is a symptom of the limitations in scalability of closed-access publishing. With so many people still dying of things that we are so close to curing, I believe scalability is of critical importance.

Therefore, I would rather solve this literature access issue by publishing in (true) Open Access journals. You should too; it really is not that expensive (e.g. comparable to one ACS visit and often cheaper). But #icanhazpdf is an interesting trend, one that I feel you should be aware of. And I blog this because not everyone seems to know this alternative yet.

BTW, there is another excellent alternative, particularly in the Netherlands: visit a friend at another university in the afternoon/evening and spend the rest of the day in the library of that university to copy (download) those interesting papers. The IBL systems will tell you exactly which journals the other university has. Slower, but it comes with the advantage of giving you an excuse to visit your friends elsewhere.

Friday, May 17, 2013

Angelina Jolie's breasts, SNPs, and Jmol

I had never thought I would ever blog about Angelina Jolie's breasts, but they were in the news this week, which turns out to help me answer a student question for our bioinformatics course. The question was about how to visualize SNPs in 3D protein structures. The link is the protein encoded by BRCA1. Some SNPs in BRCA1 are strongly (cor)related to cancer, as you can read about in OMIM. One 3D protein structure in the PDB is 1JM7, which is a solution structure based on NMR experiments by Brzovic et al. (doi:10.1038/nsb1001-833).

The Sequence tab for this structure provides us the sequences of the two proteins in this structure, one of which is BRCA1. We can add the SNP annotation with the drop down box:

(BTW, not all structures have SNP information, so this drop down box option is not always present!)

It will show that very many SNPs are known. OMIM needs to be checked to see if they are associated with disease. It should look like this:

You can see the "Single Nucleotide" row, with many colored SNP circles. The first is Met1 and the second is Arg7. This post is, however, about how to visualize the SNPs in Jmol. The Jmol applet is linked on the website, on the right side of the same Sequence tab:

If you click the "Display Jmol" option, the Jmol applet should fire up (your browser does require the Java plugin). That will create a small dialog hovering over that corner with the Jmol applet. You can right-click on the background and select the Console option:

That will fire up another dialog window, one that will look like the one below, and in the lower large section we can write commands (yes, some minimal programming, in Jmol script!). For example, we can select the methionine at position 1 (the first SNP from BRCA1 as found earlier):

Where select Met1 is typed, you type that too, and press enter. You have now selected 266 atoms, as will be reported in the top large text area. Then type two further commands and notice the effect in the Jmol applet window. Together this makes a set of three commands:
  1. select Met1
  2. spacefill on
  3. color green
Then, for each SNP you can pick a different color, and easily visualize a set of SNPs. Now, among all the variants it reports missense SNPs for Cys61 and Cys64. So, the following six commands will visualize these two SNPs, with Cys61 in red, and Cys64 in green:
  1. select Cys61
  2. spacefill on
  3. color red
  4. select Cys64
  5. spacefill on
  6. color green
There are some additional zinc ions in the structure, which you can color orange with:
  1. select zinc
  2. color orange
This will create this view in Jmol:

So, you can clearly see that the two cysteines affected by the SNPs are involved in the structure's zinc finger. No wonder those mutations give trouble.
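
Because every SNP gets the same three commands, the command list can also be generated rather than typed. The following is only a minimal Groovy sketch of that idea: the helper name jmolCommandsFor is made up for this example, and the residues and colors are simply the ones used above. The printed lines can be pasted into the Jmol console.

```groovy
// Hypothetical helper (not part of Jmol): turn a residue -> color map
// into the select / spacefill / color commands used in this post.
List<String> jmolCommandsFor(Map<String, String> snpColors) {
    def commands = []
    snpColors.each { residue, color ->
        commands << "select ${residue}".toString()
        commands << "spacefill on"
        commands << "color ${color}".toString()
    }
    return commands
}

def script = jmolCommandsFor([Cys61: "red", Cys64: "green"])
// the zinc ions only get a color, as in the commands above
script += ["select zinc", "color orange"]
println script.join("\n")
```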

Have fun!

Brzovic, P. S., Rajagopal, P., Hoyt, D. W., King, M.-C. & Klevit, R. E. Structure of a BRCA1–BARD1 heterodimeric RING–RING complex. Nature Structural & Molecular Biology 8, 833-837 (2001).

Sunday, May 12, 2013

Book Review: "Instant Cytoscape Complex Network Analysis How-to"

Instant Cytoscape Complex Network Analysis How-to (ISBN:978-1-84951-980-9) is a book by Gang Su and reviewed by John Morris, which currently costs less than 10 euro. It is a thin book, with a mere 76 pages of which 58 pages are really informational. Of those pages, a lot of space is taken up by screenshots, which you may or may not like.

In the ten chapters of the book, it does give a concise introduction to Cytoscape. It is clearly written and easy to digest. The chapter titles read as follows, giving you an idea of the topics it covers:

Unboxing Cytoscape (easy), Loading up Cytoscape! (easy), Touch Cytoscape with some style, A network with many forms, Finding needles in a haystack, Tuning up Cytoscape with gadgets, Where are the clusters? (advanced), Additional visualization capabilities, When local is not enough, Export, save, and call it a day!

The qualification of easy and advanced looks somewhat arbitrary. I would guess that for many scholars, installing the software may in fact be more difficult than pushing some buttons to do a cluster analysis. Otherwise, the book is targeted at people with basic knowledge of programming, network analysis, and Cytoscape usage. The requirements do not list that MS-Windows is needed, but the book assumes that is the case. That puts OS/X and Linux users at a disadvantage, particularly with the installation chapter.

The publisher (Packt) provides the book in various formats, and I mostly read the PDF version. I did have a quick look at the EPUB version. Compared to PDF, this format has the advantage that the text scales to fit the device, whereas a PDF page is always the same. Unfortunately, this version lost some layout aspects, leading to confusing text, badly shown URLs, and only partly shown screenshots in the Aldiko Book Reader. Because whitespace is one of the important things, I would recommend the PDF. However, you can download the book in multiple formats, making this a smaller issue.

Each chapter consists of one or more sections called “Getting started”, “How to do it...”, “How it works...”, “There’s more...”, “See also”. In practice, however, you run through a chapter in one go, perhaps also because of the (limited) length of the chapters. I would very much recommend extending more of the "There's more..." sections from small hints to full texts, giving the book more mass and doing the author more justice.

In fact, compared to tutorials available online, the book mostly provides the convenience of having the information in a nice narrative. However, one thing a book should also provide is an index, and that is missing here. I know I can use ^F too, but that is not entirely the same thing. An index has more power in guiding the user to the right section of the book. That is, when I browse an index, I see other words the author found important. It is like a table of contents, and chapter titles you can also find with ^F after all.

The examples are easy to follow, and assuming Cytoscape 2.8 does not change a lot within one series, we can assume the screenshots and instructions will not change. That is always the risk with a book like this. I hope the book will see frequent extensions and updates.

There were a few things that I had trouble with. One thing is the lack of distinction between Open Source and free of cost. Those are not the same thing. So, rather than writing "It is open source (free of cost)..." I would very much prefer it to read "It is open source and free of cost...". Also, a categorization of MS-Excel as a text editor is awkward, and I hope the next edition will also use OpenOffice/LibreOffice as replacement which is way more in line with the "It is open source..." introduction in the first place.

Content-wise, I think the book is fairly solid and it surely covers all the basics one may want to do. A few plugins are mentioned, though they are a bit hidden. I would very much like to see a few plugins appear as separate (advanced?) chapters, as showcases for how all that basic functionality is put to use. This is in parallel with the "There's more..." sections, which could refer to such chapters. A bit of work, but it would make the book a lot more valuable and go beyond the basic tutorial nature it has now.

All in all, I think the book has a very reasonable price/content ratio and, combined with the fact that some environments live by if-it-is-not-in-the-library-it-does-not-exist, I recommend keeping an eye out for this book. Perhaps wait for the next edition, which hopefully adds at least an index, but it is an easy and cheap way to support and encourage the Cytoscape community.

Oh, and while composing the below CiteULike widget, I realized that the book lacks references to other literature. For example, there is no reference to the Cytoscape articles. That would be another way to add value over the online tutorials.

Su, G. Instant Cytoscape Complex Network Analysis How-to (2013). ISBN:978-1-84951-980-9.

Saturday, May 11, 2013

Groovy Cheminformatics with the CDK - 8th edition

Update: I have uploaded the full ToC to FigShare.

It is still not what I want to release as a final version, but fewer and fewer bits are what I consider missing. And with a whopping 228 pages I am happy I have a solid build system where I can just hack in new code and update the CDK version easily. In fact, this version adds 10 new scripts since the 7th edition. Some scripts take more time than others, and four of these are solutions I wrote for the Chemistry Toolkit Rosetta (CTR) by Andrew Dalke (see also this blog post).

There are two versions:
  1. paperback, for $45
  2. eBook, for $35, a PDF version
Still, this revision does not add that much new content:
  • Section 2.2.2. Bond stereochemistry (just refers to Chapter 3)
  • Chapter 3. Stereochemistry
  • Chapter 19. Chemistry Toolkit Rosetta
  • Section 21.3.2. Grabbing dependencies (for Groovy)
The start of the stereochemistry chapter is interesting, explaining wedge bonds and the ITetrahedralChirality interface, but obviously the IDoubleBondStereochemistry interface will have to follow soon.

The first words of Chapter 19 look like what I tumblred earlier today (yes, I have an introductory chapter upcoming but not ready yet, causing the chapter numbering to differ):

The CTR chapter has solutions of four of the challenges listed in the wiki. I have not yet managed to create solutions for all 18 of them, as intended.

I still like to stress that the book is for convenience: it only groups together information that you can find elsewhere as well. For example, the solutions to the four CTR problems can be found in that wiki too. The solution is there, and the explanation is in the book.

CDK 1.4.18: the changes, the authors, and the reviewers

The development of CDK 1.4 is really slowing down, and the CDK 1.5.2 release by John shows how far the development in the master branch has picked up. (BTW, you do know Planet CDK, right? The perfect way to keep up with CDK development!)

This release does not have a killer fix, but it does contain some things of interest: a patch that updates the InChI-to-structure algorithm to also set atomic numbers for atoms, a patch that ensures the "C" and "C" are isomorphic (one of those use cases for which the algorithm was never written to work), hydrogen isotope reading from MDL molfiles, and passing of uncertain double bond stereochemistry to the InChI generator.

In all cases, users of CDK 1.4.x versions are recommended to upgrade to this latest stable version.

The changes
  • Added a missing dependency 6ee0f0a
  • setting unspecified bonds when generating and InChI 631281a
  • unit test for bug1295 724366e
  • Bumped the copyright year to 2013 22ff5a9
  • Unit test to verify that wedge bond information is properly read ea8a316
  • Unit tests for the expected behavior of bug #1294 c3a6452
  • Removed redundant import (not found in class path) bfb2b1b
  • Removed unused imports bc6e1a6
  • Added citation for permutation method 1edc2d5
  • Cleaned up documentation, reworded and added example usage. Update tutorial link to a site which has backed up the now missing web page. d76cac6
  • Fixed proton isotope perception. 421345b
  • replaced cast to implementation with a cast to the interface c273930
  • Small patch for bug 3551478 71f40a9
  • Added additional test that non matching symbols mismatch c532191
  • non-null atomic numbers when reading from InChi [bug:1293] 3117f4c
  • Test to ensure that the atom type name is returned with IAtomType.toString() 471a20c
The authors

As you can see below, this release has a patch by a new contributor, Magda Oprian.

8  Egon Willighagen
8  John May
1  Magda Oprian
1  Stephan Beisken

The reviewers

7  Egon Willighagen 
4  John May 

Friday, May 10, 2013

Linking WikiPathways to binding affinity data

WikiPathways is a long established project (doi:10.1371/journal.pbio.0060184), and while in beta provides a rich resource on various biological processes, in various species. Open PHACTS (Andra particularly) is working on integrating this data. Obvious links include the proteins listed in pathways, which are targeted by compounds in, for example, ChEMBL, one of the resources in the Open PHACTS cache.

However, drugs and drug-like molecules are not abundant in WikiPathways. Some are, like morphine in this metabolism pathway:

However, it seems more common that drug-like compounds are found in pathways indirectly. For example, angiotensin-converting enzyme inhibitors (ACE inhibitor, in red below) are mentioned in pathways:

An example ACE inhibitor is captopril. The current BridgeDB-based identifier mapping (doi:10.1186/1471-2105-11-5) will not link the ChEBI identifier for ACE inhibitors (ChEBI:3380) to the matching ChEMBL entry (CHEMBL1560). Thus, we cannot make this link directly.

However, we're in a semantic world, and ontologies come to the rescue, in particular, the ChEBI ontology (doi:10.1093/nar/gkm791). The following screenshot shows how ChEBI says captopril is an ACE inhibitor:

So, if we can only make a claim that any of those listed specific drugs is an ACE inhibitor, then we can use that information to link captopril to the WP557 pathway.

Christian Brenninkmeijer and Alasdair Grey have introduced the concept of lenses in Open PHACTS (see paper below) that define when two things are considered the same. For example, under certain conditions protonation states are the same, while under others they are not. By turning lenses on and off, you can define what is the same. Lenses can be grouped, and we could have, for example, a "human cell" lens that groups all lenses that detail which chemicals are "biologically" identical, for example because they are readily interconverted under biological (cell) conditions.

So, ChEBI defines a (one-directional) lens that defines that captopril is the same as "ACE inhibitor".
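
To make the one-directional nature concrete, here is a toy Groovy sketch. It is not Open PHACTS or BridgeDB code: the two maps and the helper pathwaysFor are hypothetical, and the data is just the captopril example from this post. A compound is linked to the pathways that mention it directly or that mention a class ChEBI says it belongs to, but a class is never expanded back into its members.

```groovy
// Toy data from the example in this post; real data would come from
// the ChEBI ontology and from WikiPathways.
Map<String, List<String>> chebiClasses = [captopril: ["ACE inhibitor"]]
Map<String, List<String>> pathwayMentions = ["ACE inhibitor": ["WP557"]]

// Hypothetical helper: pathways that mention the compound itself
// or one of the classes it is asserted to belong to.
Set<String> pathwaysFor(String compound,
                        Map<String, List<String>> classes,
                        Map<String, List<String>> mentions) {
    def terms = [compound] + (classes[compound] ?: [])
    terms.collectMany { mentions[it] ?: [] } as Set
}

println pathwaysFor("captopril", chebiClasses, pathwayMentions)
println pathwaysFor("morphine", chebiClasses, pathwayMentions)
```

Looking up "ACE inhibitor" itself still returns WP557, but nothing in the sketch turns "ACE inhibitor" into captopril, which is the sense in which the lens only works in one direction.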

Pico, A. R. et al. WikiPathways: Pathway editing for the people. PLoS Biol 6, e184 (2008).
van Iersel, M. et al. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11, 5 (2010).
Degtyarenko, K. et al. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research 36, D344-D350 (2008).
Brenninkmeijer, C. et al. Scientific lenses over linked data: An approach to support task specific views of the data. A vision. In Linked Science 2012 - Tackling Big Data (2012).

Thursday, May 09, 2013

New Paper: "The ChEMBL database as linked open data"

Update: Mark wrote up a blog post on the RDF from the ChEMBL team itself.

Yesterday, the paper "The ChEMBL database as linked open data" (doi:10.1186/1758-2946-5-23) by Andra Waagmeester (@andrawaag), Ola Spjuth (@ola_spjuth), Peter Ansell (@p_ansell), Antony Williams (@chemconnector), Valery Tkachenko, Janna Hastings, Bin Chen (@binchenindiana), David J Wild (@davidjohnwild), and me appeared in the OA JChemInf journal.

I am also indebted to the ChEMBL team (@chembl) for both providing such valuable data under a liberal Open Access license and their critical reading of the manuscript! Additionally, I would like to stress that the ChEMBL team will create their own RDF version of ChEMBL and that this paper is not describing the version they will release.

BTW, the source of the paper is available from GitHub. And the (original) scripts to create RDF from the MySQL dump of ChEMBL are also on GitHub.

This paper outlines the RDF as it has evolved from various earlier projects. The above diagram visualizes the basic structure (red), various Linked Data resources linked to (blue) and illustrates how various ontologies are used, such as the CHEMINF, BIBO, and CiTO ontologies.

Additionally, various applications and links developed by the co-authors are described. For example, Peter worked on the use in Bio2RDF and Bin and David on Chem2Bio2RDF. Andra developed an extension for his (#altmetric) CitedIn resource, giving credit to a paper when data in it is extracted into ChEMBL. Ola, Valery, and Antony developed a Bioclipse Decision Support extension, which supports a nearest neighbor search in ChEMBL using ChemSpider. Of course, Ola also hosts the SPARQL end point, of which you can monitor the uptime at the also cool service:

(Yes, I think I have all the cool buzzwords covered in this paper. Sadly, marketing is needed nowadays as a scientist. Where is the time that you could rant on page after page in all your domain specific jargon, not having to worry if your reader would understand it immediately, or without a university degree...)

What this paper does not describe, is all the things I did with ChEMBL-RDF in the Open PHACTS project (@Open_PHACTS), which includes the use of QUDT and the jQUDT library for unit normalization outlined in this document and the use of VoID for link sets as described in this document.

Willighagen, E., Waagmeester, A., Spjuth, O., Ansell, P., Williams, A., Tkachenko, V., Hastings, J., Chen, B. & Wild, D. The ChEMBL database as linked open data. Journal of Cheminformatics 5, 23 (2013).

Monday, May 06, 2013

BridgeDB identity mapping file for metabolites

Based on HMDB 3.0 (January data; see the reference at the end), I have updated the identity mapping file for metabolites for BridgeDB. It can be downloaded here as pre-release. Here I document how I created this file. I have placed the script online at GitHub so that anyone can look at the code and make custom BridgeDB Derby files too. (For example, for large CAS registry number mapping files.)

The code is some 100 lines long. The input .zip is processed by iterating over the containing XML files:

zipFile.entries().each { entry ->
   if (!entry.isDirectory()) {

     inputStream = zipFile.getInputStream(entry)
     def rootNode = new XmlSlurper().parse(inputStream)

     String rootid = rootNode.accession.toString()
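
The loop above is an excerpt. A self-contained version of the same pattern might look like the sketch below, which first writes a tiny stand-in zip file so it can run without the HMDB download. The sample XML content and the file names are made up for illustration; only the accession element and the ZipFile/XmlSlurper calls come from the excerpt. It assumes Groovy 3 or later for the groovy.xml.XmlSlurper import.

```groovy
import groovy.xml.XmlSlurper   // groovy.util.XmlSlurper on Groovy 2.x
import java.util.zip.ZipEntry
import java.util.zip.ZipFile
import java.util.zip.ZipOutputStream

// Build a tiny stand-in for the HMDB download (made-up content)
File tmp = File.createTempFile("hmdb_sample", ".zip")
tmp.deleteOnExit()
new ZipOutputStream(new FileOutputStream(tmp)).withCloseable { zos ->
    zos.putNextEntry(new ZipEntry("HMDB00001.xml"))
    zos.write("<metabolite><accession>HMDB00001</accession></metabolite>".getBytes("UTF-8"))
    zos.closeEntry()
}

// Collect the accession of every XML file in the archive
List<String> accessions = []
ZipFile zipFile = new ZipFile(tmp)
try {
    zipFile.entries().each { entry ->
        if (!entry.isDirectory()) {
            zipFile.getInputStream(entry).withCloseable { inputStream ->
                def rootNode = new XmlSlurper().parse(inputStream)
                accessions << rootNode.accession.toString()
            }
        }
    }
} finally {
    zipFile.close()
}
println accessions
```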

For each file it creates an entry in the BridgeDB data:

  Xref ref = new Xref(rootid, BioDataSource.HMDB);
  error = database.addGene(ref);

  // add the synonyms
    database, ref, "Symbol",
    database, ref, "Synonym",
    database, ref, "Synonym",

  // add the SMILES, InChIKey, etc
    database, ref, "InChIKey",
    database, ref, "SMILES",
    database, ref, "BrutoFormula",
    database, ref, "Taxonomy Parent",

These are some basic properties, and adding the mappings looks like:

  // add external identifiers
    database, ref,
    rootNode.accession.toString(), BioDataSource.HMDB
    database, ref, 
    rootNode.cas_registry_number.toString(), casDS
    database, ref, 
    rootNode.inchi.toString(), inchiDS
    database, ref, 
    rootNode.chemspider_id.toString(), chemspiderDS
    database, ref, 
    rootNode.pubchem_compound_id.toString(), pubchemDS
    database, ref, 
    rootNode.chebi_id.toString(), chebiDS
    database, ref, 
    rootNode.kegg_id.toString(), keggDS
    database, ref, 
    rootNode.wikipedia.toString(), wikipediaDS
    database, ref, 
    rootNode.drugbank_id.toString(), drugbankDS
    database, ref, 
    rootNode.nugowiki.toString(), nugoDS

You can thus see that this new metabolites identity mapping file has mappings to various data sources, including ChemSpider, PubChem, ChEBI, KEGG, Wikipedia, DrugBank, and a few others. All these mappings are provided by HMDB 3.0.

The Derby database file is the format that PathVisio can read natively. This is the magic that is used in the script to create and finalize (aka save) this file:

GdbConstruct database = GdbConstructImpl3.createInstance(
  "hmdb_metabolites", new DataDerby(),


String dateStr = new SimpleDateFormat("yyyyMMdd").
  format(new Date());
database.setInfo("BUILDDATE", dateStr);
database.setInfo("DATASOURCENAME", "HMDB3");
  "hmdb_metabolites_" + dateStr
database.setInfo("DATATYPE", "Metabolite");
database.setInfo("SERIES", "standard_metabolite");

// process the zip file


There are a few more details, but nothing not exposed by the full script.
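
The date stamp used in the metadata can be checked in isolation. This small sketch uses a fixed example date (an assumption made here so the result is predictable), whereas the script itself uses the current date:

```groovy
import java.text.SimpleDateFormat

// Fixed example date so the result is predictable;
// the script above uses new Date() instead.
Date exampleDate = new GregorianCalendar(2013, Calendar.MAY, 6).time
String dateStr = new SimpleDateFormat("yyyyMMdd").format(exampleDate)
println "hmdb_metabolites_" + dateStr   // prints hmdb_metabolites_20130506
```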

Wishart, D. S. et al. Nucleic Acids Research 2013, 41, D801-D807.