Monday, May 25, 2015

Bioclipse 2.6.2 with recent hacks #3: using functionality from the OWLAPI

Update: if you had problems installing this feature, please try again. Two annoying issues have been fixed now.

Third in this series is this post about the Bioclipse plugin I wrote for the OWLAPI library. I wrote this manager to learn how the OWLAPI works. The OWLAPI feature is available from the same update site as the Linked Data Fragments, so you can just follow the steps outlined here (if you have not already).

Using the manager
The manager has various methods, for example, for loading an OWL ontology:

ontology = owlapi.load(
  "/eNanoMapper/enanomapper.owl", null
)

If your ontology imports other ontologies, you may first need to tell the OWLAPI where to find those, by defining mappings. For example, I could do the following before making the above call:

mapper = null; // initially no mapper
mapper = owlapi.addMapping(mapper,
mapper = owlapi.addMapping(mapper,
  "" + 

I can list which ontologies have been imported with:

imported = owlapi.getImportedOntologies(ontology)
for (var i = 0; i < imported.size(); i++) {
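
A minimal sketch of such a listing loop, written as a helper so it can also be tried outside Bioclipse. The function name describeImports is hypothetical, and the call imported.get(i) is an assumption: it presumes the manager returns a Java list, which the imported.size() call above suggests but does not confirm.

```javascript
// Sketch only: collects a printable label for every imported ontology.
// `owlapi` is the Bioclipse manager and `ontology` the result of owlapi.load().
// ASSUMPTION: getImportedOntologies() returns a java.util.List-like object,
// so that size() and get(i) are available.
function describeImports(owlapi, ontology) {
  var imported = owlapi.getImportedOntologies(ontology);
  var labels = [];
  for (var i = 0; i < imported.size(); i++) {
    // String(...) falls back to the object's own text representation.
    labels.push(String(imported.get(i)));
  }
  return labels;
}
```

In a Bioclipse script this would be called as describeImports(owlapi, ontology), after which each label can be printed with whatever output call your console offers.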

When the ontology is successfully loaded, I can list the classes and various types of properties:

classes = owlapi.getClasses(ontology)
annotProps = owlapi.getAnnotationProperties(ontology)
declaredProps = owlapi.getPropertyDeclarationAxioms(ontology)

Some further functionality likely still needs adding, and I would love to hear which methods you would like to see added.

Tuesday, May 19, 2015

New: DOI hyperlinks in the CDK JavaDoc

Apparently I never extended the cdk.cite JavaDoc Taglet to use DOIs from the bibliographic database to create hyperlinks in the JavaDoc. But fear no more! I have submitted a simple patch today to add these to the JavaDoc, and I assume it will be part of the next CDK release from the master branch.
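
To illustrate the idea, here is a minimal sketch of turning a DOI into a hyperlink. It is not the cdk.cite taglet itself (which is Java code in the CDK); the function names doiToHyperlink and escapeHtml are hypothetical, and the use of the https://doi.org/ resolver and the simple DOI pattern are my assumptions.

```javascript
// Escape the few characters that are special in HTML text and attributes.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: returns an HTML link for a DOI, or null when the
// input does not look like a DOI.
function doiToHyperlink(doi) {
  // Strip surrounding whitespace and an optional "doi:" prefix.
  var bare = String(doi).trim().replace(/^doi:\s*/i, "");
  // ASSUMPTION: a DOI is "10.", a 4-9 digit registrant code, a slash, a suffix.
  if (!/^10\.\d{4,9}\/\S+$/.test(bare)) {
    return null;
  }
  // encodeURI keeps the slash but escapes quotes and angle brackets.
  var href = "https://doi.org/" + encodeURI(bare);
  return '<a href="' + href + '">doi:' + escapeHtml(bare) + "</a>";
}
```

Called with "doi:10.1021/ci960013p", the helper returns a link to https://doi.org/10.1021/ci960013p; entries without a DOI yield null, so the calling code can fall back to plain text.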

Of course, many papers in this bibliographic database (i.e. this cheminf.bibx file) do not have DOIs yet :/

Of course, you can help out here! The only thing you need is a web browser and some knowledge of how to look up DOIs for papers. Just check this blog post (from Step 4 onwards) and line 260 in cheminf.bibx to see what a DOI addition to a BibTeXML entry should look like.

Sunday, May 17, 2015

Re: "Thank you for sharing"

CC-BY 4.0, from Roche et al. via Wikipedia.
Nature wrote a piece on data sharing (doi:10.1038/520585a). It remains a tricky area to write about, particularly those terms like public access. Researchers are still a bit shy about sharing data, in some fields more than in others. And for valid reasons. Data sharing is a choice: it is something you do to get something in return. The return you get on your investment can vary, for example:
  1. goodwill (e.g. from your employer or funder)
  2. others will donate data to the same resource to benefit your research (research needs some critical mass)
  3. it can be enjoyable
  4. the repository where you contribute your data adds value (e.g. by linking to other resources)
  5. others can find your data more easily, leading to more citations of your publications
  6. after using Open Data for yourself (e.g., you like to return a favor
I probably miss a few. On the other hand, you may miss out on other opportunities. For example, your data could have been part of an IP-based business model, or you could have been the only one able to use that data to answer questions.

As said, there are many good and valid reasons for either option. It is an option, it is a choice.

The Nature News article has this lead that misled me:
    Initiatives to make genetic and medical data publicly available could improve diagnostics — but they lose value if they do not share with other projects.
The article, however, then discusses a few mechanisms used for data sharing, but I could not spot one that had anything to do with "publicly available". So, I left this comment with the editorial and with PubMed Commons:
    Like Open Access, "sharing" is a meaningless term if it is not linked to meaningful rights. The problems outlined in this paper result from the fact that there may be a wish to share data but only if it allows you to take back the data. Private, custom data licenses do just that. There is nothing wrong with this kind of sharing, but it must not be confused with Open Data. It must not be confounded with terms like "publicly available", because if it needs a signature, it's not publicly available. That makes the lead of this article quite misleading.
    For public or open data, three basic rights are part of the social agreement between the data owner (yes, fact in many countries; database rights, etc) and data user. These rights are: 1. make a copy, 2. make modifications, and 3. reshare (under the same conditions). By using a license (or waiver) that gives these rights automatically to the receiver, there is no need for signatures. It also allows for anyone to make the mappings that are required to convert one format into another.
BTW, the image I used in this post is from a paper by Roche et al. from about a year ago (doi:10.1371/journal.pbio.1001779). I have not read that one yet, but it looks like an interesting read too, just like the Nature editorial.

Apr. 2015. Thank you for sharing. Nature 520 (7549), 585. URL

Roche, D. G., Lanfear, R., Binning, S. A., Haff, T. M., Schwanz, L. E., Cain, K. E., Kokko, H., Jennions, M. D., Kruuk, L. E. B., Jan. 2014. Troubleshooting public data archiving: Suggestions to increase participation. PLoS Biol 12 (1), e1001779+. URL

Saturday, May 16, 2015

Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

Update 2015-06-04: the authentication with Google Drive has changed; I need to update the code and am afraid I missed the point, so the code below is not working right now :(

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

    "your.account", "16charpassword"
    "ORCID @ Maastricht University"
  data = google.loadWorksheet(
    "ORCID @ Maastricht University",
    "with works"

Because that's what this functionality does: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure if Bioclipse is the only tool that embeds the Google client code to access these services, and can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials
The first call to the google manager requires your login details. But don't use your regular password: you need an application password. This specific, sixteen-character password needs to be manually created using your web browser, following this link. Create a new App password ("Other (Customized name)") and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
The first thing you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outlined in my previous post up to Step 1. The update site, obviously, is different, and in Step 2 in that post you should use:

  1. Name: Open Notebook Science Update Site
  2. Location:
Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:

That's it, have fun!

Oh, and this hack is not so recent. The first version of the plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008 when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean-Claude Bradley. If you're interested, check the repository history and this book chapter.

Friday, May 15, 2015

CDK Literature #6

Originally a series I started in the CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series, the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track of how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm (doi:10.1021/ci960013p) implementation which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introducing a new API which I covered recently. Also check John's blog.

May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+. URL

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version of their Screening Assistant tool for mining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have changed much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+. URL

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278. URL
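
To make the idea of fingerprint-based enrichment a bit more concrete, here is a minimal sketch that ranks a small library by Tanimoto similarity to a known active. This is neither the eROCE metric nor the group fusion method from the paper, and it does not use the CDK or PaDEL: fingerprints are simply assumed to be arrays of set-bit indices, and the names tanimoto and rankBySimilarity are hypothetical.

```javascript
// Tanimoto coefficient of two fingerprints given as arrays of set-bit indices:
// (number of shared bits) / (number of bits set in either fingerprint).
function tanimoto(bitsA, bitsB) {
  var a = new Set(bitsA);
  var b = new Set(bitsB);
  var shared = 0;
  a.forEach(function (bit) {
    if (b.has(bit)) shared++;
  });
  var union = a.size + b.size - shared;
  // Two empty fingerprints are treated as having similarity 0.
  return union === 0 ? 0 : shared / union;
}

// Rank library compounds by similarity to a query active, most similar first.
// Each library entry is assumed to look like { id: "...", bits: [...] }.
function rankBySimilarity(queryBits, library) {
  return library
    .map(function (compound) {
      return { id: compound.id, score: tanimoto(queryBits, compound.bits) };
    })
    .sort(function (x, y) {
      return y.score - x.score;
    });
}
```

If the actives in the library end up near the top of such a ranking more often than chance would predict, the fingerprint is enriching for actives; metrics like the one in the paper quantify how well that works.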

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in these excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add it for fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 in the aforementioned presentation from Tony.

Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+. URL

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. It starts with the assumption that compounds of similar structure behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but has not been shown to be incorrect either. This paper proposes a new approach using Pareto points and used the CDK to calculate logP values for compounds. However, I cannot find which algorithm it is using to do so.

Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+. URL

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): it exposes the CDK and is still maintained.

Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094. URL

Groovy Cheminformatics with the CDK - 11th edition

It's been a while since I blogged about a release of my "Groovy Cheminformatics with the CDK" book, but not too long ago I made another release, 1.5.10-0. This was also the first one with white paper, and updated for the latest CDK development release.

There are two versions (and always check the special deals, e.g. today you can use UNPLUG10 to get an additional 10% off the below prices):
  1. paperback, for $25
  2. eBook, for $15, a PDF version
Compared to the 8th edition, this version offers this new material:
  • Chapter 1: Cheminformatics
  • Section 13.3: Ring counts (though it is not updated for John's ring perception work, doi:10.1186/1758-2946-6-3)
  • Section 14.1: Element and Isotope information
  • Section 16.4: SMARTS matching
  • Chapter 20: four more Chemistry Toolkit Rosetta solutions
  • Section 24.1: CDK 1.4 to 1.6 (see also this series)
This version of the book has 204 Groovy scripts, all of which have been tested against CDK 1.5.10.

Sunday, May 03, 2015

Pathways as summaries: Nature Review Disease Primers and Open Source Malaria

A P. falciparum isoprenoid biosynthesis pathway (WP2918).
Event 1
The Nature Publishing Group (NPG) has launched a new journal, which you probably did not miss. The founding editorial, titled From mechanisms to management (doi:10.1038/nrdp.2015.1), sets out the goal of the journal. Very noble and very needed, indeed! They write:
Each Primer article includes the same major sections: epidemiology, mechanisms, pathophysiology, diagnosis, screening, prevention, management and patient quality of life.
They complement the articles with PrimeViews and even animations:
Together, we hope that the Primer and PrimeView will provide readily accessible introductions to each topic for readers from all disciplines.
Very exciting! The mechanistic diagrams in the papers are perhaps even better, but, it wouldn't be a proper chem-bla-ics post had I not something to bitch about. And I do; read on.

Event 2
This weekend Christopher Southan asked if the Plasmodium falciparum pathway for isoprenoid biosynthesis was to be found in WikiPathways (related to this blog post about MMV008138). It was not at the time. But other resources did have it, including literature (of course), Wikipedia, and the excellent Malaria Parasite Metabolic Pathways resource.

In related news, about a year ago, Patricia Zaandam worked in our group on pathway analysis related to malaria. At the time, we selected human data from ArrayExpress because of the abundance of human pathways in WikiPathways (>600 now, of which the Curated Collection and Reactome Approved are subsets). So, on a weekend where I really needed a break from working and with some time free, I decided to make that pathway. One of the first observations was that you cannot create Plasmodium pathways on WikiPathways yet. Second, we do not have a BridgeDb gene identifier mapping database for this organism either. But that is not needed for drawing the pathway.

So, I am digitizing the pathway from the various sources that I can find, added MMV008138, and will probably add more malaria drugs and drug leads along the way. The idea of Patricia's project last year was indeed about possible drug targets. This resulted in this current outcome (with MMV008138 highlighted in red):

The new NPG journal realized we need high quality summaries, and they are correct. This is why the periodic table of elements has been so useful, and the purpose of physical laws expressed as mathematical equations: it puts emphasis on what we think matters. This is also why I believe WikiPathways is so important.

But that's about where the parallel between WikiPathways and NatRevDiseasePrimers ends. The goal of WikiPathways is not just to summarize the knowledge, but to make it manageable. We are talking about data management here. I don't care that much about nice graphics; if we really want to move science and industry forward, then we cannot hide behind a knowledge publishing system that doesn't scale and that doesn't integrate. That is not the kind of management we need.

New readers of my blog - welcome! - can browse my past writings to read what the publishing industry should have done. I have explored many different solutions, and only few of them are being picked up. The Nature Publishing Group has repeatedly experimented with new technologies to make the flood of knowledge manageable, and I find it rather disappointing that this editorial does not manage to go beyond nice graphics. I hope the journal will quickly pick up speed, and add the missing machine readability and APIs. Because a new journal is for years, and we really cannot wait another 15 years.

I am not claiming this new journal is not useful, but it could have been so much more.