Pages

Tuesday, May 19, 2015

New: DOI hyperlinks in the CDK JavaDoc

Apparently I never extended the cdk.cite JavaDoc Taglet to use DOIs from the bibliographic database to create hyperlinks in the JavaDoc. But fear no more! I have submitted a simple patch today to add these to the JavaDoc, and I assume it will be part of the next CDK release from the master branch.

Of course, many papers in this bibliographic database (i.e. this cheminf.bibx file) do not have DOIs for all papers :/

Of course, you can help out here! The only thing you need is a web browser and some knowledge how to look up DOIs for papers. Just check this blog post (from Step 4 onwards) and line 260 in cheminf.bibx to see how a DOI addition to a BibTeXML entry should look like.

Sunday, May 17, 2015

Re: "Thank you for sharing"

CC-BY 4.0, from Roche et al.via Wikipedia.
Nature wrote a piece on data sharing (doi:10.1038/520585a). It remains a tricky area to write about, particularly those terms like public access. Researchers are still a bit shy in sharing data, in some fields more than in others. And for valid reasons. Data sharing is a choice, it is something you do to get something in return. The return you get on your investment can vary, for example:
  1. goodwill (e.g. from your employer or funder)
  2. others will donate data to the same resource to benefit your research (a research needs some critical mass)
  3. it can be enjoyable
  4. the repository where you contribute your data adds value (e.g. by linking to other resources)
  5. others can find your data more easily, leading to more citations of your publications
  6. after using Open Data for yourself (e.g. pdb.org), you like to return a favor
I probably miss a few. On the other hand, you may miss out on other opportunities. For example, your data could have been part of an IP-based business model. For example, you are the only one to be able to use that data to solve/answer questions.

As said, there are many good and valid reasons for either option. It is an option, it is a choice.

The Nature News article has this lead that misled me:
    Initiatives to make genetic and medical data publicly available could improve diagnostics — but they lose value if they do not share with other projects.
The article, however, then discusses a few mechanisms use for data sharing, but I could not spot one that had anything to do with "publicly available". So, I left this comment with the editorial and with PubMed Commons:
    Like Open Access, "sharing" is a meaningless term if it is not linked to meaningful rights. The problems outlined in this paper result from the fact that their may be a wish to share data but only if it allows you to take back the data. Private, custom data licenses do just that. There is nothing wrong with this kind of sharing, but it must not be confused with Open Data. It must not be confounded with terms like "publicly available", because if it needs a signature, it's not publicly available. That makes the lead of this article quite misleading.
    For public or open data, three basic rights are part of the social agreement between the data owner (yes, fact in many countries; database rights, etc) and data user. These rights are: 1. make a copy, 2. make modifications, and 3. reshare (under the same conditions). By using a license (or waiver) that gives this rights automatically to the receiver, then there is no need for signatures. It also allows for anyone to make the mappings that are required to convert one format into another.
BTW, the image I used in this post is from a paper from Roche et al. of about a year ago (doi:10.1371/journal.pbio.1001779). I have not read that one yet, but looks like an interesting read too, just like the Nature editorial.

, Apr. 2015. Thank you for sharing. Nature 520 (7549), 585. URL http://dx.doi.org/10.1038/520585a

Roche, D. G., Lanfear, R., Binning, S. A., Haff, T. M., Schwanz, L. E., Cain, K. E., Kokko, H., Jennions, M. D., Kruuk, L. E. B., Jan. 2014. Troubleshooting public data archiving: Suggestions to increase participation. PLoS Biol 12 (1), e1001779+. URL http://dx.doi.org/10.1371/journal.pbio.1001779

Saturday, May 16, 2015

Bioclipse 2.6.2 with recent hacks #2: reading content from online Google Spreadsheets

Similar to the previous post in this new series, this post will outline how to make use of the Google Spreadsheet functionality in Bioclipse 2.6.2. But before I provide the steps needed to install the functionality, first consider this Bioclipse JavaScript:

  google.setUserCredentials(
    "your.account", "16charpassword"
  )
  google.listSpreadsheets()
  google.listWorksheets(
    "ORCID @ Maastricht University"
  )
  data = google.loadWorksheet(
    "ORCID @ Maastricht University",
    "with works"
  )

Because that's what this functionality: read data from Google Spreadsheets. That opens up an integration of Google Spreadsheets with your regular data analysis workflows. I am not sure of Bioclipse is the only tool that embeds the Google client code to access these services, and can imagine similar functionality is available from R, Taverna, and KNIME.

Getting your credentials
The first call to the google manager requires your login details. But don't use your regular password: you need a application password. This specific, sixteen character, password needs to be manually created using your webbrowser, following this link. Create a new App password (”Other (Customized name)” ) and use this password in Bioclipse.

Installing Bioclipse 2.6.2 and the Google Spreadsheet functionality
The first you need to do (unless you already did that, of course) is install Bioclipse 2.6.2 (the beta) and enable the advanced mode. This is outline in my previous post up to Step 1. The update site, obviously, is different, and in Step 2 in that post you should use:

  1. Name: Open Notebook Science Update Site
  2. Location: http://pele.farmbio.uu.se/jenkins/job/Bioclipse.ons/lastSuccessfulBuild/artifact/buckminster.output/net.bioclipse.ons_site_1.0.0-eclipse.feature/site.p2/
Yes, the links only seem to get longer and longer. Just continue to the next step and install the Google Feature:


That's it, have fun!

Oh, and this hack is not so recent. I wrote the first version of the net.bioclipse.google plugin and matching manager, as used in the above code, dates back to January 2011, when I had just started at the Karolinska Institutet. But the code to download data from spreadsheets is even older, and goes back to 2008 when I worked with Cameron Neylon and Pierre Lindenbaum on creating RDF for data being collected by Jean Claude-Bradley. If you're interested, check the repository history and this book chapter.

Friday, May 15, 2015

CDK Literature #6

Originally a series I started in the CDK News, later for some issues part of this blog, and then for some time on Google+, CDK Literature is now returning to my blog. BTW, I created a poll about whether CDK News should be picked up again. The reason why we stopped was that we were not getting enough submissions anymore.

For those who are not familiar with the CDK Literature series, the posts discuss recent literature that cites one of the two CDK papers (the first one is now Open Access). A short description explains what the paper is about and why the CDK is cited. For that I am using the CiTO, of which the data is available from CiteULike. That allows me to keep track how people are using the CDK, resulting, for example, in these wordles.

I will try to pick up this series again, but may be a bit more selective. The number of CDK citing papers has grown extensively, resulting in at least one new paper each week (indeed, not even close to the citation rate of DAVID). I aim at covering ~5 papers each week.

Ring perception
Ring perception has evolved in the CDK. Originally, there was the Figueras algorithm (doi:10.1021/ci960013p) implementation which was improved by Berger et al. (doi:10.1007/s00453-004-1098-x). Now, John May (the CDK release manager) has reworked the ring perception in the CDK, also introduction a new API which I covered recently. Also check John's blog.

May, J. W., Steinbeck, C., Jan. 2014. Efficient ring perception for the chemistry development kit. Journal of Cheminformatics 6 (1), 3+. URL http://dx.doi.org/10.1186/1758-2946-6-3

Screening Assistant 2
A bit longer ago, Vincent Le Guilloux published the second version their Screening Assistant tool fo rmining large sets of compounds. The CDK is used for various purposes. The paper is already from 2012 (I am that much behind with this series) and the source code on SourceForge does not seem to have change much recently.

Figure 2 of the paper (CC-BY) shows an overview of the Screening Assistant GUI.
Guilloux, V. L., Arrault, A., Colliandre, L., Bourg, S., Vayer, P., Morin-Allory, L., Aug. 2012. Mining collections of compounds with screening assistant 2. Journal of Cheminformatics 4 (1), 20+. URL http://dx.doi.org/10.1186/1758-2946-4-20

Similarity and enrichment
Using fingerprints for compound enrichment, i.e. finding the actives in a set of compounds, is a common cheminformatics application. This paper by Avram et al. introduces a new metric (eROCE). I will not go into details, which are best explained by the paper, but note that the CDK is used via PaDEL and that various descriptors and fingerprints are used. The data set they used to show the performance is one of close to 50 thousand inhibitors of ALDH1A1.

Avram, S. I., Crisan, L., Bora, A., Pacureanu, L. M., Avram, S., Kurunczi, L., Mar. 2013. Retrospective group fusion similarity search based on eROCE evaluation metric. Bioorganic & Medicinal Chemistry 21 (5), 1268-1278. URL http://dx.doi.org/10.1016/j.bmc.2012.12.041

The International Chemical Identifier
It is only because Antony Williams advocated the importance of the InChI in this excellent slides that I list this paper again: I covered it here in more detail already. The paper describes work by Sam Adams to wrap the InChI library into a Java library, how it is integrated in the CDK, and how Bioclipse uses it. It does not formally cite the CDK, which now feels silly. Perhaps I did not add because of fear of self-citation? Who knows. Anyway, you find this paper cited on slide 30 in aforementioned presentation from Tony.

Spjuth, O., Berg, A., Adams, S., Willighagen, E., 2013. Applications of the InChI in cheminformatics with the CDK and bioclipse. Journal of Cheminformatics 5 (1), 14+. URL http://dx.doi.org/10.1186/1758-2946-5-14

Predictive toxicology
Cheminformatics is a key tool in predictive toxicology. I starts with the assumption that compounds of similar structure, behave similarly when coming in contact with biological systems. This is a long-standing paradigm which turns out to be quite hard to use, but has not shown to be incorrect either. This paper proposes a new approach using Pareto points and used the CDK to calculate logP values for compounds. However, I cannot find which algorithm it is using to do so.

Palczewska, A., Neagu, D., Ridley, M., Mar. 2013. Using pareto points for model identification in predictive toxicology. Journal of Cheminformatics 5 (1), 16+. URL http://dx.doi.org/10.1186/1758-2946-5-16

Cheminformatics in Python
ChemoPy is a tool to do cheminformatics in Python. This paper cites the CDK just as one of the tools available for cheminformatics. The tool is available from Google Code. It has not been migrated yet, but they still have about half a year to do so. Then again, given that there does not seem to have been activity since 2013, I recommend looking at Cinfony instead (doi:10.1186/1752-153X-2-24): exposed the CDK and is still maintained.

Cao, D.-S., Xu, Q.-S., Hu, Q.-N., Liang, Y.-Z., Apr. 2013. ChemoPy: freely available python package for computational biology and chemoinformatics. Bioinformatics 29 (8), 1092-1094. URL http://dx.doi.org/10.1093/bioinformatics/btt105

Groovy Cheminformatics with the CDK - 11th edition

It's been a while since I blogged about a release of my "Groovy Cheminformatics with the CDK" book, but not too long ago I made another release, 1.5.10-0. This was also the first one with white paper, and updated for the latest CDK development release.

There are two versions (and always check the special deals, e.g. today you can use UNPLUG10 to get an additional 10% off the below prices):
  1. paperback, for $25
  2. eBook, for $15, a PDF version
Compared to the 8th edition, this version offers this new material:
  • Chapter 1: Cheminformatics
  • Section 13.3: Ring counts (though it is not updated for John's ring perception work, doi:10.1186/1758-2946-6-3)
  • Section 14.1: Element and Isotope information
  • Section 16.4: SMARTS matching
  • Chapter 20: four more Chemistry Toolkit Rosetta solutions
  • Section 24.1: CDK 1.4 to 1.6 (see also this series)
This version of the book has 204 Groovy scripts, all of which have been tested against CDK 1.5.10.