Monday, March 26, 2012

BiGCaT: Protein Structure

The last two weeks I have been busy with a course I gave last week (two two-hour lectures and two three-hour practicals) on Protein Structure. I had slides from last year at my disposal, but added some stuff, including a pointer to Google Play's page for the Jmol Android App (cool stuff!). The content involved the basic links between primary, secondary, tertiary, and quaternary structure.

At Maastricht University we use problem-based learning, and the topics for the students of this block (two bachelor studies) were aging and immunology. Interesting, because as a chemist I have limited background in these fields. Other content in this bioinformatics course done by our BiGCaT group involved SNPs, and I found a few interesting examples. As a protein structure visualization tool, the course uses Yasara, a great tool, but next year I would prefer to use Bioclipse.

For the aging topic, I found a protein encoded by the WRN gene involved in Werner's Syndrome. I do not believe they found the actual cause of this disease yet, but that just makes it more interesting. The 3AAF structure in the PDB database (which, BTW, gets really slow when the USA wakes up), is associated with two SNPs:

So, it was only logical that I had the students visualize the residues and asked them to hypothesize which of the two was more likely to affect DNA binding:

Also very interesting to look at in this structure is the π-π stacking by which this protein binds to the DNA:

OK, you really need to look at this yourself in 3D, but you see the Tyr and Phe binding to the DNA bases here :)

So, next year, one of these exercises will basically become a Bioclipse plugin with a nice cheat sheet. I also plan to finally write that plugin that educates people in the Jmol scripting language. Who wants to team up for that?

The other students got to look at the 3O4L structure (thanx to Patrick for the great topic!), which is a beautiful complex of protein structures from a normal cell (green and blue) and a T-cell (yellow and red):

In the middle is a small peptide (element colors), broken down by the normal cell from an Epstein-Barr virus. The membrane-bound protein parts are missing, but just imagine two huge cells on the left and right of this!

The red and yellow parts are very similar structures, with a really nice S-S bond holding together the two subunits (all S-S bonds are depicted in magenta):

And this simple complex has many variants on the T-cell receptor side, allowing our human body to adapt to all sorts of viruses, etc. This is what life looks like! Isn't chemistry amazing!

Thursday, March 22, 2012

CDK 1.4.8: the changes, the authors, and the reviewers

Mmmm... when I pasted the list of changes in the CDK 1.4.8 release (download it here), I was surprised by the length. Then again, the 1.4.7 release was almost three months ago.

However, a careful look at the list shows that quite a few are JavaDoc fixes, additional unit testing in the inchi module, and other small tuning. Of much more interest is the fix in the SDF parser, which failed to continue reading when it encountered a broken molfile entry in the SD file. For some just as important: the DeduceBondSystemTool got some attention, and Kevin improved the code by having it look for all rings only in isolated ring systems, a trick we use elsewhere too, speeding it up significantly for some structures. BTW, that class still has limitations, and we are discussing other options. Atom parities can now be read and written in the MDL molfile format, and IUPAC isotope masses have been updated according to a 2009 report. The PDB and MDL readers have seen two further improvements.
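The SDF parser fix is easier to appreciate with the file layout in mind: an SD file is a series of molfile entries, each closed by a line starting with $$$$, so a reader can pick up again at the next delimiter after a broken entry. Below is a minimal, self-contained Java sketch of that idea only. It is not the CDK IteratingMDLReader code, and the class SkippingSdfSketch and the helper parseAtomCount (a stand-in that merely reads the atom count from the counts line) are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: NOT the CDK implementation, all names are hypothetical. */
final class SkippingSdfSketch {

    /**
     * Stand-in for a real molfile parser: returns the atom count found in
     * the first three characters of the counts line (the fourth line).
     * Throws IllegalArgumentException when the entry is broken.
     */
    static int parseAtomCount(List<String> entry) {
        if (entry.size() < 4) {
            throw new IllegalArgumentException("entry has no counts line");
        }
        String counts = entry.get(3);
        if (counts.length() < 3) {
            throw new IllegalArgumentException("counts line too short");
        }
        // NumberFormatException is a subclass of IllegalArgumentException
        return Integer.parseInt(counts.substring(0, 3).trim());
    }

    /** Reads all entries, skipping broken ones instead of giving up. */
    static List<Integer> readAtomCounts(String sdfContent) {
        List<Integer> results = new ArrayList<>();
        List<String> entry = new ArrayList<>();
        for (String line : sdfContent.split("\\R")) {
            if (line.startsWith("$$$$")) {
                try {
                    results.add(parseAtomCount(entry));
                } catch (IllegalArgumentException broken) {
                    // broken entry: drop it and continue with the next one
                }
                entry.clear();
            } else {
                entry.add(line);
            }
        }
        return results; // lines after the last $$$$ delimiter are ignored
    }

    public static void main(String[] args) {
        String sdf = String.join("\n",
            "first", "", "", "  2  1  0  0  0  0  0  0  0  0999 V2000",
            "    0.0000    0.0000    0.0000 C   0  0",
            "    1.5000    0.0000    0.0000 C   0  0",
            "  1  2  1  0",
            "M  END", "$$$$",
            "broken", "", "", "not a counts line", "M  END", "$$$$",
            "third", "", "", "  1  0  0  0  0  0  0  0  0  0999 V2000",
            "    0.0000    0.0000    0.0000 O   0  0",
            "M  END", "$$$$");
        System.out.println(readAtomCounts(sdf)); // prints [2, 1]
    }
}
```

The broken second entry is dropped, and the third entry is still read, which is the behavior the fix is after.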

All in all, enough to mark this release too as a highly recommended update in the 1.4 series!

The changes
  • Helper methods now return results, rather than change parameter content, and use interfaces as variable types rather than implementations (patch by Kevin Lawson). d4c4af2
  • Added Kevin Lawson to the list of AUTHORS e9c3f4b
  • The Bioclipse test case of failing bond order assignment 596258a
  • Unit test with the SMILES reported by Kevin Lawson d9c147d
  • Split up into ring systems before running the AllRingsFinder to speed up the algorithm (patch by Kevin Lawson) 47d65be
  • Bridge methods are methods introduced by compilers in relation to methods that use generics. Various renderer related method trigger test methods for just bridge methods, but are never really implemented. 32d58d6
  • Removing redundant code b56e08b
  • Fixed empty @return javadoc tags: some removed (e.g. with @inheritDoc), some now got a description c6ec88f
  • Added a missing # in the @see (needed for methods) dd4e299
  • Inline linking is done with @link, not @see 2e45004
  • Synchronized variable names in the API and JavaDoc 5554355
  • Fixed link to the CDKMCS class, which is now in a different package 7253481
  • IteratingMDLReader Skip Patch ca8f142
  • More unit tests. All methods are now at least tested once. (closes #3005889) e63ff94
  • Testing of the InChI generator 10fe8c0
  • Added dedicated unit testing for the InChIGeneratorFactory a80b6d5
  • Test to make sure the SMILES parser gets the formal neighbor counts correct 5b1654d
  • Fixed a wrong test for undefined values in the Burden matrix 516edbb
  • Added unit test for undefined G-M partial charges 38b548f
  • Updated bcut descriptor to check for undefined values before getting eigenvalues. Added test file and test case. Addresses bug 3489559 7d698ef
  • Fix for charge parsing in PDB files, based on the patch by Nils 7c7248a
  • Resolution for Bug 3485634 e2eb007
  • Same as earlier: use the returned object, which is not the same as the passed IAtomContainer 7199518
  • Assertion error fix a977e85
  • Write atom parity c7aa428
  • Fixed unit test: because the implementation creates an IMolecule object, the original 'molecule' pointer does not get atoms, and the returned IMolecule object should be assigned to get a molecule with information 8ff4584
  • Added close to unit-test resource stream eae7598
  • Added ability to read atom parity from Mol files (inc. junit test). 07bbbb2
  • Updated isotope abundance values according to the IUPAC technical report 2009. fb47c57
  • Updated unit test for commit #3093241, where null's are always larger than an actual object 1bb844d
  • Ensure the OpenJavaDocCheck plugins are compiled 03ef86b
  • s/NewDefaultChemOjectBuilder/DefaultChemObjectBuilder/g (closes #3456420) e2344ff
  • Fixes for the pseudo atom patch backporting: allow reading of IAtomContainer where IMolecule was already supported, fixed imports. Also removes a redundant, incorrect accepts() method 7dc949f
  • Updates MDL readers to set symbol of pseudoatoms to label. Ensures that writing such a molecule does not force the SDF writer to see all pseudoatoms as R groups. Added unit test and test cases for V2000 and V3000 readers as well as V2000 writer e59d204
  • Made the method static, to follow JUnit practices aac73e7
  • Added Yap Chun Wei to the list of authors 2943deb
The authors
24  Egon Willighagen
 7  John May
 7  Rajarshi Guha
 2  Kevin Lawson
 1  Stephan Beisken

OK, formally, I had 26 patches, but two of those were .java files sent by Kevin Lawson that I merely gitified. A third patch was based on suggestions from Nils.

BTW, I have a tendency to end up in the upper part of this list, but please keep in mind that I gain extra commits with minor maintenance tweaks, and that I am tremendously happy with contributions from all! This release is another one that is largely the result of work from less frequent contributors, and summing all those up, that is what keeps the CDK project going! Thanx to all in the CDK community!

The reviewers
9  Rajarshi Guha 
7  Egon Willighagen 
1  John May

Likewise, the importance of these reviewers should not be underestimated either. And this list only gives the names of the people that signed off patches; the list of reviewers is sometimes longer! In fact, one of my patches was the direct result of excellent peer reviewing by Arvid Berg (Uppsala University), who suggested I use a one-line Method.isBridge() call instead of my 40 lines of code! Well done!
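For those who have not run into bridge methods before: when a class implements a generic interface, the Java compiler adds an extra method with the erased signature, and reflection reports it next to the hand-written one. The sketch below shows how Method.isBridge() tells the two apart. It is a stand-alone illustration, not the code from the CDK patch, and the BridgeMethodSketch and Mass classes are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: class names are hypothetical, not taken from the CDK. */
final class BridgeMethodSketch {

    /**
     * Implementing Comparable<Mass> makes the compiler generate a bridge
     * method compareTo(Object) that delegates to compareTo(Mass).
     */
    static final class Mass implements Comparable<Mass> {
        final double value;

        Mass(double value) {
            this.value = value;
        }

        @Override
        public int compareTo(Mass other) {
            return Double.compare(value, other.value);
        }
    }

    /** Returns the declared methods that are not compiler-generated bridges. */
    static List<Method> nonBridgeMethods(Class<?> type) {
        List<Method> result = new ArrayList<>();
        for (Method method : type.getDeclaredMethods()) {
            if (!method.isBridge()) {
                result.add(method);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lists both compareTo variants, with their bridge status
        for (Method method : Mass.class.getDeclaredMethods()) {
            System.out.println(method + " bridge=" + method.isBridge());
        }
    }
}
```

A check that insists on a test for every method can use such a filter to ignore the generated compareTo(Object) variant.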

Visualizing metabolite fluxes on WikiPathways pathways using a PathVisio plugin

Presenter: Anwesha Dutta (student in our group)

Date: Thursday March 29th 2012

Description:  Biological pathways provide intuitive frameworks to integrate and co-analyze different kinds of biological data, such as system-wide transcriptomic, proteomic, and metabolomic measurements. While insightful, pathway analysis is generally limited to qualitative conclusions, and the analyses can only be as powerful as the curated annotations can enable. Using our open-source pathway analysis platform, PathVisio, we will bridge pathway analysis to the wealth of quantitative approaches already in development for metabolic network modeling, such as flux balance analysis and dynamic simulation. Our focus will be on the visualization of the modeling results, which will be critical for understanding how simulated models correlate with experimental measurements.
The same biological processes that are visualized in pathways are also described by quantitative models. For example, the arrows that connect entities within metabolic pathways actually represent metabolite fluxes. The integration of large-scale data analysis with modeled or measured fluxomics data will help to gain more insight into the mechanisms of the biological process.

The meeting is in the BiGCaT course room (1.302 in H1), UNS50 south wing, 1st floor, from 16.00 to 17.00.

Friday, March 09, 2012

Dutch government threatens with censorship: publishing needs export permit

De Volkskrant reported very worrying news this morning. Dutch minister Bleker threatens to censor Dutch researchers, based on the idea that publishing is exporting knowledge, and therefore needs a permit from his department. One has to realize this is in the context of the flu mutation research in Rotterdam (see H5N1, and this Nature item), but such a censorship threat is unacceptable, and is a direct attack on science.

Bleker (@HenkBleker) is the Dutch underminister for economics. De Volkskrant writes:

"Geheel terloops gaf Bleker aan dat hij een nieuw machtsmiddel denkt te hebben om publicatie tegen te houden. Publicatie staat voor hem gelijk aan export van gevoelige kennis en daarvoor is een vergunning nodig."

which translates to:

"Bleker indicates he believes he can use a means of power to stop the publication: publication is to him equal to export of knowledge for which a permit is required."

Very much like the permit you need to export military weapons. For this reasoning it makes no difference to him whether the publication itself is with a foreign (Nature, Science) or a Dutch (Elsevier, local newspaper) publisher.

The virus image is in the public domain and available from Wikipedia.

Oops... I forgot to get an export permit!

Bioclipse and SADI

I do have to apologize to the Dumontier Lab... Michel visited us when I was still at Uppsala University (so, that's way too long ago), and we spoke about SADI, a really interesting framework for semantic computing. I thought that at the time, and I still do. But I never got around to playing with it... but this week was so weird, I had to get my mind off things... so, time for some random hacking...

Thus, a SADI manager (client) for Bioclipse. Why? Because I can. Two minutes to have the Bioclipse SDK set up a new plugin template, two hours to decide that one by one copying Maven dependencies is the worst waste of time ever, two minutes to figure out how git svn clone for just the last few Subversion commits works again, five minutes of cloning, two minutes of running mvn assembly:assembly, three minutes of figuring out what to add to the pom.xml again, one minute of running mvn assembly:single, and three minutes of waiting for the registry triples to be loaded after calling the Registry.getAllServices() method:
> sadi.listServices()

The code is available from the bioclipse.rdf repository.

Tuesday, March 06, 2012

Visualization of Life Science data with Java (Apache 2.0)

Daniel Swan pointed me to this blog post with a rather interesting title: Exploring multiple cancer genomics alterations with Gitools. Gitools is more than a library, but the visualization alone seems interesting already.

(The image comes from their website, and is licensed CC-BY.)

Gitools itself is licensed Apache 2.0 and I am wondering if we can reuse this in other projects, e.g. Bioclipse.

Sunday, March 04, 2012

ChEMBL 13 as RDF

Update: this work is now described in this paper.
Last week, ChEMBL 13 was released, with even more data, data fixes, etc. Since my RDF for ChEMBL 09, my workflow has become more solid, and it now uses more common ontologies as well as ontologies I just like, such as CHEMINF and CiTO. Below is an overview of the resource types present in the RDF: activities (almost 7M now), chemical entities, assays, targets, and documents.

The data on Kasabi will be updated soon, and the SPARQL end point hosted by Uppsala University was updated yesterday, including the SNORQL frontend:

The new data is not fully backwards compatible. The changes to the RDF include the use of cito:citesAsDataSource, more typing using existing ontologies, e.g. with cheminf:CHEMINF_000000 and pro:PR_000000001 from the PRotein Ontology.

A paper dedicated to the ChEMBL-RDF is in preparation. Existing use cases can be found here.

Saturday, March 03, 2012

Call for Input: a "" for the (Life) Sciences


define at most five diverse concepts by which triple sets can be summarized for the (life) sciences.

The website is used by various big web players to define a small set of entities omnipresent on the web. Talis' Kasabi uses this to summarize the triple data sets hosted, and uses the few entities for graphical icons combined with the number of resources of that type.

Now, that website defines types like Person, Organization, Event, and Product. To summarize the content of (life) sciences databases, we need different entities to graphically summarize the triple set content. For example, Material Entity could be one, and Organism another. Sample might be useful, or perhaps Experiment.

Any set of concepts will do, but the challenge is limited to a mere five concepts. General terminology is therefore more suitable than detailed concepts. Keep in mind that more detailed vocabulary overviews are typically available anyway, but those are too detailed to provide a short overview.

Indeed, the five concepts must also advertise properly the content of the triple set, so that scientists can quickly see if this triple set is of interest to them.

(Image: Wikipedia; U.S. public domain)

Thursday, March 01, 2012

Bioclipse 2.5 development version download size analysis

Arvid has done a hell of a job getting Bioclipse building on the Hudson server. I have been configuring the system to compile Bioclipse extensions, and it is not trivial; I dare say, it's harder than learning git. New binaries are now available with further improvements to the core platform, ultimately leading to the next stable Bioclipse 2.6 release. Well, I can say it is worth the wait, but several papers have already been published showing 2.6 functionality, such as the Bioclipse-OpenTox paper.

I just downloaded the binary for 32-bit Linux (here) and noted the size. I think it has gained some weight, and I was wondering what the biggest contributions were. So, I used KDE's Konqueror web/file browser to visualize the disk use:

The largest plugin turns out to be the Bioclipse plugin for JasperReports, which is used by the Decision Support plugin. I do not have the source code of all Bioclipse extensions installed, and cannot see if there is other code also depending on it. But if that is not the case, it may be a nice opportunity to make the default Bioclipse download a bit smaller.
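For those without Konqueror at hand, a similar ranking can be produced with a few lines of Java. This is only a rough sketch: the PluginSizeSketch class is made up for this post, and the default "plugins" folder name is an assumption based on how Eclipse-based applications are usually laid out, so pass the actual folder as the first argument.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Illustration only: a hypothetical helper, not part of Bioclipse. */
final class PluginSizeSketch {

    /** Size in bytes of a regular file, or of all regular files below a folder. */
    static long sizeOf(Path path) {
        try (Stream<Path> walk = Files.walk(path)) {
            return walk.filter(Files::isRegularFile).mapToLong(file -> {
                try {
                    return Files.size(file);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps each direct child of the folder to its size, largest first. */
    static Map<String, Long> sizesLargestFirst(Path folder) {
        Map<String, Long> sizes = new HashMap<>();
        try (Stream<Path> children = Files.list(folder)) {
            children.forEach(child ->
                sizes.put(child.getFileName().toString(), sizeOf(child)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Sort by size, descending; LinkedHashMap keeps that order
        return sizes.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (first, second) -> first, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // "plugins" is an assumed default; pass the real folder as argument
        Path folder = Paths.get(args.length > 0 ? args[0] : "plugins");
        sizesLargestFirst(folder).forEach((name, bytes) ->
            System.out.printf("%12d  %s%n", bytes, name));
    }
}
```

Both plugin jars and unpacked plugin folders are handled, because sizeOf() sums everything below the path it is given.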

Another interesting thing I noticed is that Bioclipse comes by default with Ant... wondering what that is being used for, or what plugin is depending on that...