Tuesday, November 27, 2012

Triples, stores, and SPARQL in R

For some time I have been stealing an hour here and there for the rrdf package for R. This package is based on Apache Jena and allows reading and writing of RDF triples, as well as local and remote SPARQL querying. BTW, rrdf is not the only R package to provide SPARQL functionality, and another package will be demoed at SWAT4LS.

It took me some time to get around to it, but I finally set up a vignette with Sweave for this package. Here it is, explaining some basic functionality of the package, just in time for #swat4ls:



So, within a week or so, I have used <iframe> in the blog a few times now. The above one is being served by Google Drive.

Friday, November 23, 2012

A Mendeley group for @Open_PHACTS

The past few months have seen an increasing paper trail for our Open PHACTS project. Lots of cool stuff is ongoing, and more and more is getting openly available. There is a steep learning curve within the project on being Open, and the project makes sure it is done properly. But it takes time. With the Open Standards and Open Source getting out now, I think we have a reasonable start.

Yesterday, I created a Mendeley group with our paper trail:


The purpose was to get some #altmetrics on the impact of our project (I blogged about that a year ago). We haven't started tagging the papers, and comments on useful tags are most welcome. Should we tag with matching Example Application, with consortium partner, or both? Something else? Let me know.

Of course, Mendeley is also an attractive platform, with Word/OpenOffice/LibreOffice plugins for reference management, and it provides nice web pages for papers with possible direct links to OA PDFs and bookmark statistics. For example, for this paper:


We can see the paper detail in the middle, and get added value on the right. We see a thumbnail of the paper, a list of authors with Mendeley profiles (here only Rob Hooft), and then the reader statistics, and learn that 129 people have taken the time to put this paper in the reference database.

In fact, that is quite a lot. You can believe me, or you can look up the numbers. That is what #altmetrics is about. We can use altmetric.com and find these numbers:


Not only do such #altmetrics give us a number, they actually tell us who, what, why, and how. Much more informative than, let's say, a journal impact factor. This is scientific communication in action.

What this page does not tell us is whether 46 is high, though the page does comment that this paper "is one of the highest ever scores in this journal (ranked #6 of 1,586)". Now, this is a Nature Genetics paper, and more than 1500 Nature Genetics papers got a value, and this paper is ranked #6! Yes, that is impact.

Total Impact gives further detail, but is called ImpactStory now. I ran this on output from our project, papers but also software and slides. One neat and really useful feature of this webpage is that it provides percentiles on data, in more detail than the comment from altmetric.com:


We get a detailed view on where the impact is found, and the percentile information. Here too, we learn that this paper has a relatively high impact, compared to peers: it is in the top 3% of papers by impact. Interestingly, the Mendeley reader count was not picked up (update: this was tracked down as a data glitch in the Mendeley database). Mind you, the percentiles for 2012 are not yet available; we have to wait a month or two for those.

And, all counts are linked. Just click on, for example, "72 tweets" (using Topsy) and you get the actual tweets, and learn what people have to say about this paper:


Once more, this is scientific communication in action!

But to do full justice to Euan Adie's altmetric.com work, that site captures the blogosphere pretty well (not surprisingly, given it is Euan). Just check the screenshot above again.

Sunday, November 18, 2012

DHSs and histone modifications: methylation, acetylation, citrullination, and phosphorylation

One day on, and still struggling with the chemistry behind gene regulation. Let no biologist ever tell me again not to use acronyms (yes, I am looking at you!). But it is interesting. I learned a lot about ChIP, histone modifications, etc, etc. This is an amazing world, where specific histone complex protein residues get methylated, acetylated, citrullinated, and phosphorylated. Of course, all this is in the context of the ENCODE meeting we have tomorrow at BiGCaT, where I will try to cover a paper by Thurman et al.

In that paper, Thurman studies the links between DNase I hypersensitive sites (DHSs) and markers of regulation. These DHSs are areas between histones where the DNA is free of histone proteins. There are remarkable images around showing histones as beads on a string, and the distance in nucleotides between histones is in fact not that large. In fact, a histone, despite being a large complex that sterically hinders 50% of the DNA access, does not stop transcription; the transcription complexes apparently have no trouble passing the histones, as described by Felsenfeld et al. Quite amazing!

Now, those histones are chemically modified with acetyl, methyl, phosphate, and other groups, at well-described residues, and each modification easily regulates other modification steps. And everything regulates gene expression. Oh, and as we saw yesterday, all that is regulated by metabolites, which in turn... Lovely. Try modeling that mathematically :) Here's what Abcam has to say about it:

Acetylation is generally linked to gene activation. Acetylation on Lys-10 (H3K9ac) impairs methylation at Arg-9 (H3R8me2s). Acetylation on Lys-19 (H3K18ac) and Lys-24 (H3K24ac) favors methylation at Arg-18 (H3R17me). Citrullination at Arg-9 (H3R8ci) and/or Arg-18 (H3R17ci) by PADI4 impairs methylation and represses transcription. Asymmetric dimethylation at Arg-18 (H3R17me2a) by CARM1 is linked to gene activation. Symmetric dimethylation at Arg-9 (H3R8me2s) by PRMT5 is linked to gene repression. Asymmetric dimethylation at Arg-3 (H3R2me2a) by PRMT6 is linked to gene repression and is mutually exclusive with H3 Lys-5 methylation (H3K4me2 and H3K4me3). H3R2me2a is present at the 3' of genes regardless of their transcription state and is enriched on inactive promoters, while it is absent on active promoters. Methylation at Lys-5 (H3K4me), Lys-37 (H3K36me) and Lys-80 (H3K79me) are linked to gene activation. Methylation at Lys-5 (H3K4me) facilitates subsequent acetylation of H3 and H4. ... ...

And that goes on for a while. Ambitiously, I started converting things I read into a WikiPathways pathway:


I think that will keep me busy for a while. I won't even attempt to complete it further tonight. I gave up on that about an hour ago. In fact, I returned to the paper by Thurman, as I still have to figure out how their experimental methods work. In fact, how does one even detect the chemical modification of a histone, and to which DNA sequence on any of the chromosomes it belongs?? I mean, that's not AFM or STM, I say...

No, it's ChIP. ChIP on a chip, in fact. They have antibodies that stick specifically to histones with one particular modification. That is how I actually ended up on that Abcam web page in the first place. Check out this nice western blot. With a huge antibody detecting whether there is an acetyl modification. Wicked!

Well, earlier I learned that proteins detect methylated CpG bases not because of the methyl group (which amazed me already), but by a distorted hydration in the major groove due to MeCP2 binding. Seriously! Eat that, organic chemist friends!

So, Thurman and friends find distal DHSs and relate these to cis-regulatory elements. To some extent, that is puzzling, because the above tells us that a lot of regulatory work is happening outside those DHSs. But then again, I did read today about DNA methylation triggering histone modifications. It seems there are so many interactions going on that it resembles a melting pot. Oh wait, that makes sense; it's one big one-pot synthesis anyway.

The paper discusses an enormous amount of experimental work, and I cannot seem to be able to make sense of it all. There are striking aspects to it, which I will touch upon momentarily. But I cannot help but mention that I am not sure they could either. Their Discussion section leaves something to be desired, like an actual discussion. Instead, they just summarize the paper.

They used ChIP with Cell Signaling's 9751 antibody recognizing H3K4me3, with formaldehyde-induced crosslinking. It actually turns out that the peaks for this modification are right on top of the DNA part from which the transcript is made, in line with Felsenfeld's observation. Upstream of that, where the promoters are expected, is where DNase I signals are found. That is, I think this means that the DHS upstream of the histone where transcription starts is where the promoter regulation happens. With transcription factors (TFs), of course. And those DHS regions are where DNA methylation happens: Thurman finds DNA methylation in those regions, inhibiting TF binding, because the already mentioned MeCP2 already takes that place.

Now, then they make a jump from this low level chemistry to a genome wide landscape. Well, they actually start with that, but as a chemist, I am more of a bottom-up guy (that is an IT method). They report that most DHSs are found in introns and at distal locations. The first is striking: the ratio between intron/exon is >99. Does that imply that exons basically are always DNA wrapped around histones?? Does that actually then tell me that transcription actually sort of requires steric hindrance of the histone?? Ha, those diagrams by biologists would then be even more misleading than they already have been to me (don't ask me how long it took me to learn that there are some 10-40 mitochondria per cell! and I still do not know if all copies in the cell have the same DNA, or if they are more like a population like your microbiome).

Now, distal DHSs are the second largest group, and capture some 40-45% of all DHSs. Distal means typically more than 2.5 kb away from the TSS (transcriptional start sites). Most of them are somewhere between 10 and 50 kb away. Now, isn't that something? That is distant indeed!

What? Still with me? Let's do some math. It's hard, and I hope to get it right. A human has about 3 billion base pairs (I'll take the Wikipedia count). The paper finds almost 3 million DHSs. That means that the average distance between DHSs is about 1 kb. Compare that to their diagram 1b, outlined in the previous paragraph. That means that the DHSs must be very densely placed around the transcribed genes. Indeed, they report ratios of up to and above a 100 fold increase. It must be like that, because otherwise, you cannot get those distances for distal DHSs.
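Written out with the rounded numbers from the text (so this is only an order-of-magnitude estimate, and it ignores the length of the DHSs themselves):

```latex
\[
  \bar{d} \;\approx\; \frac{3 \times 10^{9}\ \text{bp}}{3 \times 10^{6}\ \text{DHSs}}
  \;=\; 10^{3}\ \text{bp per DHS} \;=\; 1\ \text{kb}
\]
```

With most distal DHSs sitting 10 to 50 kb from a TSS, a genome-wide average of about 1 kb can only come out if the remaining DHSs are packed much more tightly elsewhere.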

Now, another interesting aspect of the paper is that they find different DHSs for different cell types. That, in fact, increases the average distance between DHSs: those 3 million they find are for 125 cell lines, and more DHSs are found in fewer than 20 cell lines. Only promoter-related DHSs seem to be more persistent between cell lines. This implies that different cell lines have different genes unfolded in nucleosome/DHS rich areas (defining the chromatin accessibility), triggering different gene expression. That all makes sense, and is rather exciting too. As such, it seems to me that this map effectively gives a predictive model, indicating which genes are expressed in which cell types.

A further question they ask is if DNA (not histone) methylation is the cause or the result of DHSs. They confirm the earlier found correlation between DNA methylation and gene silencing. They basically question if things like MeCP2 binding happen because no transcription factor is in the way, or that the TF cannot bind because MeCP2 is there. Chemically, these are perhaps equivalent: they have competing binding affinities. Except that the methylation must happen at some point too. They suggest that that may be due to DNA getting randomly methylated, perhaps not unlike passive demethylation. Chemically, that does not make sense to me. I would guess there are many chemical species in the cell that would get more easily methylated... They believe they have found evidence for passive deposition, but also find positive correlation between methylation and gene expression. I would say, the answer is still out there.

OK, that's about how far I got now. The last two pages I have to read again, and see what papers I need to read to make sense of that. And I will try to see what others have been saying about this paper. One hooray for #altmetrics!

Thurman, R., Rynes, E., Humbert, R., Vierstra, J., Maurano, M., Haugen, E., Sheffield, N., Stergachis, A., Wang, H., Vernot, B., Garg, K., John, S., Sandstrom, R., Bates, D., Boatman, L., Canfield, T., Diegel, M., Dunn, D., Ebersol, A., Frum, T., Giste, E., Johnson, A., Johnson, E., Kutyavin, T., Lajoie, B., Lee, B., Lee, K., London, D., Lotakis, D., Neph, S., Neri, F., Nguyen, E., Qu, H., Reynolds, A., Roach, V., Safi, A., Sanchez, M., Sanyal, A., Shafer, A., Simon, J., Song, L., Vong, S., Weaver, M., Yan, Y., Zhang, Z., Zhang, Z., Lenhard, B., Tewari, M., Dorschner, M., Hansen, R., Navas, P., Stamatoyannopoulos, G., Iyer, V., Lieb, J., Sunyaev, S., Akey, J., Sabo, P., Kaul, R., Furey, T., Dekker, J., Crawford, G., & Stamatoyannopoulos, J. (2012). The accessible chromatin landscape of the human genome. Nature, 489 (7414), 75-82 DOI: 10.1038/nature11232

Felsenfeld G, Boyes J, Chung J, Clark D, & Studitsky V (1996). Chromatin structure and gene expression. Proceedings of the National Academy of Sciences of the United States of America, 93 (18), 9384-8 PMID: 8790338

Saturday, November 17, 2012

The chemistry of DNA modifications for gene regulation

I have started learning about epigenetics, and particularly the regulatory effects of DNA methylation and histone acetylation. It's cool, it's hot, it's everything we hope will explain genetics, because genes certainly did not.

The chemistry behind this involves interesting pathways and the storage of information that passes from one generation to another... epigenetic effects down to the grandchild generation have repeatedly been shown now. Likely candidates are mRNAs that persist beyond the cell division, which trigger modifications again. Well, that is cool chemistry indeed! So, the chemist in me asks: so where are residues actually methylated then? I am learning here, and trying to get the facts together. But, the bases seem to be one place, blocking interactions with DNA-binding proteins which can show beautiful residue/base pair interactions at the sides of the bases. Second year students at Maastricht University in Biomedical Sciences had this as part of their practical last year.

But for that genetic information to pass around and persist, and for gene regulation in general, there are brilliant pathways, which may involve metabolites, like butyrate, which acts as energy source in certain systems. Donohoe et al. report work around a pathway for histone acetylation, where they found an interaction with the Warburg effect. While in both cases butyrate triggers an increased acetylation, the mechanism is different. They propose this pathway, which I am making available on WikiPathways (CC-BY):


The page on WikiPathways is not complete yet, but then, I haven't finished reading the full paper yet either. I wonder how many of these pathways are known. Do you know one? Leave a DOI/PubMed ID in the comments, or add the pathway to WikiPathways yourself.

Donohoe, D., Collins, L., Wali, A., Bigler, R., Sun, W., & Bultman, S. (2012). The Warburg Effect Dictates the Mechanism of Butyrate-Mediated Histone Acetylation and Cell Proliferation. Molecular Cell DOI: 10.1016/j.molcel.2012.08.033

Sunday, November 11, 2012

CDK 1.4.15: the changes, the authors, and the reviewers

At some point I had thought that I could finally concentrate on master. We have enough regressions there, of various kinds, some 40-50 unit tests that did pass properly in the past. Various core changes that increase the accuracy of our library have the nasty side effect that they uncover certain assumptions. But let's not talk about master yet, and focus on the 1.4.15 release (download here). Contrary to what I had hoped, a lot changed since the 1.4.14 release. On the bright side, CDK 1.4 is getting more and more reliable with every minor release.

A major addition in this release is that of a data model for double bond stereochemistry, so that the CDK now handles the two most common forms of stereochemistry for small molecules. It must be stressed that not all IO classes are reading data into this data model yet. The interface looks like this (the full JavaDoc is found here):

  public interface IDoubleBondStereochemistry
  extends IStereoElement {
    public enum Conformation {
        TOGETHER,  //  as in Z-but-2-ene
        OPPOSITE   //  as in E-but-2-ene
    }
    public IBond[] getBonds();
    public IBond getStereoBond();
    public Conformation getStereo();
  }

Other new functionality includes an alternative aromaticity checker, which is happy to mark rings aromatic even if the ring has double bonds pointing outside the ring (e.g. benzoquinone). That means we now have two algorithms in the CDK to perceive aromaticity.

Otherwise, there is a truckload of fixes. One really important one is the fix that ensures that stereochemistry is also copied by clone(). Other fixes include minor atom typing work, including new selenium atom types, the further generalization of the IO accepts() methods, and a fix in the SDG code to not delete bridging hydrogens before doing structure clean up. There are many more small fixes and tunes, and as always, the full list is given below.

The changes
  • Fixed a bug present in many readers: it would not accept a subclass of ChemFile (e.g. NNChemFile) even if ChemFile itself was accepted bc30798
  • Fixed loading of the right class when reporting possible alternative constructors 420533e
  • [bug:1275] added check to ensure that when String.substring is called the string is long enough 5a7baa4
  • [bug:1274] added conditional to ensure that when multiple bond stereo is specified as attributes and characters only one is used. This is achieved by using the existing flag to determine if a stereo bond value has already been provided. 0e97eff
  • Updated Gilleain's code to hook in with the other two new selenium atom types 9fb2dd3
  • Missing Se.2 atom type and test case 21186c0
  • Added finally clause to ensure the file is closed a50c303
  • - added unit test to demonstrate the bug a6d3d6e
  • added unit test to demonstrate bug and correct bug id's for two recents tests 0ea90e8
  • Adding CMLReaderTest of io module test suite fefc82d
  • Resolved NoNotify fails on AtomParity. Error was due to subclassing of AtomParity. Also the assertEquals params were swapped as the assertion was the wrong way. 46ffbb4
  • Added unit tests for bugs 1270 (removalAllElements should remove stereo elements) and 1273 (double bond stereo chemistry constructor should throw an exception on wrong input). Added @cdk.bug tag for bug 1264 (stereo element cloning) 938306a
  • Used return covariance on clone() to provide cleaner front-end API c3d4af0
  • Added deep cloning of stereo element to atom containers and polymers (atom container subclass) a46545d
  • Added stereo element shallow copy 7a1f243
  • Added a 'map' method on all IStereoElements. The map method allows a stereo element on one container/molecule to mapped to a stereo element on another. This mapping is achieved using two symbol tables, one for atoms and one for bonds. All methods are null safe and the mapping will not fail if any content in the stereo elements is null. The mapping simplifies the cloning of molecules/atom containers but could also be used when comparing isomorphic graphs. 267ec79
  • Reworked AtomContainer.clone() so it is clear what is going on. We now use a HashMap between the original and cloned atoms to avoid a linear search each time the atom mapping is needed. This is also useful when we add StereoElement cloning (not yet implemented). We also store a bond mapping as well - we will need the bond mapping for double bond stereo chemistry. The stereo elements in the clone need to be set to an empty array on the clone so we don't remove elements from the original (cloning oddity). It was also clear we need to change the clone() method on Polymer which currently undoes all the cloning work we do in the AtomContainer. For all clone instances I added some code to correctly create HashMaps that won't need to be resized. The default HashMap implementation works best at 0.75 capacity - we therefore need to do some simple arithmetic to ensure we don't get a resize. The implemented method is what is used in the Guava library. f98b04a
  • Added unit tests for DoubleBond and AtomParity cloning a7e4255
  • Added ability to setStereoElements - this was required due to clone() being shallow on List. We need to be able to set a new array when we have cloned an AtomContainer 84f8f0e
  • Added unit test for tetrahedral chirality stereo element 83333d7
  • Fixed closing (fixes #1265) a9a27ed
  • Added cdk-silent dependency for test-renderextra 2afae5c
  • Added renderextra to dist-large and test-all targets 8178890
  • Removed redundant code from ChemObj clone - the existing code did exactly what the copy constructor of HashMap does and thus provides a cleaner implementation 8f7c0ab
  • Added removal of stereo elements in 'removeAllElements()' - documentation has been updated d1e8fa6
  • Added check to ensure a DoubleBondStereochemistry is never created with more than 2 bonds - this would cause errors with some methods. f8f98fe
  • Removed print to standard out from ChemObjectBuilders ab6c308
  • Removed redundant code - we don't need to check whether the bond is already in the container as we create a new instance. We also don't need to check the array size as this is done by addBond(IBond) c7786c9
  • Moved TetrahedralChirality from data to core. 0e41b05
  • Added unit test and @TestMethod annotations for new 'isEmpty' methods aa0f969
  • add isEmpty() to classes/interfaces ChemModel, AtomContainerSet, ReactionSet, Crystal and AtomContainer f0c14fe
  • Be more informative when the test fails 8b8a848
  • Added missing test annotation bbf00f6
  • Documenting new method and extended unit test 74d206e
  • Removed SVN tags, as suggested by the reviewer 4e35feb
  • Removed cdk.create dates, as suggested by the reviewer 49b6541
  • Testing that benzoquinone is perceived as aromatic using the alternative detection method. 702d8ba
  • Because the placement of double bonds is not deterministic, we cannot be sure we always get them at the same location. Better is to just test that all carbons are perceived as C.sp2 and that they are aromatic. 77a778f
  • Added an alternative aromaticity perception model, which is happy about double bonds pointing outwards from aromatic rings b5bf695
  • Added the missing S.2minus atom type for selenide 367ff4f
  • There is no Se.2 atom type in the CDK; the perception seems to match Se atoms with two neighbors; I added two unit tests for the changed code, assuming one and two implicit hydrogens 48e6060
  • Added a missing import and dependency f400a5e
  • Added aromaticity-based perception: N.planar3 67c27bd
  • Added a null check and return immediately (fixes #1260) 0a981a4
  • Ignore this failing test case; it was one of the original points at which we decided new tools were needed 8adfd90
  • Added double bond stereo 5b644f9
  • Added a data model for double bond stereochemistry 76577b4
  • Added similar testing to reader and writers to fix four further unit tests for support of matching against some IChemObject interface class 47f35b9
  • Fixed the readers and writers to also accept the matching interfaces (fixes #3553780) bfc674a
  • Test that the reader and writers also "accept" the interfaces they support, see bug #3553780 f30f9a3
  • Added a unit test for the JCP bug report for the SDG about bridging hydrogens 4d3db46
  • simplify by calling getConnectedBondsCount() 320a21d
  • only delete non-multibond H's; fixes JCP issue 8 d0d785c
  • Fixed the unit test, similar to commit d1da5276dae4a21a4c45d9fa41816be5eb646b4aa: the compound is aromatic. 81d7be7
  • Adds a unit test and fix for the loading of atom pair descriptors. a6ab39c
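
The HashMap sizing mentioned in the AtomContainer.clone() entry above comes down to a small piece of arithmetic. Below is a minimal sketch of it, not the code from commit f98b04a: the class and method names are made up for illustration, and the formula is my recollection of the approach used in Guava, so treat it as an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the CDK: names are for illustration only.
final class PresizedMaps {

    /**
     * Initial capacity for which expectedSize entries stay below the
     * default HashMap load factor of 0.75, so no resize is triggered.
     */
    static int capacityFor(int expectedSize) {
        if (expectedSize < 0) {
            throw new IllegalArgumentException("expectedSize must not be negative");
        }
        if (expectedSize < 3) {
            return expectedSize + 1;
        }
        if (expectedSize < (1 << 30)) {
            // expectedSize / 0.75, plus one to stay clear of the threshold
            return (int) (expectedSize / 0.75f + 1.0f);
        }
        return Integer.MAX_VALUE;
    }

    /** Creates an empty map that can take expectedSize entries without resizing. */
    static <K, V> Map<K, V> newMapFor(int expectedSize) {
        return new HashMap<K, V>(capacityFor(expectedSize));
    }

    public static void main(String[] args) {
        // e.g. a map from original to cloned atoms for a 12-atom container
        Map<String, String> atomMap = newMapFor(12);
        atomMap.put("original C1", "cloned C1");
        System.out.println("capacity requested for 12 entries: " + capacityFor(12));
    }
}
```

Simply passing the expected number of atoms as the initial capacity would not be enough, because the map resizes once it is 75% full.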
The authors

26  Egon Willighagen
25  John May
 3  Ralf Stephan
 2  Stephan Beisken

The reviewers

25  Egon Willighagen
17  John May
 2  Rajarshi Guha

Brushing my biology: cool diagram of chromosomes in the nucleus

I already mentioned this ENCODE discuthon we have next week. As I have to discuss stuff about hypersensitive DNA regions, I have to seriously brush up on my biology. Brush up?? That suggests there was a decent basis. Well, think again. I thought history was much more interesting!

I am a chemist. When people talk about the DNA in the cell, I always pictured a single molecule. Until I learned there are 46 chromosomes. So, each cell actually has 92 DNA molecules. That was a revelation I had somewhere in my second or third year at the university. Remember, I did not have biology in secondary school.

Anyway, no book ever showed me what that really looks like. Yeah, schematics. Cell diagrams with one mitochondrion in the cell. Well, bloody yes, the cell has very many of them, thank you very much. It just did not fit the diagram, I guess...

So, I just ran into this way cool figure from the Three-Dimensional Maps of All Chromosomes in Human Male Fibroblast Nuclei and Prometaphase Rosettes paper by Bolzer et al. (doi:10.1371/journal.pbio.0030157). I actually ran into the Wikipedia version of it, in the chromosome article. It's an adaptation, but the original is even better.



I can only wish they had added a Jmol applet with the 3D rods, rather than these static images.

But I find the flatness at 90° weird... what is the story behind that? Is their method not really or not fully 3D? I guess I will have to find some time to read up on their Methods section...

Saturday, November 10, 2012

Java Puzzlers and FindBugs. Running them on the CDK silent module

The CDK has been using PMD for quite a while now, but there is another tool, called FindBugs. I had seen this before, but until I watched two Java Puzzler videos John sent me (here is one), I had not used it much. There is a nice Eclipse plugin, and you can run it on any Java package.

As I was procrastinating anyway (I should be preparing my core teaching qualifications portfolio and a presentation on the accessibility of chromatin in the human genome, for an upcoming ENCODE discuthon. Mind you, you have a lot of individual DNA molecules in your cells! Just your core 92 to start with and then your mitochondrial DNA (do all mitochondria in one cell have the same DNA??), and the fact that the DNA in your toes is unlikely to be the same as in your nose (I learned that at the DiXA meeting in Berlin), or that your microbiome with its own DNA has a huge influence on your well-being?? Well, that was our dinner table discussion anyway).... (still breathing...)

... so, while I was procrastinating, I ran FindBugs on the org.openscience.cdk.silent package. I mean, what could possibly go wrong? ...


Thirteen possible problems! I mean, seriously, this is the core of the CDK!
Here are three unit tests to uncover three of the found issues. If you like, take them as "CDK Puzzlers"...

  @Test public void testCompare_MassNumberIntegers() {
    Isotope iso = new Isotope(Elements.CARBON);
    iso.setMassNumber(new Integer(12));
    Isotope iso2 = new Isotope(Elements.CARBON);
    iso2.setMassNumber(new Integer(12));
    Assert.assertTrue(iso.compare(iso2));
  }

So, we create two 12C isotopes, which should be the same. Of course, this test failed. The culprit is a == comparison in the compare() method's code: two separately created Integer objects are not the same object, even when they hold the same value. Doing the same starting with ints goes better, even after casting, and the next test does not fail:

  @Test public void testCompare_MassNumber() {
    Isotope iso = new Isotope(Elements.CARBON);
    iso.setMassNumber(12);
    Isotope iso2 = new Isotope(Elements.CARBON);
    iso2.setMassNumber((int)12.0);
    Assert.assertTrue(iso.compare(iso2));
  }

And, indeed, using Integer.valueOf() helps too, something that PMD is keen on suggesting as well, and the next unit test runs fine:

  @Test public void testCompare_MassNumberIntegers_ValueOf() {
    Isotope iso = new Isotope(Elements.CARBON);
    iso.setMassNumber(Integer.valueOf(12));
    Isotope iso2 = new Isotope(Elements.CARBON);
    iso2.setMassNumber(Integer.valueOf(12));
    Assert.assertTrue(iso.compare(iso2));
  }
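
To try the three cases above outside the CDK, here is a minimal, self-contained sketch. The class and method names are hypothetical and this is not the Isotope code. It only shows reference comparison versus value comparison on boxed integers. The Integer constructor is deprecated in recent Java versions, hence the suppressed warnings.

```java
import java.util.Objects;

// Hypothetical demo class, not part of the CDK.
final class BoxedIntegerPuzzler {

    /** Reference comparison: true only if both arguments are the very same object. */
    static boolean sameObject(Integer a, Integer b) {
        return a == b;
    }

    /** Value comparison that also copes with null on either side. */
    static boolean sameValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }

    /** Always creates a new object, like 'new Integer(12)' in the first test. */
    @SuppressWarnings({"deprecation", "removal"})
    static Integer fresh(int value) {
        return new Integer(value);
    }

    public static void main(String[] args) {
        // two distinct objects with the same value: == says "different"
        System.out.println(sameObject(fresh(12), fresh(12)));
        // valueOf() returns cached objects for values from -128 to 127: == says "same"
        System.out.println(sameObject(Integer.valueOf(12), Integer.valueOf(12)));
        // equals-based comparison does not care how the objects were made
        System.out.println(sameValue(fresh(12), fresh(12)));
    }
}
```

Note that the valueOf() case only works with == because 12 falls inside the cached range; for larger mass numbers that is not something to rely on, so comparing by value is the safer route.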

The other two issues I wrote tests for have the same underlying cause, and these two tests fail. But note that here we do not even need to pass in objects ourselves:

  @Test public void testCompare_ExactMass() {
    Isotope iso = new Isotope(Elements.CARBON);
    iso.setExactMass(12.000000);
    Isotope iso2 = new Isotope(Elements.CARBON);
    iso2.setExactMass(12.0);
    Assert.assertTrue(iso.compare(iso2));
  }

  @Test public void testCompare_NaturalAbundance() {
    Isotope iso = new Isotope(Elements.CARBON);
    iso.setNaturalAbundance(12.000000);
    Isotope iso2 = new Isotope(Elements.CARBON);
    iso2.setNaturalAbundance(12.0);
    Assert.assertTrue(iso.compare(iso2));
  }
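
One way such a comparison could be made robust is sketched below. It is not the actual CDK fix, and the class and method names are hypothetical. The idea is to compare boxed doubles by value, with explicit handling of unset (null) fields, because there is no guaranteed cache for boxed doubles that would make == work even by accident.

```java
import java.util.Objects;

// Hypothetical helper, not part of the CDK.
final class BoxedDoubleCompare {

    /** Exact value comparison; two unset (null) values count as equal. */
    static boolean sameValue(Double a, Double b) {
        return Objects.equals(a, b);
    }

    /** Comparison with a tolerance, for values that went through arithmetic. */
    static boolean closeEnough(Double a, Double b, double tolerance) {
        if (a == null || b == null) {
            return a == b; // only true if both are null
        }
        return Math.abs(a - b) <= tolerance;
    }

    public static void main(String[] args) {
        // both literals are autoboxed into Double objects on the way in
        System.out.println(sameValue(12.000000, 12.0));
        System.out.println(closeEnough(12.0, 12.0 + 1e-12, 1e-9));
    }
}
```

Whether exact equality or a tolerance is the right choice for exact masses and natural abundances is a design decision I leave open here.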

The tests are filed as a patch here.

I think my peer review has just become a bit tougher...

Wednesday, November 07, 2012

The #OpenScience Working Group needs you

For some time now I have been a member of the Open Science working group of the Open Knowledge Foundation. As such, I organized lunch meetings in Stockholm about Open Science (join this mailing list) and participated in working group efforts, such as running Is It Open Data (RIP) on the HCLS LODD data sets (many of them turned out to not be Open at all). Also, I would love to have Open Science lunch meetings in Maastricht (and/or in Eindhoven), and if you do too, join this mailing list.

But the working group does a lot more, and this week Jenny Molloy sent out a call for participation. There are a lot of possibilities; she wrote:
    Dear All,
    
    As we have grown to over 400 people on the mailing
    list and activities have expanded it would be great
    to form a committee with a few more people on board to
    make sure the working group is as effective as
    possible in achieving our mission of opening up
    science and scientific research outputs.
    
    If you'd be interested in getting more involved in
    running the group and our activities and projects,
    get in touch! It would also be great to have
    representatives in local areas willing to act as
    open science champions in their own countries or
    cities.
    
    The types of roles we'd like to fill are below, as
    well as a reminder of some of the projects we've got
    on the go. The time commitment will be flexible and
    relatively low, but it will make a big difference to
    have someone keeping an eye on specific areas! If you
    know anyone not on the list who might be interested
    in getting involved, please forward this message to
    them.
    
    Working group coordinator (working with Jenny)
    - Blog Editor
    - Tech/Dev Lead
    - Event Organiser
    - Designer
    
    Active Projects:
    
    - Panton Principles and Panton Fellowships
    - Content Mining Manifesto
    - pyBOSSA
    - Open Research Data Handbook
    - Open Science blog
    
    Tools and activities from other working groups of
    special interest:
    
    - BibServer (developed by Open Bibliography)
    - Open Access Index (@ccess and Open Bibliography)
    - DataHub (CKAN Team)
    - Who Needs Access? (@ccess)
    
    If you have any questions, please let me know - I
    look forward to hearing from you!
    
    Jenny
    
Really, there is a lot you can do, and as Peter Murray-Rust replied to Jenny's call, you do not have to be paid as a scientist to join!

Tuesday, November 06, 2012

What online services support InChI and REST?

The adoption of InChI is increasing, despite its limitations. But one thing I find greatly missing is chemical databases supporting access to entries via the appropriate InChI. I know there are resolvers around, but that is different. They do a search, and give me multiple links to individual structures that may match to a certain extent. I am not interested in that in this context.

What I want instead is to be able to deep link to a particular entry in ChemSpider, PubChem, HMDB, or whatever database using the InChI instead. The only service currently supporting this that I am aware of is rdf.openmolecules.net. It uses a URI pattern like http://rdf.openmolecules.net/?$INCHI. For example, the entry for methane is http://rdf.openmolecules.net/?InChI=1/CH4/h1H4 and this URI deep links to the entry of methane, rather than a search result list.
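
In code, the URI pattern is nothing more than string concatenation. Here is a minimal sketch, with hypothetical class and method names. It appends the InChI as-is, like in the methane example, and whether a service would want percent-encoding for other InChIs is not something I am assuming either way.

```java
// Hypothetical helper illustrating the http://rdf.openmolecules.net/?$INCHI pattern.
final class InchiDeepLink {

    static final String BASE = "http://rdf.openmolecules.net/?";

    /** Builds the deep link for one InChI; the InChI is appended unchanged. */
    static String deepLink(String inchi) {
        if (inchi == null || !inchi.startsWith("InChI=")) {
            throw new IllegalArgumentException("expected a string starting with 'InChI='");
        }
        return BASE + inchi;
    }

    public static void main(String[] args) {
        // the methane example from the text
        System.out.println(deepLink("InChI=1/CH4/h1H4"));
    }
}
```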

So, the core requirement is that the database URI tells me, either: "yes, this is the one and only entry matching 100% this InChI", or "no, I do not have data for this structure".

What other databases support deep linking using the InChI? And what would the URI look like?