Wednesday, June 27, 2012

Isbjørn #2: added recursion and FOAF support

I added recursion and FOAF support (foaf:page, foaf:homepage, foaf:depiction). More to follow, including CHEMINF support for properties.

Spidering the Semantic Web for Linked Open Drug Data (a start)

Hacking up a quick Bioclipse plugin to spider the Linked Open Drug Data network for useful information. The Bioclipse Scripting Language snippet (JavaScript dialect):

Monday, June 25, 2012

CDK gatekeepers do not scale

Some six years ago, Linux kernel development hit a critical point: the observation that Linus does not scale. The CDK is facing similar issues now. Recently, the amount of activity has gone up, and both Rajarshi and I seem to have had less time than normal. As you know, the CDK project uses peer review, but, unfortunately, the amount of peer reviewing is low. Part of this is perhaps due to people's inexperience with git, which makes reviewing harder than needed. Nor do we have a proper peer-review system set up, like Gerrit. This is rather disappointing to people, and particularly to those who work hard on those patches.

This must change. The solution the Linux kernel development community adopted was that of trusted lieutenants: developers with reasonable git knowledge who act as co-editors of the kernel. They receive patches, make sure they get reviewed, and, importantly, make them ready for inclusion in the main development tree. Except for the last step, we already use this approach: any CDK developer can sign off a patch, taking responsibility that the patch is ready for inclusion in the main tree and meets our project standards. That signals the gatekeeper to pull it into the main tree. The only thing missing is that we do not have people formally taking that role and, in particular, making sure to collect the patches they approved in a single branch.

However, some CDK developers are experimenting with this, and the time has come to discuss how we, as the CDK project, want to give this further shape. A critical aspect here is that the gatekeepers can trust these lieutenants, which requires that the lieutenants know the ins and outs of the CDK community standards.

Therefore, I have created a Doodle so that we can schedule a CDK developers meeting, and I hope a lot of currently active (and formerly or future active) CDK developers will join. The meeting will probably take place over Google+ Hangout, but Skype is an alternative if the former is not possible or not wanted.

Friday, June 22, 2012

Depiction of chemical graphs

I just read about the new ChEBI release, and in particular this nice drug with gold in it:

This is a wonderful structure!

Notice the InChI given for this structure below, which actually shows a charge-separated structure. And this InChI has a "/q;;+1". That suggests to me that only the gold is charged, and thus that the full structure is charged. That does not match the drawing.
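
To make that reading of the charge layer concrete, here is a minimal Python sketch. It assumes the usual InChI conventions: layers are separated by "/", the "/q" layer lists one charge per component separated by ";", an empty slot means a neutral component, and a prefix like "2*" repeats a charge. The function name `component_charges` is made up for this illustration, and the "..." in the example string is a placeholder for the elided formula and connectivity layers, not the real InChI of the gold compound. The sketch also ignores the "/p" (proton) layer, which can change the net charge as well.

```python
import re


def component_charges(inchi):
    """Return one integer charge per component, read from the first /q layer.

    Hypothetical helper for illustration only; it is not part of any
    InChI library. Returns an empty list when there is no /q layer.
    """
    for layer in inchi.split("/")[1:]:
        if not layer.startswith("q"):
            continue
        charges = []
        for field in layer[1:].split(";"):
            if not field:
                # Empty slot: this component carries no charge.
                charges.append(0)
                continue
            # Optional "n*" multiplier followed by a signed integer.
            match = re.fullmatch(r"(?:(\d+)\*)?([+-]\d+)", field)
            if match is None:
                raise ValueError("unexpected charge field: %r" % field)
            count = int(match.group(1) or 1)
            charges.extend([int(match.group(2))] * count)
        return charges
    return []


# Placeholder string: only the /q layer matters for this sketch.
charges = component_charges("InChI=1S/.../q;;+1")
print(charges, "net charge:", sum(charges))
```

Read this way, "/q;;+1" describes three components, of which only the last one is charged, so the sum over the "/q" layer is +1.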

Is that inconsistent? Yes. Welcome to cheminformatics :)

CDK 1.4.11: the changes, the authors, and the reviewers

The CDK 1.4 series is really getting into slower waters. With only one new feature anticipated, we still see bug fixes though, and I'm sure there are a few left to review.

For me, the most important patch is Kevin's, for finding the positions of double bonds, which matters when taking SMILES input (see SMILES, Bioclipse, double bonds and Finding where to put double bonds...). The other two changes are files for developing CDK with NetBeans and for roundtripping 2D-based stereo information for bonds in CML.

The changes

  • CML Bond Conversion Test [patch:3530233] 63f0aeb
  • Improved conversion of AtomContainer to CMLMolecule b4e480c
  • Provide the necessary project files for NetBeans 8caf84c
  • Changes to avoid inadvertent alteration of passed IMolecule and to deal with molecules lacking implicit hydrogen atom information 710af3e
  • Added the tool to the smiles module, and the tests to the module's test suite 575dd2c
  • Class to fix bond orders for aromatic rings 7867bd6

The authors

2  Kevin Lawson
2  John May
2  Egon Willighagen
1  Jonty Lawson

The reviewers

2  Egon Willighagen 
2  Nina Jeliazkova

Thursday, June 21, 2012

SMILES, Bioclipse, double bonds

I wrote the other day about Finding where to put double bonds... and that Kevin and Klas were writing code to address that issue. Klas' code is more general, but as yet unfinished. For the upcoming Bioclipse 2.6 (yes, it is really happening!), we settled on Kevin's code as a post-processing step after SMILES parsing. That does not solve our support of MDL molfiles with query bonds, but it's an important release blocker solved:

So, the latest build will automatically assign double bonds, so that when you start editing the structure, it is much easier for Bioclipse to keep up with what is going on, solving all sorts of corner cases. (A good alternative name for cheminformatics would be cornercasinfomatics :).

I stress once more that if people would just say what they meant, all these things would be non-issues. It's really the scientist messing up, and Bioclipse (/CDK) has to figure out where he messed up. I still have a very strong opinion on having your cheminformatics software make assumptions just because the scientist has been lazy. That is just the world upside down.

One last reminder: you can switch the rendering between aromatic rings and normal double bonds:

So, with that 'Show Aromaticity' unchecked, you can depict structures like this:

Actually, as you can see in that Preferences page, you can set a lot of rendering options. The JChemPaint rendering stack is very tunable. We have long wished to use a CSS-like format to share such settings, but have never gotten around to it. Anyone interested?

Saturday, June 09, 2012

Friday, June 08, 2012

Adding papers in Mendeley by DOI


I'm sure all the Mendeley users know this already, but you can very easily add papers by DOI (Add entry manually). Just click the magnifying glass icon.

Tuesday, June 05, 2012

Sunday, June 03, 2012

Tax-paid research as innovation infrastructure

We pay tax to fund things we as a community have a need for: public transport, infrastructure to support industry, health care, etc. We also fund universities, because we've learned that increased knowledge about the things around us is good for the country too.
However, we typically leave open what is good for the country. It is debatable. Are highways for industry, or for people to go to work? Health care? Only for those who do not smoke? Is it acceptable that only a subset of the people in a country benefit, assuming they will propagate that benefit to others? Or should everyone have direct benefit? I think for many things these issues are pretty clear: everyone can use the trains on the railway infrastructure.
With this step taken, we make things a bit more complex: we add additional fees. Yes, you are allowed to make use of the railway infrastructure, but you have to pay for how much you use the trains on that infrastructure. Do you pay twice? That's probably just wording. Practically, it is probably fair. The tax covers the infrastructure, but the services on top of that are not paid for: you pay for infrastructure and use separately.
The above focuses on the concept of infrastructure: the things that enable other things. A railway system that enables people going to work; allows people to live in a nice living area, but work where the action is.
Where does that place research by universities and governmental institutes? These were - I guess; I wasn't there - started with the idea that they trigger innovation and create a more pleasant society (e.g. one without cancer, with friendly people, with a proper amount of food, with cheap consumables, with few societal problems, etc, etc).

Academic/governmental research is an infrastructure on which we can serve society.

And we, the people, are happy to pay for that, because it will create a better future.
We should then wonder how those pieces fit (or can fit) together. Follow the money. The people pay tax, this goes to building a knowledge infrastructure, which is used by industry to create new medicine, more efficient power supplies, etc, etc, to provide mankind with a future.
Therefore, we all want our buck well spent. Importantly, this doesn't just apply to citizens, but is hugely important too to all those small- and medium-sized enterprises (SMEs) as well as large companies that work on this infrastructure to build new products! This is not some leftish statement, and has nothing to do with left-right; it has to do with shaping our future.
Enter publishing. Publishing is part of that infrastructure; it has done hugely important work in disseminating knowledge. Researchers had trouble going from Amsterdam to Boston, or just to Milano. It was much more efficient to write a letter. Or a few letters, or, when you like to reach a few hundred people, to write a paper! Yeah! Because some brilliant dude (m/f) invented printing, and journals. No longer a need to explain your new insights in person; hundreds of others could just read it! Cheaper, faster, more efficient.
We now fast-forward to the year 2000. Moving around within Europe has become as cheap as moving around in the Netherlands. In fact, flying from Eindhoven to Manchester is probably cheaper than taking the train to Brussels. More interestingly, getting your new insights known to fellow researchers is done by blogging and Open Notebook Science. The publishing approaches of 100-150 years ago have become expensive and inefficient. They have no future.
Journals no longer need printing; they no longer need trucks, trains, and planes to be sent around the world: we have the internet, which enormously reduced the dissemination cost.
Thus, we return to a buck well spent. The old publishing system is just old. Publishing worked, and was important. Books were expensive to print, and journals were expensive to distribute; the cost of these things scaled with the use. These costs, however, have been reduced enormously (not removed) with the rise of the internet and computer technology in general. Some things are still expensive, like copy editing, doing the layout, and helping authors make fancy graphics. But these things scale with the creation, not with the distribution.
Therefore, the gold Open Access publishing model, where you pay a fee for getting it to look pretty rather than paying for reading it, makes much more sense to me. It makes dissemination of the research more efficient, less expensive, and a buck better spent. Every buck that does not go into an old-fashioned, expensive publishing system can be spent on a smart girl cooking up new insights.
The above makes perfect sense to me, and I am sure it makes some sense to you too. We must reboot our norming-forming-performing (terms from pedagogical theory) process, and rethink what the knowledge dissemination part of the research infrastructure should be like.
The Open Access movement has been doing that for some years now, and I have been happily supporting this. The latest iteration is a petition to the Obama administration to make a formal statement on what they think that reformulation of that research infrastructure should be like.

In this We the People petition, a new norm is asked for: that research funded by tax money is made available to everyone in the cheapest possible manner. Gold Open Access is at this moment the most efficient model I know of to do that. For me, GOA is not necessarily the only option, but key is that the major cost is now in the creation of the disseminated material, and no longer in the distribution.
The current rate of signing shows that the required number of 25000 signatures will be reached today. But with 6 billion people on the earth, who all want a better future, who all want a buck well spent, I would like to see this petition reach ten times as many signatures: 250 thousand signatures. That sounds like a much better goal. We have 2 weeks left for that, if I am not mistaken. And that needs your support.
If you agree, please spread the word among all your non-scientist friends and family, make them aware of the issues, and point them to the petition (or this post). It's a free world: I am not suggesting you or others should sign; instead, I explained why I did, and hope to have made you think about the situation too.
This post is CC-SA-BY and I welcome translations into any language, so that this message spreads further. If you translate it, please let me know, so that I can link to it. Remember, you do not need a U.S.A. passport to sign this petition. Anyone from any country can sign, so consider what a strong statement from the Obama administration can mean for (gold) Open Access in your part of the world.