## Sunday, July 31, 2011

### Groovy Cheminformatics 3rd edition

Update: the fourth edition is out.

I am starting to get the hang of this publish-soon, publish-often thing, and just uploaded edition 1.4.1-0 of the Groovy Cheminformatics book. The cover is the same (with one typo fix), and the content is 20 pages thicker. True, six of those pages are isotope masses of all natural isotopes. That leaves 14 pages with this new content:

• Section 2.7 on line notations with 2.7.1 about reading and writing SMILES
• Section 6.3 about Sybyl (mol2) atom types
• Section 7.4 on atom numbering with 7.4.1 on Morgan atom numbers, and 7.4.2 on InChI atom numbers
• Chapter 9 on molecule depiction with the new rendering code, with
  • Section 9.1 on drawing molecules,
  • Section 9.2 on rendering parameters, and
  • Section 9.3 on the generator API and how to add custom content
• Section 11.4 on calculating aromaticity
• Appendix A.2 listing all Sybyl atom types
• Appendix B listing all naturally occurring isotopes
Feature requests are most welcome.

## Tuesday, July 19, 2011

There is a rather interesting, and very important, discussion going on at the Blue Obelisk eXchange: what should people do with articles that are, in their opinion, clearly wrong? The question highlights a paper where homology modeling was incorrectly performed, identifying the wrong active site, and consequently yielding a rubbish QSAR model.

The question is about how to take this to the community. Get a commentary published in the journal? Take it to the blogosphere? Others joined in with additional stories, confirming that cheminformatics is no different from other fields with respect to bad papers. A retraction is very unlikely, but commentaries are also very uncommon in our field, even though we all have our top five of bad papers.

Peer-review is no longer the answer. As wdiwdi describes, it is all too easy to just submit the paper, unmodified, to a lesser journal. And honestly, nowadays, when reading the ToC of the top journal in our field, the JCIM, I skip more than 80% of the papers ("Oh no, not another 15 structure QSAR-docking paper"). I recently, very briefly though, heard Wendy Warr (I value her opinion very much, and try to read whatever comes from her for which I do not have to pay personally) say that cheminformatics representation is a done business (in slightly other words), but I doubt that can be true (and tend to disagree), if we see so much literature of which we question the usefulness, even in the JCIM.

Option 1: Write to the journal with a commentary or letter.
It seems, however, that this is not frequently done. Indeed, I chose not to do this either. The barrier is high: suspicion is not good enough, and you have to provide rock-solid proof. Ah, but that is hard in a field where reproducibility itself is at a low level (see our Blue Obelisk paper), though there is increasing awareness (the Blue Obelisk paper is among the five most cited in the journal in 2010, see this paper). Actually, checking the JCIM author guidelines, the journal does not, in fact, have a manuscript type for this kind of community response. Neither does MolInf seem to have one. The JChemInf does have a commentary type, but note that these are normally commissioned by the editorial board. And, as Antony points out, editors may object to community replies to articles.

Some newer journals allow you to just comment on a paper. This is a very simple but effective approach to making community feedback possible.

Option 2: Use the power of the blogosphere.
This is a tried-and-proven mechanism. It has been shown to have major impact in chemistry, with the sodium hydride story as the most prominent example. If properly crafted, and taking advantage of services like ResearchBlogging, this can have a major impact. But it has the downside that long-term visibility is not necessarily preserved. Short-term visibility in the browser can be organized with userscripts (as we have shown in this paper), and PLoS journals make the blogosphere part of their article level metrics, with ResearchBlogging again as an important intermediary.

Why?
Why should we not just ignore those papers (as we now do), but act on them? I think that a good, high-level communication in the field is important to make this field go forward. It is my personal feeling that cheminformatics has effectively come to a halt. New generation cheminformaticians have great trouble finding good education with most research groups and education facilities in Europe closed down (what is the situation in the USA?), resulting in a field where people decreasingly know how to properly evaluate research, reducing the level of peer-review, etc, etc.

So, I think a spark in the community re-enforcing good quality cheminformatics papers is critical to the future of the field. And I do hope the people who contributed to the Blue Obelisk eXchange question so far will start writing about the papers they did not like. I am going to closely follow the discussion, and am looking forward to what comes out of it. I would also like to invite the community to start blogging: that is, what do you feel is needed to improve the quality of cheminformatics literature?

I would say, yes, we clearly need a better alternative to the current peer-review.

## Wednesday, July 13, 2011

### Data, Nonotify, or Silent?

I cannot find the bug report just now, but the CDK has an open problem with change event notification, where the nonotify classes still caused change events to be sent around.

This was because the nonotify classes extended the data classes in the wrong way. So, I worked today on copying the data class implementations into a new implementation that does not extend the data classes, while removing the listener code: the silent module. I'm not entirely done yet, but close enough to blog about it. While checking things, I ran the cheminfbench code on it, with these results:
So, removal of the notification listening improves the performance, when reading a 416 entry SD file. I think the difference will be more significant for other tasks, like ring finding.

But, but...?!?! Yeah, this is a rather weird plot indeed... the blue bar should also be lower than the red one! And it used to be too... :( Bad regression... hard to unit test too :(

OK, back to some final clean up.

Update: the clean up is done, and I have now run the fingerprint benchmark from cheminfbench using the new module and nonotify. In a situation where change events are used much more (as with fingerprint calculation), we see that nonotify still improves speed, and that the new silent module shows about the same speed up. We also see that the 1.4.x classes are a bit slower than the classes of some 20 months ago. That probably reflects bug 2992921 that was recently fixed. The full bar plot:

Red and blue are CDK 1.2.x (as the plot legend says), green and yellow the same for CDK 1.3.x (both clearly faster than the 1.2 series), and purple and light blue the same for CDK 1.4.0. The last bar is the new silent module, a tad slower than nonotify.

Update2: OK, one last update. The performance difference can actually be larger than this. The below screen shot shows the effect of the silent module (blue, yellow) on SMILES generation (without and with lower case formalism, red and green respectively):

If you did not get it yet: if you bring your system to production level, do not use the default implementation, unless you really need change notifications.
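To make the difference between the three flavours concrete, here is a toy Python sketch; it is not CDK code. All class and method names are made up for illustration, and the leak shown (a setter that notifies listeners inline, so that overriding one method in a subclass does not silence it) is only one hypothetical way in which extending the data classes can go wrong. The actual CDK bug may have looked different.

```python
class DataAtom:
    """Default flavour (hypothetical): setters notify registered listeners."""

    def __init__(self):
        self.listeners = []
        self.charge = 0
        self.flags = set()

    def notify_changed(self):
        for listener in self.listeners:
            listener("changed")

    def set_charge(self, charge):
        self.charge = charge
        self.notify_changed()

    def set_flag(self, flag):
        self.flags.add(flag)
        # Notification is inlined here, so it bypasses notify_changed().
        for listener in self.listeners:
            listener("changed")


class NoNotifyAtom(DataAtom):
    """Subclass flavour: silences notify_changed() but inherits set_flag()."""

    def notify_changed(self):
        pass


class SilentAtom:
    """Stand-alone flavour: no listener code at all, so nothing can leak."""

    def __init__(self):
        self.charge = 0
        self.flags = set()

    def set_charge(self, charge):
        self.charge = charge

    def set_flag(self, flag):
        self.flags.add(flag)


def count_events(atom):
    """Register a listener where possible, change the atom, count the events."""
    events = []
    if hasattr(atom, "listeners"):
        atom.listeners.append(events.append)
    atom.set_charge(1)
    atom.set_flag("aromatic")
    return len(events)


for cls in (DataAtom, NoNotifyAtom, SilentAtom):
    print(cls.__name__, count_events(cls()))
```

With these toy classes, the default class reports both changes, the subclass still leaks one, and the stand-alone class has no notification path left to forget.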

### CDK Forks

Forking is an important part of Open Source development, and forking is good. Of course, forks should interact too, and genes from one fork should merge back into another fork. Forks are probably also a good indication for the success of a project: if a project is forked, it means it is significant. On the other hand, it can also mean that the main project is too hard to work with. Maybe the CDK is that. Indeed, it's easier to not have your code peer-reviewed, and just fork. That is freedom. (There might be other reasons too.)

The CDK is forked. Forked several times, in fact. I have now started a tracker on SourceForge to aggregate information about these forks, and the state with respect to back-integration of code into our fork. I was aware of the AMBIT fork for a long time, as one of the authors (Nina) has contributed. Of the others I only learned via publications (PaDEL, ScaffoldHunter), and in case of Craft, it was a personal ping that made me aware of it. Craft is all the more exciting because the distributor, Molecular Networks, is primarily known for their proprietary products.

Porting all this code back into the main CDK library is not trivial, and often a lot of work. The current core CDK development team will not be able to do this, and the project relies here on contributions from others to do the integration, and convert code from those forks into proper patches. This is likely interest driven, which is one of the reasons why I started the new tracker. The entries report (briefly) at this moment what interesting functionality is available from those forks, but feel free to add comments with detailed information, such as class names that provide that functionality, so that the CDK community can share the burden of reintegrating this code.

OK, enough for now.

Jeliazkova, N., & Jeliazkov, V. (2011). AMBIT RESTful web services: an implementation of the OpenTox application programming interface. Journal of Cheminformatics, 3 (1). DOI: 10.1186/1758-2946-3-18
Wetzel, S., Klein, K., Renner, S., Rauh, D., Oprea, T., Mutzel, P., & Waldmann, H. (2009). Interactive exploration of chemical space with Scaffold Hunter. Nature Chemical Biology, 5 (8), 581-583. DOI: 10.1038/nchembio.187
Yap, C. (2011). PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry, 32 (7), 1466-1474. DOI: 10.1002/jcc.21707

### CDK URL shortener: openbugs and openpatches

I finally got so annoyed with SourceForge forgetting my filter settings when switching between the CDK Patches and Bugs trackers, that I set up the URL Shortener functionality and created two short URLs for quick access to all open bugs and all open patches:
This is the current full list, but any CDK admin can add more. So, send in your requests.

## Wednesday, July 06, 2011

OK, now that you have seen the outcome, I'll give a short walk through on how the data ended up there.

First, I registered. Easy. No OpenID yet, and I do hope they will add that. But you already got one, because you were so keen to test the SPARQL end point for the ChemPedia data, right?

Next, I added a data set. Or better, I added an entry for the data set, as the data is only added later. I added a name and description, then selected the Science category and the license, and left the rest empty.

The next step is to subscribe to all five APIs yourself, which you can do with the button on the right:
I skipped the Upload Data button. I used curl instead. The actual command I used (except I used my real API key), looks like:
curl -S -v -H "Content-Type: application/rdf+xml" \
-d @substances.xml \
"http://api.kasabi.com/dataset/chempedia-rdf/store?apikey=XXX"

This command uses HTTP POST to send the content of the substances.xml file to the given address, using the -H option to set the mime type of the content.
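For readers who prefer a script over curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch under assumptions: `build_upload_request` is a name invented here, the API key is a placeholder, the payload is a dummy document, and the request is only constructed, not sent.

```python
import urllib.request
from urllib.parse import urlencode

# Store URL as used in the curl command; the API key goes in the query string.
STORE_URL = "http://api.kasabi.com/dataset/chempedia-rdf/store"


def build_upload_request(rdf_xml: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP POST carrying RDF/XML content."""
    url = STORE_URL + "?" + urlencode({"apikey": api_key})
    return urllib.request.Request(
        url,
        data=rdf_xml,  # the request body, like curl's -d @substances.xml
        headers={"Content-Type": "application/rdf+xml"},  # like curl's -H
        method="POST",
    )


# Dummy payload standing in for the content of substances.xml.
payload = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
request = build_upload_request(payload, "XXX")
print(request.get_method(), request.full_url)
# Passing the request to urllib.request.urlopen() would actually send it.
```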

I created the substances.xml with the same script as before, with the important differences that:
1. the resource URIs must use the domain data.kasabi.com, followed by the path dataset/chempedia-rdf. Kasabi will then pick this up and, without further work, make the data available as Linked Open Data
2. the RDF should not use anonymous resources (aka blank nodes)
I updated my Groovy script accordingly, and uploaded the RDF/XML with the above curl call.
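The two requirements can be illustrated with a small Python sketch. This is not the Groovy script mentioned above: the function names are invented here, the `substances/` path segment is an assumption based on the example resource URI of the data set, and the blank-node check is deliberately rough (it only flags top-level elements without an `rdf:about` attribute, and ignores nested nodes, `rdf:ID`, and `rdf:nodeID`).

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Requirement 1: the data.kasabi.com domain plus the data set path.
BASE = "http://data.kasabi.com/dataset/chempedia-rdf/"


def substance_uri(identifier: str) -> str:
    """Mint a resource URI under the data set's base (path layout assumed)."""
    return BASE + "substances/" + identifier


def top_level_blank_nodes(rdf_xml: str) -> int:
    """Requirement 2, roughly: count top-level nodes lacking rdf:about."""
    root = ET.fromstring(rdf_xml)
    about = "{%s}about" % RDF_NS
    return sum(1 for node in root if about not in node.attrib)


SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="%s">
    <rdfs:label>named resource</rdfs:label>
  </rdf:Description>
  <rdf:Description>
    <rdfs:label>anonymous resource</rdfs:label>
  </rdf:Description>
</rdf:RDF>""" % substance_uri("2-2595-7562-8125")

print(substance_uri("2-2595-7562-8125"))
print(top_level_blank_nodes(SAMPLE))
```

A count above zero means the file still contains at least one anonymous resource that needs a proper URI before uploading.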

In fact, the first RDF/XML I uploaded did not have those two changes, but Leigh Dodds explained to me what I had to do to get the Linked Data feature going. So, I had to delete the original data, which requires you to reset your data set, which you can also do from the command line, with:
curl -S -v -H "Content-Type: application/rdf+xml" \
-d @reset.json \
"http://api.kasabi.com/dataset/chempedia-rdf/jobs?apikey=XXX"

Where the content of the reset.json file looked like:
{
"jobType": "reset",
"startTime": "2011-07-06T12:08:00Z"
}
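Writing the timestamp by hand is easy to get wrong, so here is a minimal Python sketch that generates the same job description. The two field names and the timestamp format are copied from the file above; the function name `reset_job` is made up for this example.

```python
import json
from datetime import datetime, timezone


def reset_job(start: datetime) -> str:
    """Serialize a reset job with its start time as a UTC timestamp."""
    payload = {
        "jobType": "reset",
        "startTime": start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return json.dumps(payload, indent=2)


body = reset_job(datetime(2011, 7, 6, 12, 8, tzinfo=timezone.utc))
print(body)
```

Saving `body` as reset.json gives a file that can be posted with the curl command above.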

After I reuploaded the new RDF/XML, dereferencing the resources was working nicely. It's not perfect yet, and I think I will tune things a bit more, and start using CHEMINF too. But, if I am not mistaken, I have now qualified for this badge :)

### ChemPedia-RDF #2: Kasabi

Kasabi is a new RDF hosting service by Talis. It's still in beta, and I have been testing their beta service with the RDF version I created of ChemPedia Substances (the now no longer existing cool web service from MetaMolecular to draw and name organic molecules).

Kasabi makes the RDF data available via a few APIs, depending on the APIs selected by the uploader. I picked all five of them, just to see how things work. Of direct interest is the SPARQL end point, but also the option to host the data as dereferencable resources. Cool! That was just what was missing for me.

Now, using the API requires you to get an account. This will allow Kasabi to control the traffic, and as such creates a business model around providing services around Open Data. I think this approach will work. But just to make clear, this does mean you need to get an account first, if you would like to play with this data. Once you have an account, you get an API key, and you can append that to any URI with ?apikey=XXXX to authenticate yourself. I think this does mean Kasabi will have to go to an HTTPS connection, which is not yet the case. Moreover, you will need to subscribe to the data set too. That, in fact, with #altmetrics in mind, sounds really interesting :)
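Appending the key is trivial for a bare URI, but slightly fiddly when the URI already has a query string, as a SPARQL request would. A minimal Python sketch, assuming a helper name (`with_api_key`) and a second example URL that are both made up for illustration:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_api_key(uri: str, api_key: str) -> str:
    """Return the URI with an apikey parameter added to its query string."""
    parts = urlsplit(uri)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("apikey", api_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(with_api_key("http://api.kasabi.com/dataset/chempedia-rdf/store", "XXXX"))
# Hypothetical URI that already carries a query parameter:
print(with_api_key("http://example.org/sparql?query=ASK", "XXXX"))
```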

The ChemPedia RDF data is available at:

http://beta.kasabi.com/dataset/chempedia-rdf

This web page will give the five APIs, of which the augmentation one is really interesting, but I have not played with that yet to say much about it. The idea of that API is to augment RDF you post with data from the data set. Like in augmented reality. That should be cool for mashups.

Now, the APIs I do understand include this SPARQL end point (remember to add your API key!):

http://labs.kasabi.com/explorer/sparql/sparql-endpoint-chempedia-rdf

And the Linked Data feature. In the next post, I will explain how I tweaked the original data, how I uploaded it, and how this resulted in the dereferencable resources, like:

http://data.kasabi.com/dataset/chempedia-rdf/substances/2-2595-7562-8125.html

Note the links for RDF/XML, RDF/JSON, and Turtle, directly accessible by replacing the .html extension with .rdf, .json, and .ttl respectively. An API key does not seem required for this, which makes perfect sense.
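The extension swap is simple enough to script. A small Python sketch, where the function name `alternate_urls` is invented here and the format-to-extension mapping is the one listed above:

```python
# Serializations and their extensions, as listed in the text.
FORMATS = {"RDF/XML": ".rdf", "RDF/JSON": ".json", "Turtle": ".ttl"}


def alternate_urls(html_url: str) -> dict:
    """Derive the data URLs by replacing the .html extension."""
    suffix = ".html"
    if not html_url.endswith(suffix):
        raise ValueError("expected a URL ending in .html")
    stem = html_url[: -len(suffix)]
    return {name: stem + ext for name, ext in FORMATS.items()}


urls = alternate_urls(
    "http://data.kasabi.com/dataset/chempedia-rdf/substances/2-2595-7562-8125.html"
)
for name, url in urls.items():
    print(name, url)
```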

It took me some chatting with the people from Talis, who have been very helpful, as the whole platform was a bit overwhelming. But, for the first time ever, I actually got Linked Open Data online, in a Linked Data manner.

## Sunday, July 03, 2011

### SMILES generation

I'm updating my Groovy Cheminformatics book, and hope to release the third edition in a week or so, based on the CDK 1.4.0. In fact, nothing has changed that makes the 2nd edition outdated; the new edition will add a few sections, that's all. It will include a new chapter on 2D rendering based on two earlier blog posts, a code example for aromaticity detection, and a section on how to read and write SMILES, as shown in the screenshot.

### CDK 1.4.0: the changes, the authors, and the reviewers

The time has come. With the help of Rajarshi, Gilleain, and others, the last few important glitches have been removed, including a few found since 1.3.12. This post will not discuss all the new stuff in 1.4.0 since the 1.2.x series; that we will see soon enough, I guess. Instead, like all those other TCTATR posts, I will just list the changes, the authors, and the reviewers for this particular release. Also special thanx to the various users who sent in bug reports and even small patches, like Dmitry and Jonty!

Since 1.3.12, mostly bug fixes have been made. Glancing over the below list, I do not see much that really stands out. Here is the full list:
• Updated unit test to test of all those elements are still present, but more too 1f54188
• Fixed potential NPE. Also moved the debug statement inside the loop so that each atom in the list is reported. 3ad6073
• A test class for the atom placer, and a couple of unit tests 5438093
• Unit tests for helium and americum for MF string generation d2a2162
• MolecularFormulaManipulator: added element symbols in generateOrderEle* . - added symbols up to and including "Cn" (#112) - updated accompanying JavaDoc - fixes CDK bug #3340660 c248d16
• MolecularFormulaManipulator: fixed whitespacing / bracket style. "Whitespace only" - consistently indented with tabs (was mixed) - made javadoc indent consistent - made {}-use in if/else-if blocks consistent - made 'if(' vs. 'if (' consistent (also for else/for) df59d71
• Fixed spelling error, and added useful whitespace bb51081
• Fixed bug in RingGenerator: it now returns the parameters from the superclass too 185520e
• Updated unit tests according to cdk-jchempaint ML: we expect Line and Oval elements df7c7f5
• Updated unit tests according to cdk-jchempaint ML: we expect 1 AtomSymbolElement 1d2d707
• Fixed various JavaDoc errors 789a6a4
• Fixed false positives about missing Jena classes, by including those jars to the classpath too da493da
• Removed @inheritDoc because the method does not override any method 5ccf7d7
• Replaced cdk.author with just @author e41a9c7
• Replaced cdk.svnrev with cdk.githash d2e502e
• Ensure that we set the diagonal values of the Burden matrix correctly. Fixes bug 3347528 b008a4b
• Updated so that the writer does not fill in the valency field in the atom block by default, and added an IO setting to trigger writing. 20f3639
• Updated boron unit test to just check that we parsed boron d77c936
• Fixed the BasicSceneGeneratorTest: nothing is drawn by this generator -> expect zero elements; test the right class b0c50cd
• Typo fix 8eba14e
• Fixed returning of the descriptor result type: actual length 5049178
• Extend the MolecularDescriptorTest ac02901
• Added missing test annotation 4601c4b
• Code clean up: use IMolecule interface, and properly typed List 6ae652f
• Fixed potential NPE. 280381d
• Added missing module testing a1b4961
• Updated SMILES parser consider aromatic boron part of the organic subset. Added a test case for bug 3160514 6a25800
• Updated Pubchem fp SMARTS patterns in response to Andrew Dalkes bug reports 130fa04
• Removed non-existant dependencies a8ceaa0
• The setAtoms() method itself throws a change even too, so the listener must be reset *after* that call. 0e1b952
• Unregister the listeners for the global atoms, not the local one 1ed28ee
• Removed old identifiers, incompatible with git 44a3d4a
• Send around a change event when flags are set (fixes #2992921) 8eed94c
• Overwrite the setFlags() test for notification for the NoNotification classes 0f3080b
• Added unit tests to verify that both setFlag() and setFlags() give a change event, addressing bug #2992921 01161d6
• Updated the copyright list to reflect the descriptors history 8539e23
• Removed 'this' as listener from the atoms no longer in this container 6237571
• Implemented actual reading of CDK/N3 files f06dcb6
• Extended test to highlight that I forgot to implement the actual read method b11336a
• Merged two methods (fixing #3089188) 7b46968
• Improved error message to show what character the parser was trying to interpret as symbol 555f856
• Fixed error message suggestion to have an upper case element symbol in brackets (addressing #3160514) cf47901
• Overwrite the setAtoms() test for notification for the NoNotification classes f5b1818
• Added unit test for bug #2993609 for not removing listeners with setAtoms(IAtom[]) cd07ebc
• Fixed order in assertion: expected value comes first 3c15ff3
• Updated the expected fingerprints for the hybridization fingerprinter 49fd969
• Code clean up: use generics 1d6aade
• Some code clean up: generics and one variable now starting with a lower case char dff478c
• Use the HybridizationFingerprinter: faster, not suffering from aromaticity 89b0f1e
• Added unit test to see if descriptors give proper identifiers, and not the $template b2275db
• Added missing @TestMethod annotation 6bf07ac
• Updated copyright statement, following the git log 19218e6
• Added .gitattribute files to have the$ fields for the descriptors specifications updated again a4bd711
• Added a missing test class to the suite b14795d
• Moved the reaction and descriptor ontologies to the dict module 530f02e
• Added a missing dependency e847231
• Removed dictionaries that are in the atomtype module fd54ca2
• Updated for chiral SMILES parsing: @@H-like statements yield an explicit hydrogen, changing the number of expected atoms and bonds, and the index of the charged sulphur 9413571
• Fix in the annotation-based coverage testing: if classes do not have an explicit constructor, they gave a false positive in the coverage testing. I am now talking advantage of the annotations array to be empty, for the implicit constructors Java adds itself, though private constructors have no annotation either. However, those do not need testing, as they are already typically indirectly tested. 4737199
• Fixed ClassCastException, by properly 'converting' an IAtomContainer into a IMolecule (yes, yes, I know, we're going to drop the IMolecule interface later...) a4fd02c
• Added a missing dependency on atomtype, introduced by Mol2WriterTest 673f754
• Added @cdk.bug annotation, and removed output to STDOUT 9048e78
• Fixed Mol2Writer to also accept NNMolecule, etc fac6ab5
• Fixed unit tests, to match the current implementation d33c697
• Added a test for IMolecule.getLonePairCount(), casting to an IMolecule 648f4ca
• Arom detection was enabled, and no casting to .2 needed anymore. 0b8581d
• Added missing unit test in io module: Mol2WriterTest bfacac5
• Updated SMILES reader so that we can specify a builder object. Using NN builder speeds things up for large SMI files da0a62a
• Added a test case for bug 3315503 to ensure that Mol2Writer is not throwing an NPE when faced with an unknown atom type. Also added test data file. aa35eae
• Updated Javadocs. Fixes bug 3322592 63619c2
• Updated Javadoc to fix a variety of Javadoc errors (see bug 3322594) 57d3199
• Updated Javadoc to fix a variety of Javadoc errors (see bug 3322602) 1043b3d
• Fixed typing, so that we work with IAtomContainer rather than IMolecule b74143c
• Added missing dependency of test-dict on vecmath.jar 98c4855
• Removed non-existing dependency declaration: qsarprotein does not depend on diff d7a229a
• Removed obsolete meta info: sinchi module no longer exists d4530d0

The Authors
The high numbers are explained by the fact that we were in bug fix mode. Many small, simple patches have been applied, at a very rapid pace. In fact, we have been so active, we reached the top 50 most active projects on SourceForge last weekend!
65  Egon Willighagen
13  Rajarshi Guha
2  Dmitry Katsubo
1  Jonty Lawson
1  Gilleain Torrance


The Reviewers
32  Rajarshi Guha
8  Egon Willighagen
7  Gilleain Torrance


## Saturday, July 02, 2011

### The KEGG subscription model

KEGG's primary funding ran out, and they decided to go for a subscription model, as you likely will have picked up by now. KEGG has been used a lot by many, likely largely because it was freely available before. But KEGG is not Open Data, and this will slowly be realized by lots of biologists and bioinformaticians who will now have to pay from 2000 up to 5000 dollars.

The rationale is simple. 1. Funding ran out; 2. curation is expensive; 3. money is needed for continued evolution of the data. The next year will be very interesting for a number of reasons, some of which are like seeing the GPL being taken to court.

First of all, it is of utmost importance that the subscription supports the future development of KEGG, not the hosting. In fact, many have made a copy of relevant bits from the FTP site before it closed down. This data cannot be shared, but it is nevertheless. That takes us to my second observation: KEGG data is all around. Many sites are already (and have been for some years) redistributing the data, such as Bio2RDF and Chem2Bio2RDF who provide that data via a SPARQL end point, or otherwise. In fact, there are still dozens of places where you can download the KEGG data freely (as in free beer!).

Closely related to this is that multiple independent academic groups are using KEGG data, and have set up new metabolism-related websites, including the Human Metabolic Atlas, BioMeta, and many, many more. On top of that, there are many alternative databases which provide the same kind of information, which will attract the lurking bioinformatician who does not have 2000 dollars to run a quick pathway enrichment test. That is, KEGG is in a market with a lot of competition. Though, the KEGG brand is strong, and could be enough for a vendor lock-in effect. (Group leaders may say "WTF for did you not use KEGG instead of this beta-brand database? Nature will never accept that!!". Of course, you could attempt starting a discussion about data quality, validation, BioMeta, ... good luck with that :)

What can KEGG do about protecting their IP? Well, as they never formally gave permission to redistribute the data, they might go after competing efforts which have used KEGG data. Will KEGG? I do not know; I hope not, because the bioinformatics community will probably object, driving people away from KEGG instead. Accept that situation then? I do not think that will work either, because lurking is just a sad fact of life science informatics: people take easily, but contributing back takes an effort.

What I hope will happen, and that is probably what KEGG is anticipating, is that all those derived databases will in fact take a license, though I have to say 5000 dollars is not much then, nor did I read anything about such a license allowing these derived databases to redistribute the data.

My personal preference is an Open Data approach, where KEGG will work together with the other databases. However, political forces may be inhibiting this. How large is the chance that the Human Metabolic Atlas will drop their brand and join a KEGG consortium? How large is the chance that existing efforts will agree on a license?

Another thing that might happen is that KEGG will slowly disappear from the scene. Maybe people will realize that Open Data is in fact an important way to simplify international collaborations. Maybe Open projects like WikiPathways will now be preferred. Maybe we will see an Open Data KEGG commons, with branded web interfaces around this. The time is right. Open Access is booming, and Open Data is up next, and high on the list too. The question is how soon the biologists and bioinformaticians follow. Open Source, after all, is mostly liked by these groups because of the free beer, not because of its free speech character.

## Friday, July 01, 2011

### Running 2.6 GB of ChEMBL data through the CDK Atom typer with CDK-Taverna

CDK-Taverna 2.0 with Taverna 2.0 can in fact do this. Major technological improvement! Also, you get visual feedback on how far it has progressed. It still is rather unpolished, but I am happy to see progress!