Wednesday, September 28, 2011

Cb: Meet the new Doctor

The last year and a half has been very demanding for me and my family. Many changes, many things left behind (like having moose, called älg in Swedish and eland in Dutch, and also known as the Eurasian elk, in your garden with apple trees; or more than 250 cranes in the fields around your house), and ultimately the decision to go back to the Netherlands (plans are barely unfolding). (Oh, those moose just love 'apples on the rocks'.)

Dr Who
Another thing I (already) left behind is Chemical blogspace. I have done this with a lot of pleasure for several years, and still use it actively to get notified about interesting papers (see this post). Peter Maas, a cheminformatician working with Specs in the Netherlands, is the new Dr. Who of Cb, donating free hours to the Cb website! The website has been updated:

Many thanx to Peter for taking over this project! The system still needs a lot of updating, robustifying, etc, so please talk to Peter if you have patches to contribute. New blogs, problems, etc are to be sent to him too. Please spread the news.

Tuesday, September 27, 2011

Bioclipse-Oscar4 - Text mining in Bioclipse

Almost a year ago I started a position with Peter Murray-Rust to work on Oscar for three months (see this overview of results; a paper by the full Oscar team (Sam, David, Dan, Lezan) is pending, and I'm really happy to have been able to contribute bits to the project). Since then, I have had little time :( That's how it goes, with post-hopping, unfortunately. One thing I did do after that, was write a Bioclipse plugin.

I was asked recently via LinkedIn if I was planning a Bioclipse-Oscar plugin, and I realized that I forgot to blog about it. So, here goes. The oscar manager I implemented follows the Oscar API, and these methods are available: extractText(), findNamedEntities(), and findResolvedNamedEntities().

When I wrote the plugin, I also uploaded an example workflow to MyExperiment. The code is:

// Demo showing the Oscar text mining functionality
// in Bioclipse
var html = "..."; // the HTML to mine; the original value is not preserved in this copy
var text = oscar.extractText(html);
// the next step may take some time, while
// initializing the Oscar software for the
// first time
var mols = oscar.findResolvedNamedEntities(text);
var file = "/Oscar Demo/extractedMols.sdf";
cdk.saveSDFile(file, mols);

The code will extract chemical entities, and open a molecules table in Bioclipse:

Monday, September 26, 2011

Online SEURAT workshop: LarKC: The Large Knowledge Collider

Another SEURAT-1 cluster invitation:

LarKC: The Large Knowledge Collider

Wednesday September 28th 2011, 15:00 CEST, 6:00 PDT

(organized by ToxBank SEURAT-1 DAWG)

The Linked Resources concept is a key component of the ToxBank description of work in providing a data warehouse platform for the SEURAT-1 cluster. Deriving new knowledge from existing data is the task the ToxBank data warehouse will facilitate. Reasoning is an approach from the field of knowledge management that does exactly that: find new facts based on existing data.

In this presentation, Spyros Kotoulas (post-doc at the Free University of Amsterdam) will talk about the EU FP-7 Large-Scale Integrating Project LarKC. This project aims to develop the Large Knowledge Collider (LarKC, for short, pronounced “lark”), a platform for massive distributed incomplete reasoning that will remove the scalability barriers of currently existing reasoning systems for the Semantic Web. To this end, the LarKC platform facilitates creating high-performance workflows for Linked Data using an open set of reusable components. We will give an overview of the current state of the LarKC platform and its capabilities as well as a description of its application in data integration and Life Sciences.

The meeting will be held online for SEURAT-1 participants and will be organized via GoToMeeting. There are only 25 seats, so early registration is important, as we will fill seats on a first-come basis. However, one ‘seat’ can host multiple scientists behind a single computer.

Please send an email to Egon Willighagen listing your name, SEURAT-1 project, and email address. Details on how to log in will be sent to that address shortly before the meeting.

Sunday, September 25, 2011

Altmetrics and the "Impact report for Institutet för miljömedicin (IMM)"

As you know (if not, and you are a scientist, then you must be living in a cave), science is heavily discussing the role of impact analysis in research assessment. What is the impact of a particular study? Is this young scientist worthy of becoming a professor? Etc, etc. There is no ultimate, one-size-fits-all solution, but if your institute is still using the Journal Impact Factor... (see below).

As possible solutions, alternative metrics are being proposed, and even manifestos like that of #altmetrics. Some of the proposals can at least be used as a filter. Many of my readers are already using the social web for filtering out interesting papers. You have to, because of the tsunami of publications. Solutions have been abundant, from many perspectives. I will not review all those here, and will just mention (deep links to where literature is discussed and/or proposed) Chemical blogspace, Mendeley, CiteULike, ResearchBlogging, the new Altmetrics Explorer (see below), etc, etc, etc.

These services point you to papers that others found interesting. And the more people find a paper interesting, the more likely it is that you will too. This is why alternative metrics focus on the number of HTML reads, bookmarks, etc. They reflect how interesting the paper is. That doesn't say much about the future impact of the paper, but more about its potential.

The real impact of a paper is much harder to assess. Really, if you still believe the Journal Impact Factor does that, think again. The CDK project is suffering from a focus on the wrong metrics: it is not published in Nature or Science, but the impact is very significant: it is being used in many tools that are extensively used. So, to measure the impact of the CDK, you should also take into account the impact of research standing on the shoulders of the CDK. In a weighted manner, of course.

And mind you, even the number of citations of the paper itself already gives a much more accurate view of the impact than the JIF of the J. Chem. Inf. Mod., which is insignificant compared to that of Nature or Science. But the CDK papers are cited well above the JIF of Nature; in fact, the first CDK paper is in the top 5% of most cited literature in the world (according to the KI bibliometrics project), and the second CDK paper is close to it. That is impact, I would say. And I am now just waiting to have that impact reflect on my future. After all, according to this KI project, I am cited three times the world average and 2.5 times the average KI scientist, but am still a mere post-doc (one, in fact, returning to the Netherlands; for family reasons, and I am going to miss my current colleagues very, very much!).

Now, various projects are popping up to visualize various aspects of impact (short and long term), such as CitedIn, Google Scholar, and a new tool called Total Impact which I ran on our IMM group at Mendeley (see this link):

I will not deny it is rewarding to see people care enough about your publications that they bookmarked them, but there is another important aspect as to why these services are important. Business intelligence is probably the proper term here. The long term impact indicators are, of course, much easier to interpret here. If your work is being cited, it is easy to check if and how your work is being used. This is harder with people bookmarking your work, but it does point you to new application areas, if you take the time every other month or so to see what fields the people who bookmark your paper are working in. That way, you get to see what research fields your work may have impact in, and thus, perhaps, find a topic for your next paper.

rrdf 1.5: Accessing SMW SPARQL end points behind LDAP authentication

We are using a Semantic MediaWiki (SMW) for the Gold Compound selection task by the ToxBank in the SEURAT-1 cluster, funded by Colipa and the EC. I do stress that despite being funded by Colipa, they have no control over my research; they just co-fund it. This Gold Compound wiki is hidden behind a cluster agreement-wall (at some point this data will become Open), which is implemented with HTTP Basic Auth on the front and LDAP authentication in the background. That actually combines nicely with (S)MW, and automatically logs people in to their (linked) wiki account.

Now, the great thing about SMW is that it is machine readable. It basically allows you a custom DBPedia, and I am using this to capture knowledge from the NanoQSAR literature, as blogged in Importing Nanotoxicity Data with SPARQL into R for analysis. It turned out that the SMW wiki is simply using 'basic HTTP authentication' for the part between web server and web client (thanx to chats with Nina), and LDAP between web server and authentication server. That meant that doing the authentication in Jena was trivial too, and I could simply use QueryEngineHTTP.setBasicAuthentication().
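To illustrate why basic HTTP authentication is so easy to support on the client side: the client only has to send one extra header with the user name and password joined by a colon and base64-encoded. The following is a minimal sketch of that encoding, assuming a Node.js runtime (for Buffer); the function name basicAuthHeader is made up for this example and is not part of Jena or rrdf.

```javascript
// Build the value of the HTTP "Authorization" header for Basic
// authentication: "Basic " followed by base64("user:password").
// basicAuthHeader is a hypothetical helper, not a Jena or rrdf call.
function basicAuthHeader(user, password) {
  const token = Buffer.from(user + ":" + password, "utf8").toString("base64");
  return "Basic " + token;
}

// Placeholder credentials, as in the R snippet further down.
console.log(basicAuthHeader("user", "password"));
```

Note that base64 is an encoding and not encryption, so this scheme only makes sense over a connection that is itself protected.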

I updated rrdf to version 1.5 to support this too (see this patch; and thanx to Kurt Hornik for taking care of the CRAN incoming/). This means that I can now extract Gold Compound data directly into my favorite statistics software R (but it would equally work with other tools that have SPARQL support, like Bioclipse), and do all sorts of fun stuff with the data, like validation, consistency checking, data mining, you name it. Like plotting pKa values (with rather uninformative segments :):

The wiki uses the RDFIO extension for SMW written by my former M.Sc. student at Uppsala University, Samuel, who presented this module at SMWCon last week.

SEURAT-1 GCWG and ToxBank members can email me for details on how to link the Gold Compound wiki to R (or other software). But it basically comes down to running a command like this:

predicates = sparql.remote(
  endpoint, # placeholder variable: the URL of the SPARQL end point, not given in this post
  "SELECT DISTINCT ?predicate WHERE { [] ?predicate [] }",
  user="user", password="password"
)

Sunday, September 18, 2011

CDK 1.4.3: the changes, the authors, and the reviewers

CDK 1.4.3 is the third update in the 1.4 series, aimed at fixing bugs, though this series can also be updated with new functionality. Users can be assured that the project has a peer-review process, and an extensive test suite to ensure functionality meets standards. Mind you, we are still in need of many, many more tests, but that's another story.

This update release has a mix of things. First, it contains another big batch of atom types by Nimish, Asad, and Gilleain. If our counting is right, only one atom type remains. It takes the number of atom types known by the CDK software over 300. This release also contains a few bug fixes. For example, it was noticed that some test classes were not automatically run, which when hooked in resulted in 5 failing unit tests (out of a few hundred). These bugs in IReactionScheme implementations, DebugMolecularFormula, and NNMapping are not regressions, and were present in earlier 1.3 and 1.4 releases too, but are now fixed. The scope was only minor, which is probably why they were not noticed before. Furthermore, a thread-safety bug was fixed, as was setting the background color in 2D depictions.

New functionality in this release is the ArrowElement, which fixes a bug in the CDK-JChemPaint patch, and the silent module, which will replace the nonotify module in master (if a removal patch gets accepted). Earlier timings showed excellent performance, and all users are encouraged to use this implementation when the listening model of the core interfaces is not used.

The changes
  • Ported ArrowElement from renderextra to renderbasic 8650700
  • Added missing cloning of the properties aeaa993
  • Removed add/removeListener() calls 496f242
  • Removed notifyChanged() calls 822189e
  • Implemented the silent module as a nonotify replacement, copied from the data module, with tests copied from test-nonotify. 78fb9c8
  • Added two missing tests to overwrite the notifying default implementation 294cecf
  • Added missing call to super d81fc3d
  • Fixed getting the isotope count for a particular isotope df8e31e
  • Added missing test classes to the nonotify test suite bb2e41f
  • Added missing test classes to the datadebug test suite 6bea1fa
  • Ru final 220d183
  • S final 55ed20d
  • Br final 81c629a
  • Zn final 8cf1d4d
  • V final 862b91d
  • Al final fed0c1a
  • Added Ag+ and Ag-* atom types (addressing #3388240) 033fb9f
  • Added missing test class d98ec5b
  • Se atom type test cases added 55c6a83
  • Se atom type cases added ad7753a
  • Se atom type added 02e65a4
  • Unit test for one of the current S atom types 82c2c4a
  • Ti final ee1ac09
  • Removed obsoleted code by Ni patch 682d0a1
  • Ni final 8026156
  • Sr final 458d1aa
  • Pb final 6207b4f
  • Tl final ae54e0c
  • Hooked in Mg atom type detection 81ce6e5
  • Mg final c562129
  • Hooked in Gadolinum atom type perception 00e9aac
  • Gd final 770dc99
  • Removed hybridization state: you cannot create five orbitals starting from four ef34cca
  • Sb final 9b1fe00
  • Mo final a8c44bc
  • Pt final cc46f63
  • Cu final 177e5bd
  • Fixed OpenJavaDocCheck reporting: added two missing property values 3152dd0
  • B final e159e5b
  • Ca Atom type test case added 517dd0c
  • Ca Atom type data added b91943a
  • Ca Atom type added 6c3f09f
  • No longer caches the background color, fixing that when the value in the model changes *after* passing it to the visitor it is not updated too e4786fa
  • thread safety c4067cd
The Authors
Again with the note that Nimish's work is reflected by the patches by Gilleain and Asad.

21  Egon Willighagen
21  Gilleain Torrance
  3  Syed Asad Rahman
  1  Jonathan Alvarsson

The Reviewers

21  Egon Willighagen
  7  Gilleain Torrance
  1  Rajarshi Guha

Saturday, September 17, 2011

InChIKey collision: the DIY copy/pastables

About two weeks ago, the ChemConnector blog reported an InChIKey collision detected by Prof. Goodman. Unlike the previous collision, this one was based solely on the graph and not on stereochemistry. The two molecules both have the InChIKey OCPAUTFLLNMYSX-UHFFFAOYSA-N:

The compounds are really different: the molecular formulas are C50H102O and C57H114O, respectively. The SMILESes are OC(C)C(C)CC(C)C(C)CCC(C)C(C)CCCC(C)C(C)CC(C)C(C)CCCC(C)C(C)CCC(C)C(C)CC(C)CCCCCCC and O=C(C)CC(C)C(C)CCC(C)CCC(C)C(C)C(C)C(C)C(C)C(C)C(C)C(C)CC(C)C(C)C(C)CC(C)C(C)C(C)CCCCC(C)C(C)CC(C)C(C)C. The IUPAC names are useful to have as copy/pastables too (e.g. with the OPSIN-based 'Molecule from IUPAC name'-wizard in Bioclipse 2.5, which has been updated to the latest OPSIN version this week): 3,5,6,9,10,14,15,17,18,22,23,26,27,29-tetradecamethylhexatriacontan-2-ol and 4,5,8,11,12,13,14,15,16,17,18,20,21,22,24,25,26,31,32,34,35-henicosamethylhexatriacontan-2-one.
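For those who want to check the molecular formulas against the SMILES strings without a cheminformatics toolkit at hand, here is a rough sketch. It is not a SMILES parser and does not use the CDK: it assumes an acyclic compound written with only C, O, branches, and double bonds (which holds for these two strings), counts the heavy atoms, and derives the hydrogen count as 2C + 2 minus two per double bond. The function names are made up for this example.

```javascript
// Format one element of a molecular formula: ("C", 2) -> "C2", ("O", 1) -> "O".
function formulaPart(symbol, count) {
  if (count === 0) return "";
  return count === 1 ? symbol : symbol + count;
}

// Hypothetical helper for this post only. Assumes an acyclic molecule whose
// SMILES uses nothing but C, O, parentheses and '='; anything else is rejected.
function formulaFromSimpleSmiles(smiles) {
  if (!/^[CO()=]+$/.test(smiles)) {
    throw new Error("Only C, O, '(', ')' and '=' are supported in this sketch");
  }
  const count = (pattern) => (smiles.match(pattern) || []).length;
  const carbons = count(/C/g);
  const oxygens = count(/O/g);
  const doubleBonds = count(/=/g);
  // Saturated acyclic C/O compounds have 2C + 2 hydrogens;
  // every double bond removes two of them.
  const hydrogens = 2 * carbons + 2 - 2 * doubleBonds;
  return formulaPart("C", carbons) + formulaPart("H", hydrogens) + formulaPart("O", oxygens);
}

// The two colliding structures, copied from the text above.
const alcohol =
  "OC(C)C(C)CC(C)C(C)CCC(C)C(C)CCCC(C)C(C)CC(C)C(C)CCCC(C)C(C)CCC(C)C(C)CC(C)CCCCCCC";
const ketone =
  "O=C(C)CC(C)C(C)CCC(C)CCC(C)C(C)C(C)C(C)C(C)C(C)C(C)C(C)CC(C)C(C)C(C)CC(C)C(C)C(C)CCCCC(C)C(C)CC(C)C(C)C";
console.log(formulaFromSimpleSmiles(alcohol));
console.log(formulaFromSimpleSmiles(ketone));
```

For anything with rings, charges, aromatic atoms, or other elements, a real toolkit is the way to go.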

I am adding these structures to the course book and the matching Bioclipse plugin this weekend.

Tuesday, September 06, 2011

How should we draw that sulfur anyway?

Yesterday I asked about the existence of a particular sulfur compound. A reference to it has not been found so far (a patent was posted by Chris in the comments yesterday, but it only shows a related compound), but thanx to those who looked into it! Nina asked a colleague (an organic chemist) who told us the compound could exist, and pointed out that the sulfur in fact has a lone pair (much like the sulfoxide) which people often do not draw. Elsewhere, David reported an inconsistency in the stereochemistry of the sulfur compound between PubChem and ChemSpider, which prompted Antony to note that both structures give the same InChI nevertheless.

Now, recall the lengthy write up by Brecher on correctly drawing stereochemistry (which I discussed before). I have not checked what it says about sulfurs. Also recall the discussion last week about how sensitive the InChI code is to small angle changes with stereochemistry perception (ping me if there is online coverage; I cannot find it right now).

Let's zoom in on this sulfur:

A clear tetrahedral stereochemistry is suggested, but that is wrong. We must include the lone pair too. It's stuff like this that makes me care so much about hybridization hints in the CDK. The CDK does report the lone pair on the sulfur, but lists a sp3d2 hybridization, which seems wrong to me. Should that not be sp3d instead, or so?

Now, importantly, is the sulfur still chiral in the true coordination? I think it would be. The InChI library has its own opinion, causing it to give the same InChI even if the input files show a different stereochemistry.

So, my question yesterday is very much about this sulfur. That's why we need experimental data on existing compounds with such a sulfur. The compound I posted yesterday just happened to be one of the lighter compounds I found in PubChem.

The search continues...

Monday, September 05, 2011

Does this sulphur compound even exist?

Sulfurs are difficult; they do all sorts of wicked bonding. I was told QM approaches don't handle them easily either. They stink.

Cheminformatics models atoms, so sulfurs too. With the richer chemistry of sulfurs, come more atom types to model their behavior. The CDK project is currently discussing atom types for sulfur, and one is found in this compound (C8H16S):
ChemSpider has this compound deprecated, and a search will not find it. PubChem seems to have a stereoisomer (thanx to chemlynx for noting), but the question is: has this compound or one of its stereoisomers ever been found in nature or synthesized? Cheminformatics is solved, but answering this question is next to impossible, or really expensive at least. Does this compound exist? What is its melting point? What is its C/H/S NMR spectrum? Does it stink too? Does it have a crystal structure? What is the logP, the pKa? How is it synthesized? Can the structure even exist?

I will check tomorrow if my medical university has access to proprietary databases that can answer this question. I know there are some tools around to do searches on the world wide web. Please let me know what they come up with!

Sunday, September 04, 2011

"Cheminformatics? Solved."

I rarely consciously notice the Google Ads in my Gmail inbox, but I guess I do, because my eye fell on this advertisement today:

It is from one of the cheminformatics companies around, and if you wonder why I blog about it, you have a fair point. My point is that I heard a very, very knowledgeable person in the field (she might remember me commenting on it) say this about a year ago in Boston. As you probably assume, I disagreed. But I may be wrong, of course. So, I'll bring it up as a discussion point in my blog.

Seriously, if cheminformatics was solved, why are we predicting chemical properties so badly then? Why isn't molecular docking a push-button technology? Why do virtual screening methods fail so often at an (industrial) level (there is enough literature about this; just google a bit)? And why do so many databases get their chemical structures and mixtures wrong (or not even distinguish clearly between them)?

Well, clearly, they have been using software from the wrong company. To be fair, there is something to be said there. Some software does not take specifications seriously (count the number of tools that implement the MDL molfile to the letter; disregarding the unclear or inconsistent parts). And then there are bugs (let those who claim their software is 100% bug-free cast the first stone). And user requirements (which not uncommonly lead to deliberate breaking of established standards).

Sadly, cheminformatics is a field with few gold standards. Try looking around for freely (as in speech) available data sets you can use to test your implementation against. You will not find much (and please email me anything you find; or leave it as a comment in this blog). How many of the software tools you see around report the prediction error in the user documentation, with full detail on validation? How many users actually ask for full disclosure when they negotiate a license with a commercial cheminformatics vendor (really interesting question! I would love to read some numbers based on a survey on that!)? No one really knows or wants to know how well available tools 'solve' cheminformatics (except those who actually wrote the code).

Let's put all this aside; there is another aspect. Last year Rich Apodaca organized an important session at the ACS meeting in Boston on new representations in cheminformatics. Why in the world would he be doing that if cheminformatics was solved? Cheminformatics is a field of corner cases. Hydrogens always have a single neighbor, except when they don't. Carbons only have four neighbors, unless there is a double bond when they have three, or a triple bond when they have two neighbors. Except when they have five neighbors. Or have a charge. Or a single electron.
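To make the corner-case point concrete, here is what the naive textbook rule for carbon looks like in code. This is only an illustration with a made-up function name, not how the CDK or any other toolkit perceives atom types: it assumes a neutral carbon with exactly four valences, so it covers the first three cases from the paragraph above and none of the exceptions.

```javascript
// Naive rule: a neutral carbon has four valences, and a bond of order n
// uses up n of them. The argument lists the orders of the multiple bonds only.
function naiveCarbonNeighbors(multipleBondOrders) {
  let neighbors = 4;
  for (const order of multipleBondOrders) {
    neighbors -= order - 1; // a double bond costs one neighbor, a triple bond two
  }
  return neighbors;
}

console.log(naiveCarbonNeighbors([]));  // 4: only single bonds
console.log(naiveCarbonNeighbors([2])); // 3: one double bond
console.log(naiveCarbonNeighbors([3])); // 2: one triple bond
```

The five-neighbor carbon, the charged carbon, and the radical all fall outside this function, and each would need its own special case.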

Now, you might wonder if the CDK is the solution here. Obviously, if I am aware cheminformatics is not solved, the CDK must do a great job at at least doing the best job it can, right? We would love to. With "some" more funding we would have a go at it. But the CDK is behind proprietary products resulting from cheminformaticians that started long before most CDK developers did. I would not dare to judge the accuracy of the "Cheminformatics? Solved." claim based on the CDK project. All we try is to be transparent in how we try to solve things.

So, that brings me back to the advertisement. Should I really believe that this company solved cheminformatics? Should I trust them to take the field seriously if they claim they did it? These are rhetorical questions, and there is no right answer. I just think the ad was badly captioned.

Thursday, September 01, 2011

Community development is on my short list of obligatory reads, and always has great analyses of Open Source projects (and I can recommend getting a subscription). Last week there was a good post on collaborative development by Jake Edge based on a talk by Clay Shirky. It discusses some observations on how large collaborative projects work. This awareness applies to smaller cheminformatics projects too, and will help a project grow. Three principles are outlined, and one goes like:

That might appear to be a very large-scale collaboration, but it's not, he said. If you graph the contributions, you soon see that the most active contributors are doing the bulk of the work, with the top contributor doing around 500 edits of their own. The tenth highest contributor did 100 edits, and the 100th did 10 edits. Around 75% of contributors did only one edit ever.

That same pattern shows up in many different places, he said, including Linux kernel commits. These seemingly large-scale collaboration projects are really run by small, tight-knit groups that know each other and care about the project. That group integrates lots of small fixes that come from the wider community. Once we recognize that, we can plan for it, Shirky said.
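The three numbers quoted above (rank 1 with about 500 edits, rank 10 with 100, rank 100 with 10) are enough for a back-of-the-envelope look at how steeply contributions fall off with rank. The sketch below fits a straight line through the three points on a log-log scale; three points are far too few for a real fit, so take the result as an illustration only. The function name is made up for this example.

```javascript
// (rank, edits) pairs as quoted in the write up.
const quoted = [
  { rank: 1, edits: 500 },
  { rank: 10, edits: 100 },
  { rank: 100, edits: 10 },
];

// Ordinary least-squares slope of log10(edits) against log10(rank).
function logLogSlope(points) {
  const xs = points.map((p) => Math.log10(p.rank));
  const ys = points.map((p) => Math.log10(p.edits));
  const mean = (values) => values.reduce((a, b) => a + b, 0) / values.length;
  const xMean = mean(xs);
  const yMean = mean(ys);
  let numerator = 0;
  let denominator = 0;
  for (let i = 0; i < xs.length; i++) {
    numerator += (xs[i] - xMean) * (ys[i] - yMean);
    denominator += (xs[i] - xMean) ** 2;
  }
  return numerator / denominator;
}

console.log(logLogSlope(quoted).toFixed(2)); // prints -0.85
```

A slope of -1 would mean that the contributor at rank n does one n-th of the work of the top contributor; the quoted numbers come out a little flatter than that, but in the same long-tailed regime.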

This should be familiar to many of us. At least to the CDK project, which has a very small core, and a much larger group of people who make small edits. What the above analysis does not describe is that those small commits can often be crucial to the impact of the project. For example, the commit by Thorsten Flügel that led to a significant speed up. A small fix, but a major impact.

But, at the same time, we core CDK developers have to accept this, and live with it. This is one of the reasons that code must be peer reviewed, because after the patch is supplied, the maintenance is mostly on the shoulders of these core developers. I learned that the hard way. It's a bit like the learning process used by StackExchange, also outlined in the write up.

Therefore, it's up to the core developers to educate potential contributors and make contributing as simple as possible. GitHub, also discussed in the write up, does great work indeed. Fixing spelling errors (or adding a missing period after first JavaDoc sentences) is as simple as getting a GitHub account, hitting the 'Edit this file' button on the page showing a CDK source file, and starting to work.

Otherwise, the CDK community is very helpful in creating patches. You just have to ask, e.g. on our #cdk IRC channel. And, if there is enough interest, I am more than happy to organize a 'Making a CDK patch from scratch with Git and Ant' crash course.