Tuesday, August 28, 2012

Errors in Web of Science: anecdotal data

Because I needed to update my CV and publication list, I double-checked my own list against my online profiles at CiteULike, Mendeley, Google Scholar, and ResearcherID (derived from Web of Science). I have control over the former two, and they are significantly more accurate. Peter Murray-Rust tweeted this week that Web of Science only covers some 3%; I do not know about that, and in the life sciences it is more. But it is far enough from 100% that, for me too, not all my research output is captured by Web of Science.

Still, we all know that Web of Science is crucial to our future, because too many institutes use data derived from that database as part of their evaluation of their employees.

In that respect, I would love an equivalent of Retraction Watch: JIF-use Watch, blogging anecdotal stories about universities, funding agencies, and the like that misuse the journal impact factor in research(er) evaluations.

Anyway, back to Web of Science. I blogged recently about how it matches up to Google Scholar, and found a near-linear relation between the citation counts, at least partly explained by Google Scholar having a higher coverage of literature and being more up-to-date. In this post, I will focus on another aspect, that of error. Google Scholar captures a certain amount of non-literature, and it is still hard to filter that out (though I trust the Scholar team to come up with a solution for that).

But one should not think that Web of Science is free of errors. Despite the long delays, I doubt they actually have people doing the extraction. Why do I think that? Because they miss obvious citation links. For example, I just filed correction requests for two papers that cite the Bioclipse 2 paper but were not linked to the entry for that Bioclipse paper in their database.

Of course, it is easy to make mistakes in the lists of references we put in papers. Indeed, since we commonly have to enter the information manually, or with the help of tools, mistakes creep in. For example, in our CHEMINF paper we miscited a ChEBI paper which appeared on the NAR website in 2009, but got printed in 2010 and got that year as final citation information (if only publishers would just accept a list of DOIs):

And, as you can see, six other papers make that mistake. That means that the citation count for the paper is not 53 but 60. That is quite a deviation indeed. And a quick browse through my recent papers shows that the problem of unlinked citations is not uncommon.

For example, when I look at our recent Oscar4 paper, I find a variety of problems besides a miscite of Paula's paper: 15 more references where mistakes are made, such as taking the cited resource's title as source title, a resource incorrectly described as published in 1914, etc.

The best I can do is send in correction suggestions. Unfortunately, they are not always accepted, and the CHEMINF paper still occurs twice in the database. I must be clear that this is a very limited study, but I doubt that these issues are really specific to my situation. However, this creates a new aspect of "gaming": those researchers who spend more time on fixing Web of Science will get higher bibliometric scores. Same for journals, BTW.

If only academic institutes would consider Thomson's work the same way countries use financial ratings like triple-A: they are opinions (strictly speaking, I'd say they are measurements, and as such have measurement errors, and we all report experimental errors, right?). Did you know that the Moody's of this world, who give out those ratings, themselves insist they are merely opinions?

Wouldn't it be fun if JIFs would be supplied with a standard deviation? JChemInf: 5 +/- 3, Nature: 36 +/- 26. Cool, wouldn't it? Same for Web of Science / Google citation counts, such as for the first CDK paper: 198 +/- 13. My H-index: 13 +/- 2. Technically trivial. This is purely a political problem.
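To show how little is needed, here is a minimal sketch of such a "count with error". It assumes we simply treat the counts reported by different sources for one paper as repeated measurements; the 53 and 60 are the Web of Science numbers from above, while the Google Scholar count of 70 is a made-up value for illustration only.

```python
from statistics import mean, stdev


def count_with_spread(counts):
    """Return (mean, sample standard deviation) of citation counts for one paper."""
    if len(counts) < 2:
        raise ValueError("need at least two counts to estimate a spread")
    return mean(counts), stdev(counts)


# 53 and 60 come from the text above; 70 is a hypothetical Google Scholar count.
counts = {
    "Web of Science (as listed)": 53,
    "Web of Science (with unlinked citations)": 60,
    "Google Scholar (hypothetical)": 70,
}

m, s = count_with_spread(list(counts.values()))
print(f"citations: {m:.0f} +/- {s:.0f}")
```

Whether a plain standard deviation is the right error model is a separate question; the point is only that reporting a spread next to the count is not the hard part.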

Sunday, August 26, 2012

Book review: G. Schilling's "Higgs - Een elementair abc over een elementair deeltje"

A few weeks ago Govert Schilling invited people via his twitter account to review his new Higgs book in Dutch. Yes, in Dutch. (The cover is shown on the right, taken in a book shop in Utrecht.)

Schilling has been popularizing science (particularly astronomy) to the Dutch public for a long time, and besides writing articles in newspapers and magazines, he has also written a long list of books. Just that? No. He is a very active user of twitter, and for a while was very active popularizing scientific concepts in series of tweets, which he called a twursus, mashing up the Dutch 'cursus' (which means course) and tweet (of twourse). These, inter alia, have been bundled as a book called Tweeting the Universe: Very Short Courses on Very Big Ideas by Chown and Schilling.

But this popularizing in Dutch is actually of utmost importance for the level of science in The Netherlands. The advantage native English speakers have in science has always annoyed me. Don't you wish those university rankings were compensated for that?

So, when he sent out that invite, I was eager to read his book on the just discovered Higgs boson. Physics was my best subject in secondary/high school, but I found chemistry to have a better complexity (or, physics was too simple). Yet, since my second or third year at the university I had not done much to keep up with new things in physics. Another aspect of my interest is, of course, my own recent book writing, and measuring that up against a book by an experienced science writer like Schilling. But I underestimated the work it would take, and the review below explains why that is the case.

The book sets out by introducing us to the language of nature. This was light reading for me, as a natural scientist who has read books on quarks, gravity, and relativity before. Some basic chemistry is discussed too, for example by outlining how the number of protons defines the element. Isotopes are skipped. From there on, it moves to quarks and the standard model. Compared to what I read about quarks some 20 years ago (a Dutch translation of a book by Gell-Mann), the field has moved on, and one speaks now of many types of flavors, with different properties.

Comment 1: And that gets me to one downside of this book: it features only a few graphics. For someone who likes to work with patterns, visualization is a rather helpful tool. Instead, the book is mostly text-oriented. Yet, this is easy to fix in future editions of the book. (I understood the book is already in a second print.)

It also gets me to a second problem, but that is more of a problem with me than with the book: I have prior knowledge, a bit outdated, a bit from a different field, and on occasion found it very hard to relate the material presented to things I know, or think I know. Some of my confusion I discussed with the author over Twitter (#twitpubinnov), such as that the Higgs field is something different than the gravity field, even though the Higgs field (or particle) gives matter mass. So, there is something that gives things mass (the Higgs boson), and something that makes things with mass interact (the graviton). And linking that to further prior knowledge: it is like there is a field that gives electrons charge, and the electric field that causes charged particles to interact. But, but... hence my headaches.

Comment 2: It is hard to find a pattern in the many fields. While Schilling gives a clear overview of all particles, and while he also hints at particle/wave duality and also mentions that the Higgs particle is like a ripple of the Higgs field (a particle/field duality?), I remain with many questions around these aspects of current physics knowledge.

The book then continues as a kind of dictionary or encyclopedia, covering the topics atlas, boson, CERN, dark matter, energy, fermion, graviton, Higgs particle, interaction - yes, the Dutch words are *very* similar -, J/PSI, forces ("krachten"), lepton, mass, neutrino, discovery ("ontdekking"), Peter Higgs, quark, renormalisation, sigma, theory, universe, field ("veld"), competition ("wedloop"), and gravity ("zwaartekracht"). These chapters are easy to read, but have the tendency to lead to more questions. This is covered to some extent by cross-linking between chapters, but I think the learnability can be improved. In blog posts we do this by adding hyperlinks, as I just did in this and the previous sentence. Of course, the paper nature of this book is not very helpful.

There are also a number of more specific things I would like to point out. Some are because I would like to learn more, others because I am not sure I fully understand what Schilling wanted me to pick up. For example, the chapter on energy has an analogy for the energy of the Higgs boson (~ 125 GeV) and indicates that it is about the same energy as that matching the mass of a gold atom (presumably via the omnipresent E = mc²). My immediate question was: if the Higgs boson has that much energy (or, is that 'heavy'), how can it give mass to something as light as an electron?
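As a rough check of that analogy, here is a minimal sketch of the arithmetic behind it. The constants are standard textbook values that I filled in myself, not numbers from the book: one atomic mass unit corresponds to about 931.494 MeV, a gold atom weighs about 196.97 u, and the electron rest energy is about 0.511 MeV.

```python
# Standard reference values (assumptions of this sketch, not taken from the book)
MEV_PER_U = 931.494      # energy equivalent of one atomic mass unit, in MeV
GOLD_MASS_U = 196.967    # atomic mass of gold, in u
ELECTRON_MEV = 0.511     # rest energy of the electron, in MeV

higgs_mev = 125.0 * 1000.0           # ~125 GeV expressed in MeV

# E = m c^2, so dividing an energy by MeV-per-u gives the equivalent mass in u
higgs_u = higgs_mev / MEV_PER_U
fraction_of_gold = higgs_u / GOLD_MASS_U
electrons_per_higgs = higgs_mev / ELECTRON_MEV

print(f"Higgs boson: about {higgs_u:.0f} u")
print(f"Fraction of a gold atom: {fraction_of_gold:.2f}")
print(f"Electron rest energies in one Higgs boson: {electrons_per_higgs:,.0f}")
```

With these values the Higgs boson comes out at roughly 134 u, about two thirds of a gold atom, so the comparison holds as an order of magnitude rather than exactly. It also puts a number on my question: the Higgs boson is roughly a quarter of a million times heavier than the electron.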

Comment 3: the book speaks physics, and is harder to read for other natural scientists. For example, the book writes that the mass of the electron is so small compared to that of protons that it can be ignored, but in biology and chemistry that is far from the case, and it is very important in the field of mass spectrometry. Of course, both are true, depending on the field of research you are looking at.

Some things I tend to disagree with. Not on the physics, where Schilling outranks me by orders of magnitude. I do question a statement such as that made in the first paragraph on the lepton. The text reads that "only at the end of the 19th century scientists start wondering if there are smaller particles than atoms". I don't buy that. I am sure they wondered about that before; just like we wonder what the particles are that constitute quarks, and why I want to understand biology at an atomic level. Did no biologist wonder if gene-protein is not a simple one-to-one relation before that dogma was overturned some 10-20 years ago? Did no biologist wonder what further mechanisms exist in gene regulation?

Instead, what happened at the end of the 19th century is that people became able to convert that intuition, that curiosity, into testable questions. And that relates to the text on theory, where Schilling writes that theories do not have a timeless truthfulness, citing Newton's law of gravity. But given all the discussion about "only just a theory" that the chapter starts out with, in particular since he pulls in the theory of evolution, I think it misleads the reader. What the text does not outline is that some theories have been shown to be false, like the flat earth, while other theories just have a limited scope, like Newton's law of gravity. Newton's law is perfectly valid, given a certain context.

This is actually an important issue that bites many current discussions, for example in ontology development, where you can find that some ontologies conflict with other ontologies, despite both of them being true. But I will have to write more on that on some other occasion :)

Comment 4: some analogies make the matter only more complex.

The book extensively uses analogies to 'visualize' the concepts. The book starts with the analogy of a language, and later switches to other analogies. For example, in the chapter on quarks, the interactions of quarks are explained as interactions between humans. But I have the feeling that comparing baryons to threesomes of three guys or two Swedish women and an Irish lady is distracting at best. I would have preferred to stick to the language analogy.

The sigma has popped up on the internet often, in relation to the Higgs boson, which I found rather confusing at the time. Was this the statistical sigma, typically estimated to get a feel for the chance distribution in measurements? But what does the "five sigma" refer to then? Obviously, the measurement of the Higgs particle's mass is not five sigma away from the theoretical value. So, what is it then? Unfortunately, Schilling's book does not make it much clearer, but at least it confirms that the sigma refers to the statistical one. He writes that the 'five sigma' is a community agreement on the significance of an experiment in the natural sciences. I guess I missed something in my statistics PhD at Radboud University, because I had not heard about it before.

This ties in to the lack of learnability of the paper book medium: you do not want to give too much detail, because you lose the reader in the one dimension you have: from left to right, top to bottom (at least in English and Dutch). Webpages like this add dimensions, primarily via hyperlinks. I have still to explore how the standard deviations of measurements can be statistically linked to the chance that that measurement reflects randomness in your experimental set up. The further reading on the last page does not do justice to the amount of curiosity triggered by this book.
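For what it is worth, here is a minimal sketch of that link as I understand the convention, and not something taken from the book. It assumes the fluctuations follow a normal (Gaussian) distribution and that "n sigma" is read as a one-sided tail probability: the chance that randomness alone produces a deviation of at least n standard deviations.

```python
import math


def one_sided_p(n_sigma):
    """Chance that a Gaussian fluctuation exceeds n_sigma standard deviations (one tail)."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


for n in (1, 2, 3, 5):
    p = one_sided_p(n)
    print(f"{n} sigma: p = {p:.3g} (about 1 in {1.0 / p:,.0f})")
```

Under these assumptions, three sigma corresponds to a chance of roughly one in 740, and five sigma to roughly one in 3.5 million, which would explain why the latter is used as the threshold for calling something a discovery.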

And that is both the power and importance of this book: you want to know more.

The book is easy to read and provides a lot of pointers around the just discovered Higgs boson. This whole discovery outlines perfectly what science is about: we make models (I prefer that term over theories) that describe experimental results and predict the outcome of future experiments. That core aspect of science must be communicated more often and as early as possible in the education of people (kids). That is why I am so happy that Schilling writes so many works in Dutch.

I highly recommend that all secondary schools in the Netherlands (and other countries speaking Dutch) buy this book. I also hope that future editions of this book will be extended both with graphics and with pointers to further reading after each individual chapter, but that must not stop you from getting a copy now. The price (~8 euro) won't stop you either.

Saturday, August 25, 2012

Dear Publisher $X, ...

May my readers find use in this template.

Dear Publisher $X,

Thank you for your email. However, I do not wish to hear about special issues or new journals via email. Please announce those via the social webs, and remove my email from your contact database.

Your journal does not seem to be (gold) Open Access, nor does it accept LaTeX as authoring environment, which makes me quite uninterested. If $X wants to start new journals, please do something innovative, instead of leeching off the academic community.

Hoping to have informed you sufficiently,

with kind regards,

Egon Willighagen

Friday, August 17, 2012

#ACSPhilly: Semantic pipelines to molecular properties

It is a great pleasure (and honor) to be able to speak at the upcoming Herman Skolnik Award Symposium (where Peter Murray-Rust and Henry Rzepa will be awarded). Unfortunately, I will not be present in person and will give my presentation remotely. I will put up the slides here, as normal, and hope that Google+ Hangout will work next Tuesday :)

Here's the abstract of my talk:
    In our quest to replace answers in molecular sciences by recipes to get answers, the semantic web technologies play the important role of giving meaning to numbers and characters. The Resource Description Framework (RDF) complements (and not replaces) earlier work with eXtensible Markup Language (XML) applications by providing a more clear separation between syntax and meaning. This creates an environment where multiple serialization formats can be used, that grows and shrinks in complexity where needed, and, for example, that can be easily embedded in document formats like HTML. We here present recent work in the dissemination and prediction of molecular properties, where data is shared in RDF, read into statistical and life science software including Bioclipse and R, and where molecular properties are predicted.
These publications serve as background material, and I know you will have read them, because you all have access and no excuse not to read them (also check this special issue around the ACS RDF 2010 meeting):

Samwald, M.; Jentzsch, A.; Bouton, C.; Kallesoe, C.; Willighagen, E.; Hajagos, J.; Marshall, M.; Prud'hommeaux, E.; Hassanzadeh, O.; Pichler, E.; Stephens, S. Journal of Cheminformatics 2011, 3, 19. [Open Access]
Willighagen, E. L.; Jeliazkova, N.; Hardy, B.; Grafström, R. C. BMC Research Notes 2011, 4. [Open Access]
Hastings, J.; Chepelev, L.; Willighagen, E.; Adams, N.; Steinbeck, C.; Dumontier, M. PLoS ONE 2011, 6, e25513. [Open Access]

I'll try to make the presentation good fun: controversial, challenging, etc. But first I need to figure out what that Chinese room is about.

Sunday, August 12, 2012

Creating QSAR models in #Bioclipse with #OpenTox

Of course, the Bioclipse team in Uppsala has been working on QSAR and proteochemometrics in Bioclipse from the start. But OpenTox (doi:10.1186/1758-2946-2-7) can generate (predictive) regression models too (it can do a lot). And we integrated Bioclipse and OpenTox before (doi:10.1186/1756-0500-4-487).

So, when Nina asked me about exposing the QSAR model building functionality of OpenTox in Bioclipse, I had a look at it. Because I had not hacked on the Bioclipse-OpenTox code much recently, I set out to add a few more unit tests. These are automatically run by the Jenkins installation. The number of unit tests doubled to some 52 tests, but the new tests also uncovered two regressions. One problem was that listCompounds() was not working anymore, and the other was that addMolecules(List<IMolecule>) was incorrectly named in the implementation, so that the method could not be called. Both are fixed now.

However, at the end of last night, I was feeling comfortable with the code again, and hacked up a function to be able to create QSAR models with OpenTox:

When I tested this on Nina's AMBIT installation (doi:10.1186/1758-2946-3-18), it nicely created this model:

The opentox.createModel() method takes four parameters. The first one is the regression method to use, the second the data set to use as training data. The third parameter refers to the features to be used as independent variables (x data), while the last parameter is the feature with the dependent variable (y data).

The stuff is compiled with Jenkins, and should be available as an update in your Bioclipse installation.

Thursday, August 09, 2012

Keeping up with Literature: Google Scholar suggests interesting reads

CiteULike has been recommending interesting papers for a while already, but Google Scholar has now introduced such functionality too. And it works, and has the nice touch that it shows me whether that interesting paper is citing my work. In the list below, we see that three papers are citing my papers:

Very well done indeed!

Google Scholar devs, if you're reading this, please make this citation info available as CCZero via an API, just like Nature does! That would be a real game changer, and an enormous boost for scientific research!

Jonathan Eisen also blogged about it and commented on the fact that not all papers are equally useful. But similarly to detecting spam, I expect Google Scholar will have no trouble implementing a feature to tune "my area of interest" after I click a "not interesting" button.

BTW, I do appreciate that some people will not join Google Scholar despite this useful functionality (pretty much why I do not like to join FaceBook), but for Google Scholar I am more than happy to have it know my social network of fellow scientists.

Tuesday, August 07, 2012

Twitter in Publishing Innovation #twitpubinnov

For some time I have been longing for better integration of publications with online science. That is, conferences must not be allowed to monopolize discussion of new research. Indeed, the blogosphere and later the twitterverse have proven to be great places to discuss new research, very much like we do at conferences, but instantaneous and in an open, low-barrier manner. And this is shown to work for papers (think #altmetrics) but also posters (see

And scientists are on twitter (see e.g. SciencePond). I do not know the exact numbers, but I remember someone saying that one in three US scientists are on twitter (source?). And, importantly, Twitter is a much better channel than email for online discussion. Certainly not the only one, and I would very much like all scientists to join, the free twitter-like network. But it certainly beats other social accounts that are not primarily linked to communication, such as a CiteULike or Mendeley account.

Adding the Twitter handle in addition to email in publications as a means to reach the authors sounds like a perfect idea to me. Graham Steel (@McDawg) liked it too when I suggested it, and importantly, followed up on it (because ideas without action are meaningless):

And as you can see from the comments, some editors are liking the idea too. So, time to push the idea a bit more, and start a #twitpubinnov campaign.

So, if you like this idea too, and you also want to list your Twitter handle, along with your email on your publications, please visit this tweet and retweet it. Blogging is perfectly fine too, and please use the hashtag as much as possible, and let's get this viral!

Monday, August 06, 2012

CDK I/O options in the JavaDoc

The CDK readers and writers have options to customize the input and output. For example, the MDL readers have the option to read T and D as hydrogen isotopes, something not supported by the format specification itself.

With help from Stefan Ferstl at StackExchange I got a Taglet set up to output all IO settings to the JavaDoc:

The full patch is in preparation, and will be submitted soon.

Thursday, August 02, 2012

All-round bioinformatics system administrator, Maastricht, the Netherlands

Our BiGCaT group is looking for an all-round bioinformatics system administrator in Maastricht, the Netherlands, to support our bio- and cheminformatics research on various Open Source projects, including the Chemistry Development Kit, WikiPathways, PathVisio, and more.

Details of this vacancy can be found in the Jobs section of BioStar (see the screenshot on the right). In essence, we need an enthusiastic systems administrator who loves to keep fancy servers online (mostly GNU/Linux servers), as well as the services running on those, in collaboration with our scientists working on fluxomics, semantic web for the life sciences, drug discovery, and much more.