Showing posts with label science. Show all posts

Saturday, April 05, 2014

Every PhD student must use Git (aka research data management)

Last Thursday and Friday the SURFAcademy Masterclass Research Data Management in Nederland took place, and Chris Evelo and I presented some biology-world use cases. He focused more on the larger projects (e.g. ISA-TAB, GSCF, and FAIRPort) while I showed my day-to-day data management. My day-to-day work habit looks more or less like this.

Day 0 is to think about how to do it, but the answer is pretty simple: use a version control system, like Git, because it tracks every bit of what you do, allows for easy backups, and makes it easy to continue working on a different machine in case you forget to take your laptop adapter home :)

  • Day 1: keep an electronic lab notebook (e.g. a version control system; read Git from the Bottom Up)
  • Day 2: carefully select data you build on (can you indeed share it with the rest of your arguments in your next paper?)
  • Day 3: do your research and store everything
  • Day 4: integrate data repositories in your data analyses, e.g. rrdf and knitr
  • Day 5: if you like scientific dissemination, collaboration, and progressing science, share your data in a public repository, like FigShare, Data Dryad, Dutch Dataverse, 3TU.Datacentrum, DANS, etc. (that's a lot of D-D-D-Data...) or in a domain specific database, like WikiPathways, XMetDb, or DrugMet. Also think about data copyright and licenses and, in particular, whatever you choose, be explicit about it and don't let others guess (wrong).
  • Day 6: think ahead of reuse, and suitable formats. Consider semantic web and linked data.
  • Day 7: did you get impact? Think DataCite, ImpactStory, and Altmetric (and ORCID and DOI along the way).
And here are the slides:

Friday, March 28, 2014

Breaking News: CC-NC only for personal use!

As some of us already suspected, the Non-Commercial (NC) clause of the Creative Commons (CC) licenses is a killer. A German court has ruled that the NC clause means that the material is only for personal use. And that is literally breaking news! It means that such material is not Open Access in the context of (European) universities. I learned from Lessig's Free Culture (a must read) that academic use falls under fair use under US law, but as far as I know this is not the case in Europe. It effectively means that all journals using a CC license with the NC clause now officially do not fall under most Open Access directives (AFAICS but IANAL).

(Image from WikiMedia.)

Monday, February 24, 2014

Journal rankings and Expectations

Publishing to me is getting the message out. I want to contribute to a better world, via my work. I do this by making methods more precise and by reducing error. You probably spotted that theme in my publications. The impact of my work is reflected by how often people reuse my work (extend, use, ...). A flawed reflection of that is the number of citations of my publications; better would be the number of projects that use my work. For example, tools that use the CDK (which has many more authors!), like Bioclipse, AMBIT, LICCS (CDK in Excel), KNIME, and many more. Actually, the number of times these projects get cited indirectly also contributes to the impact of my work.

"We" don't count that. Instead, "we" tend to focus on the Journal Impact Factor (JIF). There is plenty of material around that discusses how flawed that is in assessment, particularly for assessing scientists. But I would rather explain now why I still think about it: because it is part of how I, our research group, our research institute, and our university are assessed. I do not have to agree with the overhyped role of the JIF, but I cannot go around it either. Well, actually, I can to some extent.

At the moment, I have identified the following ways in which our work is being assessed, e.g. by national funding agencies and organizations that want to ensure good science in The Netherlands.
  1. papers in journals with a JIF >= 5
  2. papers in journals that rank in the Journal Citation TOP1, TOP10, and TOP25
I have seen worse. I have attended universities where the JIF is directly involved in the calculation. I have also seen better, and have attended universities that do actually count the number of times my papers are cited, and I am still proud to have five papers in the top 5% most cited papers.

Now, when using these guidelines practically, you find some interesting facts. For example, only very few journals are TOP1. In fact, Science is not; it is (JIF >=5 && TOP10). Not so many journals have a JIF >= 10, but >= 5 is not that uncommon. The biggest problem here is that this cut off is field specific, and >= 5 is trivial in biology, but hard in chemistry.

Combine that, and you get a situation where many Open Access journals (and OA matters to me) are JIF >= 5 && TOP10, putting them, for the rankings, on par with Science.

Now, it gets better. Taking the above assessment rules into account, have a look at PLoS Biology. It has a JIF >= 5 (>= 10 even) and is TOP1! Thus, it outranks Science. Of course, there is also the implicit rule that Nature and Science go above all, but maybe the assessors should start to think about what they really want to care about, and what they really should be expecting from scholars.
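To make the comparison concrete, here is a minimal sketch of the two assessment rules as a sort key. The class, the field names, and the way the two rules are combined into one ordering are my own assumptions for illustration; the only inputs taken from the text are that Science is (JIF >= 5 && TOP10) and PLoS Biology is (JIF >= 5 && TOP1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Journal:
    name: str
    jif_at_least_5: bool  # rule 1: does the journal have a JIF >= 5?
    top_bracket: int      # rule 2: 1, 10 or 25 (hypothetical: 100 = in no TOP list)


def rank_key(journal):
    """Smaller key means ranked higher under the two rules (illustrative ordering)."""
    return (0 if journal.jif_at_least_5 else 1, journal.top_bracket)


# Brackets as described in the text; no actual JIF values are assumed here.
science = Journal("Science", jif_at_least_5=True, top_bracket=10)
plos_biology = Journal("PLoS Biology", jif_at_least_5=True, top_bracket=1)

ranked = sorted([science, plos_biology], key=rank_key)
print([journal.name for journal in ranked])
```

Under these rules alone, nothing in the key knows about the implicit "Nature and Science go above all" rule, which is exactly the point.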

Now, go read my latest preprint instead of wasting your time on rankings.

Sunday, August 26, 2012

Book review: G. Schilling's "Higgs - Een elementair abc over een elementair deeltje"

A few weeks ago Govert Schilling invited people via his twitter account to review his new Higgs book in Dutch. Yes, in Dutch. (The cover is shown on the right, taken in a book shop in Utrecht.)

Schilling has been popularizing science (particularly astronomy) to the Dutch public for a long time, and besides writing articles in newspapers and magazines, he also wrote a long list of books. Just that? No. He is a very active user of twitter, and for a while was very active in popularizing scientific concepts in a series of tweets he called a twursus, mashing up the Dutch 'cursus' (which means course) and tweet (of twourse). These, inter alia, have been bundled as a book called Tweeting the Universe: Very Short Courses on Very Big Ideas by Chown and Schilling.

But this popularizing in Dutch is actually of utmost importance for the level of science in The Netherlands. The advantage native English speakers have in science has always annoyed me. Don't you wish those university rankings were compensated for that?

So, when he sent out that invite, I was eager to read his book on the just discovered Higgs boson. Physics was my best topic in secondary/high school, but I found chemistry to have a better complexity (or, physics was too simple). Yet, since my second or third year at the university I had not done much to keep up with new things in physics. Another aspect of my interest is, of course, my own recent book writing, and measuring that up against a book by an experienced science writer like Schilling. But I underestimated the work it would take, and the review below explains why that is the case.

The book sets out with introducing us to the language of nature. This was light reading for me, as a natural scientist, and having read books on quarks, gravity, and relativity before. Some basic chemistry is discussed too, for example by outlining how the number of protons defines the element. Isotopes are skipped. From there on, it moves to quarks and the standard model. Compared to what I have read about quarks some 20 years ago (a Dutch translation of a book by Gell-Mann), the field has moved on, and one speaks now of many types of flavors, with different properties.

Comment 1: And that gets me to one downside of this book: it features only a few graphics. For someone who likes to work with patterns, visualization is a rather helpful tool. Instead, the book is mostly text-oriented. Yet, this is easy to fix in future editions of the book. (I understood the book is already in a second print.)

It also gets me to a second problem, but that is more of a problem for me than of the book: I have prior knowledge, a bit outdated, a bit from a different field, and found it very hard on occasion to relate the material presented to things I know, or think I know. Some of my confusion I discussed with the author over Twitter (#twitpubinnov), such as that the Higgs field is something different than the gravity field, even though the Higgs field (or particle) gives matter mass. So, there is something that gives things mass (the Higgs boson), and something that makes things with mass interact (the graviton). And linking that to further prior knowledge: it is like there is a field that gives electrons charge, and the electric field that causes charged particles to interact. But, but... hence my headaches.

Comment 2: It is hard to find a pattern in the many fields. While Schilling gives a clear overview of all particles, and while he also hints at particle/wave duality and also mentions that the Higgs particle is like a ripple of the Higgs field (a particle/field duality?), I remain with many questions around these aspects of current physics knowledge.

The book then continues as a kind of dictionary or encyclopedia, covering the topics atlas, boson, CERN, dark matter, energy, fermion, graviton, Higgs particle, interaction - yes, the Dutch words are *very* similar -, J/PSI, forces ("krachten"), lepton, mass, neutrino, discovery ("ontdekking"), Peter Higgs, quark, renormalisation, sigma, theory, universe, field ("veld"), competition ("wedloop"), and gravity ("zwaartekracht"). These chapters are easy to read, but have the tendency to lead to more questions. This is covered to some extent by cross-linking between chapters, but I think the learnability can be improved. In blog posts we do this by adding hyperlinks, as I just did in this and the previous sentence. Of course, the paper nature of this book is not very helpful.

There are also a number of more specific things I would like to point out. Some are because I like to learn more, others because I am not sure I fully understand what Schilling wanted me to pick up. For example, the chapter on energy has an analogy for the energy of the Higgs boson (~ 125 GeV) and indicates that it is about the same energy matching the mass of a gold atom (presumably via the omnipresent E = mc²). My immediate question was: if the Higgs boson has that much energy (or, is that 'heavy'), how can it give mass to something as light as an electron?

Comment 3: the book speaks physics, and is harder to read for other natural scientists. For example, the book writes that the mass of the electron is so small compared to that of protons that it can be ignored, but in biology and chemistry it is far from that, and very important in the field of mass spectrometry. Of course, both are true, depending on the field of research you are looking at.

Some things I tend to disagree with. Not on the physics, where Schilling outranks me by orders of magnitude. I do question a statement such as the one made in the first paragraph on the lepton. The text reads that "only at the end of the 19th century scientists start wondering if there are smaller particles than atoms". I don't buy that. I am sure they wondered about that before; just like we wonder what the particles are that constitute quarks, and why I want to understand biology at an atomic level. Did no biologist wonder if gene-protein is not a simple one-to-one relation before that dogma was overturned some 10-20 years ago? Did no biologist wonder what further mechanisms exist in gene regulation?

Instead, what happened at the end of the 19th century is that people started being able to convert that intuition, that curiosity, into testable questions. And that relates to the text on theory, where Schilling writes that theories do not have a timeless truthfulness, citing Newton's gravity law. But with all that discussion about "only just a theory" with which that chapter starts out, in particular since he pulls in evolution theory, I think it misleads the reader. What the text does not outline is that some theories have been shown to be false, like the flat earth, while other theories just have a limited scope, like Newton's gravity law. Newton's law is perfectly valid, given a certain context.

This is actually an important issue that bites many current discussions, for example in ontology development, where you can find that some ontologies conflict with other ontologies, despite both of them being true. But I will have to write more on that on some other occasion :)

Comment 4: some analogies make the matter only more complex.

The book extensively uses analogies to 'visualize' the concepts. The book starts with the analogy of a language, and later switches to other analogies. For example, in the chapter on quarks, the interactions of quarks are explained as interactions by humans. But I have the feeling that comparing baryons to threesomes of three guys or two Swedish women and an Irish lady is distracting at most. I would have preferred to stick to the language analogy.

The sigma has popped up on the internet often, in relation to the Higgs boson, which I found rather confusing at the time. Was this the statistical sigma, typically estimated to get a feel of chance distribution in measurements? But what does the "five sigma" refer to then? Obviously, the measurement of the Higgs particle's mass is not five sigma away from the theoretical value. So, what is it then? Unfortunately, Schilling's book does not make it much clearer, but at least it confirms that the sigma refers to the statistical one. He writes that the 'five sigma' is a community agreement on the significance of an experiment in natural sciences. I guess I missed something in my statistics PhD at Radboud University, because I had not heard about it before.

This ties in to the lack of learnability of the paper book medium: you do not want to give too much detail, because you lose the reader in the one dimension you have: from left to right, top to bottom (at least in English and Dutch). Webpages like this add dimensions, primarily via hyperlinks. I have still to explore how the standard deviations of measurements can be statistically linked to the chance that that measurement reflects randomness in your experimental set up. The further reading on the last page does not do justice to the amount of curiosity triggered by this book.
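As a starting point for that exploration, the link between a number of standard deviations and the chance of a random fluctuation can be sketched numerically. This is a minimal sketch, assuming the noise is Gaussian and using the one-sided tail convention that is common in particle physics; the function name is made up for illustration and nothing here comes from the book.

```python
import math


def one_sided_tail_probability(n_sigma):
    """Chance that pure Gaussian noise fluctuates at least n_sigma
    standard deviations above its mean (one-sided tail)."""
    # The standard normal survival function, written with the
    # complementary error function: 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


for n_sigma in (1, 2, 3, 5):
    p = one_sided_tail_probability(n_sigma)
    print(f"{n_sigma} sigma: p = {p:.3g}")
```

Under these assumptions, five sigma corresponds to a chance of roughly 3 in 10 million that noise alone produces a signal at least that strong, which is why it works as a (strict) community threshold rather than as a statement about the measured mass itself.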

And that is both the power and importance of this book: you want to know more.

The book is easy to read and provides a lot of pointers around the just discovered Higgs boson. This whole discovery outlines perfectly what science is about: we make models (I prefer that term over theories) that describe experimental results and predict the outcome of future experiments. That core aspect of science must be communicated more often and as early as possible in the education of people (kids). That is why I am so happy that Schilling writes so many works in Dutch.

I can highly recommend that all secondary schools in the Netherlands (and other countries speaking Dutch) buy this book. I also hope that future editions of this book will be extended with both graphics and pointers to further reading after each individual chapter, but that must not stop you from getting a copy now. The price (~8 euro) won't stop you either.

Sunday, January 29, 2012

First month back in NL...

Moving country is exhausting. Living in a house full of boxes for a few weeks. Finding a house. Changing culture. Maybe it's a linguistic thing, but EU countries do not share the same culture. OK, we too have a McDonalds on every corner, but that's about it. But returning to The Netherlands was a cultural shock. A shock? Yes. I thought I knew the country I lived in most of my life.

Then, switching position. Posthopping (=post-doc here and there, attempting to find some local optimum where you both work on exciting things and try to set up a research group) around Europe (I have pension in four EU states now), while trying to keep writing papers and on top of that try to do something that in fact has impact on our science, means that every three months before the end of a post-doc position, and three months after you started the next, it's double work: finding your way around at the new university, while finishing those studies that almost were finished, in random, unpredictable order.

And, of course, being annoyed if your prime minister then claims he sometimes cannot get his work done in 40 hours. Well, one would actually think that the prime minister of a country in an economic crisis, with people eating up all their hard-worked-for savings just to get by, would do his best to turn the future of the country around... oh well...

Sometimes I really wonder what I'm doing.

And then, in a spare hour here and there, do something for myself. Like writing up this post, in an attempt to give it all a place. Or finishing up a further paragraph of my book(let), or working on my contributions to the Pharmaceutical Bioinformatics book (molecular representation, semantic web for the life sciences). For my own Groovy Cheminformatics book(let): seventy more pages, and it's a book. Hard-cover, and I can start touring around Europe. BTW, I enjoy and can recommend reading Reinventing Discovery. Done the first 30 pages or so, and keep wondering how those examples can be scaled down to cheminformatics.

Sometimes I really wonder why I keep working in an area that everyone just takes for granted and hardly cares about.

I'm tired, and this is slowly becoming a really boring and depressing blog post. That's a shame, because I have had a really great time with Roland Grafström and Bengt Fadeel, working among and with one of the greatest, most enthusiastic research teams I have seen around Europe. Having to leave that makes me sad too. In fact, I have never ever been homesick, and now, going back to the country I grew up in, I am homesick. Well, it's a feeling I don't like.

Weirdly, with many really exciting ideas, research-wise, my exciting daily work at BiGCaT, which is now in Open PHACTS, and the network in The Netherlands, I have much to enjoy here. Yes, it is again hopping to another application area of cheminformatics, after interaction of cheminformatics and chemometrics (my thesis), more fundamental cheminformatics, metabolite identification, pharmaceutical research, toxicity, and now back to drug discovery but also the metabolome. But I love the complexity of the metabolome, and have so much detailed insight in the other fields now... oh, the endless possibilities!

And then I remember why I am doing this to myself.

All the endless possibilities! All the research we can do so much better than is done now! The more accurate answers we get, and actually being in a situation where we can start identifying limitations of cheminformatics! Ha, and you know I love to look beyond the edge of the world.

But, then I realize again that I need funding, and wonder how I can live my dream, if no one believes in it.

Not that I have been completely unsuccessful. Au contraire. I did get funding, for travel on many occasions, and recently small bits for research too. But I am really eager to get some funding to have others research the ideas I have, rather than working on them myself. And eager to get a fixed position. Though I am grateful to Chris Evelo for offering the three-year position I am in now.

Next time someone starts talking about interdisciplinary research, get a trout out of your bag. Interdisciplinary research is a buzz word that only works when you already have a single-disciplinary fixed position. Advice to students: never start an interdisciplinary research topic. You will never be the expert people will want to fund, because interdisciplinary research can simply be done by single-discipline experts in a collaboration, and much better than you could, with your years of experience (n=1).

I also now realize that strengthening another project is also no good for your own career. Your hard work will just go to that project. You can contribute as much to some project as you like, but the corresponding Dr. Who will get the fame. No wonder people rename, brand, and use rather than collaborate. We desperately need #altmetrics.

Yes, I realize this applies to the CDK too. I am trying hard to get recognition to those who deserve it. But who reads a copyright statement? Who remembers blog posts with change logs and statistics on who did the work? Scientists in charge of funding remember only the top person.

Ha, you see that pattern applies to publishing too, right? Scientists only too often care more about the JIF of the top concept, the journal, than the actual work, your actual damn paper.

Oh well, fortunately it's almost Monday again, so that I can focus on science again, and don't have to think about these things.

And, I am deeply grateful to all that publicly support my output. A citation to one of my papers, a public review of my book, a new tool that stands on the shoulders of my work! That makes a difference!

Then I remember again why I am doing all this. I can make a difference.

Friday, August 12, 2011

Usability: what happens if you neglect less abundant personas

Despite an initially hesitant BioStar community, I got some good replies on my question about biology personas, including good material from Søren Mønsted of CLC bio. Coincidentally, a few humorous perspectives came online, which in fact nicely demonstrate what I'm getting at.

When building a new platform, you need to know who will be using it and how, and how those people will interact. So, for our ToxBank design we need personas to do the requirement analysis, and I have created initial draft personas now, which I hope I'll be able to share later.

So, how people interact is important, as communication is central to scholarly research. For example, this is why we blog: blogs are like conferences. And some insight in how the various personas look at each other can be helpful in describing personas and modeling a social science platform. Matus Sotak (aka @biomatushiq) created this funny but right-on overview:

So, there it is: five personas, each of whom characterizes the other. These views reflect how others think about that persona, which is what a persona is all about: a virtual character we recognize and can characterize in terms as done in this plot. If we hook this to requirements, we could observe that the less-knowledgeable need better access to important literature. Just to name something off the top of my head.

The second is an XKCD comic. This one is more important to the message of this post: what happens if you neglect personas? The above comic shows that ignoring personas is daily business, but is that bad?

This shows two personas: an average user who appreciates cool GUIs and apps on cool topics, and a regular dude who lives in an area where tornadoes actually occur. The take-home message here is that the mere ratio of persona abundance is not generally a proper guide for design.

Now, try to map these two comics to anything you see around. For example, do the five personas match your research group? How does the head of your group handle this? Is hes accepting the status quo, or is hes trying to overcome these stereotypes? How do these personas get reflected in author lists? How does that map onto how you think about your EU project partners? Is it useful?

Repeating this experiment for the second comic is more useful. For example, map this comic to your citation list, and then reevaluate the impact of your research. This is exactly why CiTO is crucial. For our ToxBank project this last observation has major implications too.

Wednesday, May 04, 2011

The costs of VR fund applications. Tell me I'm wrong...

One of the main Swedish funding agencies is called the Vetenskapsrådet (VR). They just reported the number of applications in their big funding round of this year: 4606! That is a staggering number. I decided to do some math here. Let's assume each proposal took about one week of effort, possibly shared between two or more scientists. Let's assume the rate of one week of a scientist at a Swedish university is about 1000 euro (so, including common overhead). That means that it cost the scientific community more than 4.6 million euro to apply for funding this year! Wow! Considering that about 12% gets awarded (10% at any university, and 20% at KI where I work), that means that about 3.5 million euro was wasted making scientists learn to write grant applications... well spent indeed!

Now, I truly hope I am making some stupid mistake in my argumentation here... I don't like to think what that amount of funding in one year could do for Open Source cheminformatics... (Oh, and this excludes the money needed to actually read these proposals, though they do not actually read all.)
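For anyone who wants to check the back-of-the-envelope numbers, here is a minimal sketch of the calculation. The inputs are the assumptions stated above (4606 applications, one week of effort each, about 1000 euro per week, about 12% awarded); the variable names are just illustrative.

```python
applications = 4606              # number reported by VR for this round
cost_per_application_eur = 1000  # assumed: one scientist-week, including overhead
success_rate = 0.12              # assumed overall share of proposals awarded

# Total effort spent on writing all proposals.
total_cost_eur = applications * cost_per_application_eur

# Share of that effort that went into proposals that were not awarded.
unsuccessful_cost_eur = total_cost_eur * (1 - success_rate)

print(f"total cost of applying:      {total_cost_eur:,.0f} euro")
print(f"spent on rejected proposals: {unsuccessful_cost_eur:,.0f} euro")
```

With exactly these inputs the rejected share lands a bit above 4 million euro, so the rough 3.5 million quoted above is, if anything, on the conservative side. Changing the weekly rate or the success rate is a one-line edit.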

Thursday, January 20, 2011

Is Nature really clueless about Blogs, Twitter, etc? WTF ?!

My apologies for this rant in the early morning, but WTF?? (what the fuzz??) I just got pointed to this Peer review: Trial by Twitter (doi:10.1038/469286a) by Mandavilli. Cool title, but before I even finished seventeen words of the intro... WTF?? Here it is:

    Blogs and tweets are ripping papers apart within days of publication, leaving researchers unsure how to react.

What?? Is she mocking me? I know (I have been a reporter for a university newspaper) that intros must encourage the reader to read on... but What?? (And I read the intro a third time...)

I'll have to read the full thing later, if that makes more sense. But is she clueless? Are all people clueless about blogging, tweeting, etc?? Remember Royce Murray? Has she actually read the Trial by Twitter only so recently?

Dear Mandavilli, in case you do run into this blog post, here's my reply to your intro: "The researchers have no problems how to react, they just did."

Now, after I cooled down a bit, and anticipating I got it all wrong, she might refer to the researchers of the publication being ripped apart. In that case, I am tempted to believe that also in the English language one is expected to use (well, forgive me I do not know the exact term) "leaving the researchers ...", where 'the' links 'researchers' to something said earlier. Now, I read, probably wrong, researchers as any researcher interested in that publication. Mandavilli could even have written "leaving the authors...". But what do we have Nature editors for, right?

Anyways, I do believe this will be an interesting read once I manage to read past (for the fourth time) the intro of this article.


Friday, December 10, 2010

Trust has no place in science #2

Thanks to all who replied and shared their views. Particular thanx to Christina who replied in her blog. Along with Saml, Cameron, and Bill, she thinks this is about semantics. Linguistic tricks. I hope not; this is too serious to get away with such. "Reliable, trustworthy, assumptions": it's all working around the real issue. Similarly, splitting up 'trust' into 'blind trust' and 'smart trust' is just working around the real problem.

Indeed, my point is different. The key of science is to replace trust by facts. Or, when talking about databases, software, research papers in Nature, it is replacing trust with traceability. Actually, we seem to have lost a long-standing tradition of citing previous work when we write down the arguments we base our argumentation on. Facts are backed up with references, providing the required traceability.

Now, compare that to current electronic sciences. We 'trust' our database to have done something sane. Well, don't. They made an attempt, but made errors. As they say with software, having zero bugs just means you have not found them yet.

The real point with 'trust' is that it is completely irrelevant. It adds zero to the scholarly discussion. Whether you trust the highly curated ChEMBL database or not, it has errors. (Noel pointed out one source of ambiguity in the ChEMBL database this week). What does matter, instead, is if those errors are significant. Do they affect the conclusions I draw when I use this data? That is what actually matters. Trust has no place in science. Error has.

Sadly, this is basically the hypothesis of the VR grant I wrote up but did not get awarded. But I trust I will do better next time.

Why does this matter? Well, this is what ODOSOS is about: bring back the traceability into science, and get rid of trust.

Monday, December 06, 2010

Trust has no place in science

One discussion I have often had in the past year is about trust in science. I, for one, believe (hahahaha; you see the irony? ;) that trust has nothing to do with science. Likewise, any scholar should be, IMHO, suspicious when someone talks about trust. A scholarly scientist will never trust any result: hes will accept it as true or false, but will take responsibility for that decision; hes will not hide behind 'but I trusted him' or 'but it was published in Nature'.

Antony asked the community last week to answer a questionnaire, which turned out to be about our trust in online chemical databases. He presented the results at the EBI. This is the slide that summarizes the results from that questionnaire:

We see that trust clearly has a very significant place in science. How disappointing. You can spot me in these results easily: I am the one that consistently answered 'Never Trust' for all databases. It's not that I do not value those databases, but there is no need for me to trust them. I verify. This is actually a point visible in Tony's presentation: we can compare databases.

This is the point that I and others have been making for more than a decade now: if we do things properly, we can do this verification. Anyone can. With Open Data, Open Source, and Open Standards we can. I can only stress once more how important this is. We trust people, we trust government, but repeatedly this trust is taken advantage of. Without transparency, people can hide. By being able to hide, humans lose their ability to decide what is right. With transparency, we see things return to normal, as we saw this week with UK politicians.

Further reading in my blog:

Update: if you liked this post, you will also like blog posts like this one from Björn.

Saturday, October 16, 2010

Royce Murray and Caveat Emptor

Derek's blog pointed me to an editorial by Royce Murray, Science Blogs and Caveat Emptor (doi:10.1021/ac102628p). He is warning us, science scholars, against blogs. He is accusing bloggers of not being scholarly, not checking facts etc.

He did himself and the journal a big disservice with this editorial: in it he does precisely what he is accusing the blogger of: failing to check facts. Even worse, particularly for the 'Analytical Chemistry' journal, he proved inadequate at analyzing the problem, putting his scholarly skills at questionable levels: he failed to see what 'blogging' is and what it is not, and he failed to ascribe his concerns to the proper source; effectively, he failed to see the difference between correlation and cause-effect for 'blogging' (unworthy of any scholar, particularly if you start complaining). I invite Royce to blog his full analysis of the problem, with proper underlying data, facts, etc, so that I (and others) can explain to him the true factors involved in this problem he is noticing.

The editorial is a sad piece, and an editorial unworthy of the journal.

Actually, the fact that he mentions the Impact Factor is amusing. It must be noted that his editorial will have a huge impact, but not because the writing is any good, but because it is utterly wrong. And that reflects only one thing that is wrong with impact factors.

I strongly suggest Royce check his facts before he starts writing. The ethics expressed in the editorial seem only to apply to other scholars.

If you wonder about my strong language: it was triggered by these words from the editorial: "In the above light, I believe that the current phenomenon of “bloggers” should be of serious concern to scientists." I consider myself a blogger, not unreasonably given the fact that I blog, and feel personally attacked. Hence, the title of this post: Royce Murray and Caveat Emptor.
Murray R (2010). Science Blogs and Caveat Emptor. Analytical chemistry PMID: 20939598

Sunday, August 01, 2010


Yesterday I arrived in Oxford, after a 3.5 hour bus transfer from London Stansted. Long, boring ride (though I might have seen a few red kites, but seeing that they were nearly extinct, I am wondering what other large bird of prey has a strongly forked tail like a swallow). It showed once more that the UK infrastructure has hardly changed since the 19th century. Enjoying an undergraduate room at one of the colleges. Pretty basic, but makes me feel more like a human than a tourist. Yes!, undergraduate students are human too! One of the advantages is you get an excellent internet connection :)

Anyways, going to the Predictive Toxicology workshop, thanks to the bursary award I received from eCheminfo (see Oxford, August 2010: eCheminfo Predictive ADME & Toxicology 2010 Workshop).

This afternoon I walked around a bit, watching all the old buildings. But I guess being here without anyone to share it with, and the fact that it just looks like Cambridge, leaves me not so much impressed. Moreover, it's too busy with tourists and people randomly wearing Oxford University sweatshirts. Small and nice was the Museum of the History of Science, with some nice chemical pieces, like this one:

Buildings like the Radcliffe Camera are nice on the outside, but closed. Seems I have to become a fellow first. This is what it looked like today:

Quite interesting too was the Oxford University Press shop. I'm a sucker for books. Apparently, you can just write a book and publish it. For example, an extensive list of dictionaries on about anything... and since I have been writing several book chapters right now, perhaps this is actually an interesting route...

But the question is, of course, how long will we keep reading books... they're the hamburgers of educational material... Kindle and the like will soon drop in price, and cost some €30. But e-book prices will have to drop too, and I still do not get why an e-book is more expensive than a paperback... (see Amazon, the Kindle edition is more expensive than the paperback??). But then again... they are rich, and I am not.

There was some recent talk about the fact that no one can be Open to the full. You either do Open Data or Open Source, and make a living from the rest. That's where I nicely show I know bollocks about economics. I do BODR, CDK, ... all Open, all for free.

OK. That's a plus for Oxford... it makes you think about things. Perhaps there is something to morphogenetic fields...

Tuesday, June 15, 2010

I’m just a sucker with no self esteem

Don't let anyone fool you: The Offspring is really just talking about Science:

When she’s saying that she wants only me
Then I wonder why she sleeps with my friends
When she’s saying that I’m like a disease
Then I wonder how much more I can spend
Well I guess I should speak up for myself
But I really think it’s better this way
The more you suffer
The more it shows you really care
Right? yeah yeah yeah

Really! Just read the whole text:

Late at night she knocks on my door
Drunk again and looking to score

Now I know I should say no
But that’s kind of hard when she’s ready to go
I may be dumb
But I’m not a dweeb
I’m just a sucker with no self esteem

Monday, February 22, 2010

What IF or Article Level Metrics does not tell you...

This weekend there was the really nice Science Commons Symposium, which I virtually attended, and there is an interesting discussion at FriendFeed on article level metrics.

Now, I just reported on the CDK functionality used in published research. Linking this to impact, the CDK with 115 citations now (both papers, nice increase from 2006) is not doing badly. But the real impact goes further than the direct citations. The BRENDA enzyme database is one of the projects where CDK functionality is (was?) used, and the matching papers (doi:10.1093/nar/gkh081 and pmid:11752250) have been cited 241 times. Surely, BRENDA does very much more than just the used CDK functionality.

But, in my opinion, it says something about the impact of the CDK too. What do you think? Should these counts be included in the article level metrics too? I am almost tempted to even pose that those counts are more interesting than the number of blog replies...

Friday, February 19, 2010

Open Data: the Panton Principles

The announcement of the Panton Principles is the big news today, though Peter already spoke about them in May last year (see coverage on FriendFeed and Twitter). The four principles list in their short versions:
  1. When publishing data make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.
I think these are very workable next steps in Open Data, perhaps even worthy end goals. I endorse them.

Principle 1: an explicit and robust statement
This is in my opinion the most important principle. Too often you find a database with really useful data, but without any clue about what you are allowed to do with this data. Of course, I can contact the authors, get their permission, etc. They probably like it that way, and I can even understand that. However, it does not scale, and it is slow. Even worse is the situation when the original composer goes missing in action. Both are equally valid, but explicit statements just make things easier.

Principle 2: use a waiver or license appropriate for data
This principle is debatable. Very much like the BSD-vs-GPL flamewars, some like copylefting, others do not. There is an important difference though. Software has the concept of interfaces, which makes it easier to combine code under incompatible licenses, cleanly separated by these interfaces. This, for example, allows you to run proprietary software on a Linux kernel. However, data sets do not have such a concept. There is no such thing as an interface between two numbers.

This makes the concept of mixing data sets different: because there is no such interface, any mixing can only happen between compatible licenses. This is one reason behind the choice of very liberal licenses like CC0. This license, or waiver really, allows you to do anything, and most certainly, mix data sets.

And that makes things a lot easier. But then again, while these are noble goals, I would rather see people use a copyleft license than no license at all.

Principle 3: non-commercial and other restrictive clauses should not be used
I think again making things easier is the goal. The non-commercial clause is interesting, and actually likely an important one. Consider course material, a course book. Those are commercial. Some even argued that many universities themselves are actually commercial entities.

Principle 4: the public domain via PDDL or CCZero is strongly recommended
I second these choices over a mere claim that the data is public domain. The PD concept has many meanings, and they are not the same in every jurisdiction; in particular, there are differences between USA and EU law. Waiving these rights, which is just the same as claiming public domain, works in any jurisdiction, again making things a lot easier.

Open Data, Open Source, Open Standards are not goals
The underlying pattern of my comments must be clear: the principles make life easier. This is all that Open Source and Open Standards (whatever those are) are about too.

    The three pillars of the ODOSOS mantra are not goals, but merely the means of making life easier.

The Panton Principles certainly make life easier in Open Data, and initiatives like the Linking Open Drug Data in which I participate will greatly benefit from people adopting them.

The Principles do not solve all problems. There is still a lot of 'Open Data' licensed with unrecommended licenses. For example, the NMRShiftDB uses a GNU FDL license, and data from supplementary material of Open Access journal articles is likely Creative Commons.

Another related initiative should certainly not go unnoticed either: Is it Open Data? is a service where you can try to resolve what the license is for one of those databases that are not quite Panton Principles compatible yet.

OK, one last thing. The Dutch government is bursting, and I want to listen to the music. With permission, I have been hacking the Panton Principles endorsement page, and injected some extra span elements, to make it easier to machine process (again, to make things easier), so you can use the following one-liner to calculate the number of people endorsing the principles:
    $ wget -O endorsed.html; xpath -q -e "//span[@class='signature']/span[@class='Country']/text()" endorsed.html | sort | uniq -c
The current count is hitting 44 now, and has not quite reached the 500 I had hoped for yet:
      1 Australia
      1 Canada
      1 Catalonia
      2 Espana
      2 France
      6 Germany
      1 Greece
      1 Italy
      1 Netherlands
      1 New Zealand
      1 Norway
      1 Poland
      1 Slovenia
      1 Sweden
      1 Switzerland
      1 The Netherlands
      9 UK
      1 U.K.
      1 United Kingdom
      1 United States of America
      9 USA
Does anyone know how we can convert this into some nice world map graphics with a few lines of code?
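Before any mapping, note that the tally above lists some countries under several spellings (UK, U.K., United Kingdom), because the one-liner counts the text exactly as people typed it. A minimal Python sketch of merging those variants follows. The alias table is my own guess at which spellings belong together and is not part of the endorsement page; the function name is made up for illustration.

```python
from collections import Counter

# Output of the one-liner above: a count, then the country as typed.
TALLY = """\
      1 Australia
      1 Canada
      1 Catalonia
      2 Espana
      2 France
      6 Germany
      1 Greece
      1 Italy
      1 Netherlands
      1 New Zealand
      1 Norway
      1 Poland
      1 Slovenia
      1 Sweden
      1 Switzerland
      1 The Netherlands
      9 UK
      1 U.K.
      1 United Kingdom
      1 United States of America
      9 USA
"""

# Hypothetical alias table (an assumption): spelling as typed -> one name.
ALIASES = {
    "Espana": "Spain",
    "The Netherlands": "Netherlands",
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "USA": "United States of America",
}


def merge_counts(tally: str) -> Counter:
    """Sum the `uniq -c` counts per country after applying ALIASES."""
    merged = Counter()
    for line in tally.splitlines():
        if not line.strip():
            continue  # skip empty lines
        # Split off the count; the rest of the line is the country name.
        count, country = line.split(maxsplit=1)
        country = country.strip()
        merged[ALIASES.get(country, country)] += int(count)
    return merged


if __name__ == "__main__":
    for country, count in merge_counts(TALLY).most_common():
        print(f"{count:7d} {country}")
```

The merged counts could then be fed to whatever mapping tool you like; names not in the alias table pass through unchanged.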

Now, I am looking for a bar in Uppsala to write up some ideas about what specifications are :)

Wednesday, November 04, 2009


While I am still looking around for an assistant/associate professor position, there are two milestones around my scientific work I want to briefly mention here. This blog is the 500th blog on chem-bla-ics, and the two CDK papers have combined reached 100+ citations as counted by Web-of-Science, as can be seen on my ResearcherID profile.

Wednesday, October 07, 2009

funded research to be OA as of 2010

Happy news from the Swedish Vetenskapsrådet (via Coturnix): as of 2010 all peer-reviewed journal papers from research it funds must be Open Access. I am not yet VR funded, but involved in a few VR grant applications. Not that that really matters, as I am happily publishing OA already.

Wednesday, June 17, 2009

No, PDFs really do suck!

A typical blog by Peter MR, The ICE-man: Scholary HTML not PDF, made (again) the point that PDF is to data what a hamburger is to a cow, in reply to a blog by Peter SF, Scholarly HTML.

This led to a discussion on FriendFeed. A couple of misconceptions:

"But how are we going to cite without paaaaaaaaaaaage nuuuuuuuuuuumbers?"
We don't. Many online-only journals can do without; there is DOI. And if that is not enough, the legal business has means of identifying paragraphs, etc, which should provide us with all the methods we could possibly need in science.

Typesetting of PDFs, in most journals, is superior than HTML, which is why I prefer to read a PDF version if it is available. It is nicer to the eyes.
Ummm... this is supposed to be Science, not a California Glossy. It seems that pretty looks are causing a major body count in the States. Otherwise, HTML+CSS can likely beat any pretty looks of PDF, or at least match it.

As I seem to be the only physicist/mathematician who comments on these sort of things, I feel like a broken record, but math support in browsers currently sucks extremely badly and this is a primary reason why we will continue to use PDF for quite some time.
HTML+MathML is well established, and default Firefox browsers have no problem showing mathematical equations. The Blue Obelisk QSAR descriptor ontology has been using such a setup for years. If you use TeX to author your equations, you can convert them to HTML too.

We can mine the data from the PDF text.
Theoretically, yes. Practically, it is money down the drain. PDF is particularly nasty here, as it breaks words at the end of a line, and can even make words consist of unlinked series of characters positioned at (x,y). PDF, however, can contain a lot of metadata, but that is merely a hack, and an unneeded workaround. Worse, it is hardly used regarding chemistry. PDF can contain PNG images which can contain CML; the tools are there, but not used, and there are more efficient technologies anyway.

I, for one, agree with Peter on PDF: it really sucks as a scientific communication medium.

Tuesday, March 03, 2009

Open Data versus Capitalism?

Ian Davis was recently quoted saying open data is more important than open source, which was pulled (out of context) from this presentation. The context was (a slide earlier): Data outlasts code.

As far as I can see, this is utter nonsense, even within context of the slide (see also this discussion on FriendFeed). Obviously, within the context of Ian it does make sense, and I hope he will respond in his blog and explain why he thinks Open Data is more special.

Without code, you have no way of accessing the data. Ask anyone to recover from a hard disk failure. In ODOSOS (Open Data, Open Source, Open Standards) they are all equal. You need them all for progress. You cannot single out one as being more important than another. Why would you anyway? Politics is all I can think of... All three combine and ensure our science is more efficient.

Fishy Perspective (what's in a name) comments on this in Data Vendetta, and I will take one quote out of context:
    Organizations are spending lot of money do generate proprietary data to safeguard its competitive edge, why you are convinced that they need to disclose that, no one is here for charity. Most the companies have their proprietary data policies, and they release the data in public only when there is sufficient overlap from publicly available databases.
Open Data versus Capitalism?
Companies are about money making, and there is nothing wrong with that. Others work to make the world a better place.

If Rosalind had not shared her data (following Data Vendetta, and not going into whether she did willingly or knowingly), all current pharmaceutical research would have been delayed by half a year(?), more(?)... who knows. Even that half year would have meant quite a lot of dead people. A lot of medicines would not have been discovered, or would not have hit the market when they did. Capitalism is one thing, not good, not bad, orthogonal really. Capitalism as ideology does not contradict Open Data. But sharing knowledge as Open Data always has a positive effect on mankind.

If you want to make money, please do, as much as you can. But please pick carefully what you want to make money on. Be creative! Do some innovation! Be bold! Go where no one has gone before!