Sunday, December 21, 2014

Data sharing

When you want to share data with others, the first thing you need to decide is under what terms. Do you want others to be able to just look at the data, or also to change the format (Excel to CSV, for example)? Do you want the recipient to be allowed to share the data within their research group, or perhaps with collaborators?

Very often this is arranged informally, in good faith, or by some consortium, confidentiality, or non-disclosure agreement. However, these approaches do not scale very well. When that matters, data licenses are an alternative approach (not necessarily better!).

Madeleine Gray has written some words on her EUDAT blog about the EUDAT License Wizard. This wizard talks you through the things you would like to agree on, and in the end suggests a possible data license. It seems pretty well done, and the first few questions focus on an important aspect: do you even have the right to change the license (i.e. are you the copyright owner)?

Mind you, there are huge differences between countries around data copyright.

Saturday, December 06, 2014

Nature publications and the ReadCube: see but no touch. Or... score for the news
item by Van Noorden (see text).
The big (non-science) news this week was the announcement that papers from a wide selection of Nature Publishing Group (NPG) journals can now be shared, allowing others without a subscription to read the paper (press release, news item). That is not Open Access to me (that requires the right to modify and redistribute modifications), but it does remove the pay-wall and therefore speeds up dissemination. Whether this news is good or bad depends on your perspective. I would rather see more NPG journals go Open Access, like Nature Communications. But I have gotten used to the publishing industry moving slowly.

Thanks to (a sister company of ReadCube) and the DOI we can easily find all discussion around the news item by Van Noorden (doi:10.1038/nature.2014.16460) in blogs. From the free, open dissemination of scientific knowledge ideal point of view the product is limited:
Of course, it is a free boon, and from that perspective this is a welcome move:
And it is a welcome move! We have all been arguing that there are so many people not able to access the closed-access papers, including politicians, SMEs, people from third-world countries, and people with a deadly illness. These people can now access these Nature papers. Sort of. Because you still need a subscription to get one of these ReadCube links to read the paper without pay-wall. It lowers the barrier, but the barrier is not entirely gone yet.

Unless people with a subscription start sharing these links as they are invited to. At this moment it is very clear when and how these links can be shared. Now, this is an interesting approach. In most jurisdictions you are allowed to link to copyrighted material, but the user agreement (UA) can put additional restrictions. When and how this UA applies is unclear.

For example, Ross Mounce suggested using PubMed Commons. Triggered by that idea, I experimented with the new offering and added a link on CiteULike. I asked if this is acceptable use, and it turned out to be unclear at this moment:
This tweet also shows the state of things, and I congratulate NPG on this experiment. Few established publishing companies seem willing to make these kinds of significant steps! Changing the nature of your business is hard, and NPG trying to find a route to Open Science that doesn't kill the company is something I welcome, even if the road is still long!

A nice example of how long this road is, is the claim about "read-only". I think this is bogus. ReadCube has done a marvelous job at rewriting the PDF-viewer and manages to render a paper fully in JavaScript. That's worth a congratulations! It depends on strong hardware and a modern browser. Older browsers with older JavaScript engines will not be able to do the job (try opening a page with KDE's Konqueror). But clearly it is your browser that does the rendering. Therefore:
  1. you do download the paper
  2. you can copy the content (and locally save it)
  3. you can print the content
Because the content is sent to your browser. It only takes basic HTML/JavaScript skills to notice this. And the comment that you cannot print them?? Duh, never heard of screenshots?! Oh, and you will also notice a good amount of tracking information. I noted previously that it knows at least which institute originally shared the ReadCube link, possibly in more detail. Others have concerns too.

Importantly, it is only a matter of time before someone clever with too much time on their hands (fortunately for ReadCube, scientists are too busy writing grant applications) uses the techniques this ReadCube application uses to render the paper into something else, say a PDF, to make the printing process a bit easier.

Of course, the question is, will that be allowed. And that is what matters. In the next weeks we will learn what NPG allows us to do and, therefore, what their current position is about Open Science. These are exciting times!

Sunday, November 16, 2014

Programming in the Life Sciences #20: extracting data from JSON

I previously wrote about the JavaScript Object Notation (JSON) which has become a de facto standard for sharing data by web services. I personally still prefer something using the Resource Description Framework (RDF) because of its clear link to ontologies, but perhaps JSON-LD combines the best of both worlds.

The Open PHACTS API supports various formats, and JSON is the default format used by the ops.js library. However, the information returned by the Open PHACTS cache is complex, and generally includes more than you want to use in the next step. Therefore, you need to extract data from the JSON document, which was not covered in post #10 or #11.

Let's start with the example JSON given in that post, and let's consider this is the value of a variable with the name jsonData:

    {
        "id": 1,
        "name": "Foo",
        "price": 123,
        "tags": [ "Bar", "Eek" ],
        "stock": {
            "warehouse": 300,
            "retail": 20
        }
    }

We can see that this JSON value starts with a map-like structure. We can also see that there is a list embedded, and another map. I guess that one of the reasons why JSON has taken off so well is how well it integrates with the JavaScript language: selecting content can be done in terms of core language features, unlike, for example, the XPath statements needed for XML or SPARQL for RDF content. This is because the notation just follows core data types of JavaScript, and data is stored as native data types and objects.

For example, to get the price value from the above JSON code, we use:

var price = jsonData.price;

Or, if we want to get the first value in the Bar-Eek list, we use:

var tag = jsonData.tags[0];

Or, if we want to inspect the warehouse stock:

var inStock = jsonData.stock.warehouse;
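
Put together, these three lookups can also be tried outside the browser. The sketch below assumes the example JSON arrives as a string, as it would from a web service, and uses the standard JSON.parse() call to turn it into a JavaScript object first. The variable name jsonText is only an illustration.

```javascript
// The example document as a string, as a web service would deliver it.
var jsonText = '{ "id": 1, "name": "Foo", "price": 123,' +
  ' "tags": [ "Bar", "Eek" ],' +
  ' "stock": { "warehouse": 300, "retail": 20 } }';

// JSON.parse() turns the text into native JavaScript objects and arrays.
var jsonData = JSON.parse(jsonText);

var price = jsonData.price;             // a number
var tag = jsonData.tags[0];             // first item of the list
var inStock = jsonData.stock.warehouse; // value in the nested map

console.log(price, tag, inStock);
```

If the text is not valid JSON, JSON.parse() throws a SyntaxError, which is one more reason to check the document in a viewer first.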

Now, the JSON returned by the Open PHACTS API has a lot more information. This is why the online, interactive documentation is so helpful: it shows the JSON. In fact, given that JSON is so much used, there are many tools online that help you, such as (yes, it will show error messages if the syntax is wrong):

BTW, I also recommend installing a JSON viewer extension for Chrome or for Firefox. Once you have installed this plugin, you can not just read the JSON on Open PHACTS' interactive documentation page, but also open the Request URL into a separate browser window. Just copy/paste the URL from this output:

And with a JSON viewing extension, opening this URL in your browser window will look something like:

And because these extensions typically use syntax highlighting, it is easier to understand how to access information from within your JavaScript code. For example, if we want the number of pathways in which the compound testosterone (the link is the ConceptWiki URL in the above example) is found, we can use this code:

var pathwayCount = jsonData.result.primaryTopic.pathway_count;
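
A long chain like this breaks with a TypeError as soon as one of the intermediate fields is missing. A minimal sketch of a more forgiving lookup is given below. The getPath() helper is my own illustration and not part of ops.js, and the response object is a made-up, heavily trimmed stand-in for what the API returns.

```javascript
// Walk down a list of keys; return undefined as soon as one is missing.
function getPath(data, keys) {
  var current = data;
  for (var i = 0; i < keys.length; i++) {
    if (current === null || typeof current !== "object") return undefined;
    current = current[keys[i]];
  }
  return current;
}

// Hypothetical, heavily trimmed response; the real one has many more fields
// and the count of 5 is invented for this example.
var jsonData = { result: { primaryTopic: { pathway_count: 5 } } };

var pathwayCount = getPath(jsonData, ["result", "primaryTopic", "pathway_count"]);
var missing = getPath(jsonData, ["result", "noSuchField", "pathway_count"]);
console.log(pathwayCount, missing);
```

The second lookup returns undefined instead of stopping the script, so the calling code can decide what to show when a field is absent.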

Programming in the Life Sciences #19: debugging

Debugging is the process of finding and removing a fault in your code (the etymology goes further back than the moth story, I learned today). Being able to debug is an essential programming skill, and being able to program flawlessly is not enough; the bug can be outside your own code. (... there is much that can be written up about module interactions, APIs, documentation, etc, that lead to malfunctioning code ...)

While there are full debugging tools, achieving the task of finding where the bug is can often be reached with simpler means:
  1. take notice of error messages
  2. add debug statements in your code
Error messages
Keeping track of error messages is the first starting point. This skill is almost an art: it requires having seen enough of them to understand how to interpret them. I guess error messages are among the worst developed aspects of programming languages, and I do not frequently see programming language tutorials that discuss error messages. The field can certainly improve here.

However, error messages at least generally indicate where the problem occurs, often with a line number, though this number is not always accurate. The underlying cause is that when the code is broken, it is not always clear what exactly is wrong. For example, if a closing (or opening) bracket is missing somewhere, how can the compiler decide what the author of the code meant? Web browsers like Firefox/Iceweasel and Chrome (Ctrl-Shift-J) have a console that displays compiler errors and warnings:

Another issue is that error messages can be cryptic and misleading. For example, the above error message "TypeError: searcher.bytag is not a function example1.html:73" is confusing for a starting programmer. Surely, the source code calls searcher.bytag(), which definitely is a function. So, why does the compiler say it is not? The bug here, of course, is that the function called in the source code is not found: it should be byTag().
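
The mechanism behind this message can be reproduced in a few lines. In the sketch below, the searcher object is a hypothetical stand-in with a single byTag() method, not the real ops.js object; the exact wording of the error message differs between browsers.

```javascript
// Hypothetical stand-in for the searcher object used in example1.html.
var searcher = {
  byTag: function (tag) { return "results for " + tag; }
};

// Property names are case-sensitive: only the byTag spelling exists.
console.log(typeof searcher.byTag);
console.log(typeof searcher.bytag);

var caught = null;
try {
  searcher.bytag("testosterone"); // looks up undefined, then tries to call it
} catch (error) {
  caught = error;
  console.log(error.name + ": " + error.message);
}
```

So the message is literally correct: searcher.bytag evaluates to undefined, and undefined is not a function.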

But this bug at least can be detected during interpretation and execution of the code. That is, it is clear to the compiler that it doesn't know how to handle the code. Another common problem is the situation where the code looks fine (to the compiler), but the data it handles makes the code break down. For example, a variable doesn't have the expected value, leading to errors (e.g. null pointer-style). Therefore, understanding the variable values at a particular point in your code can be of great use.

Console output
A simple way to inspect the content of a variable is to use the console visible in the above screenshot. Many programming languages have their own call to send output there. Java has System.out.println() and JavaScript has console.log().

Thus, if you have some complex bit of code with multiple for-loops, if-else statements, etc, this can be used to see if some part of your code that you expect to be called really is:

console.log("He, I'm here!");

This can be very useful when using asynchronous web service calls! Similarly, see what the value of some variable is:

var label = jsonResponse.items[i].prefLabel;
console.log("label: " + label);

Also, because JavaScript is not a strongly typed programming language, I frequently find myself inspecting the data type of a variable:

var label = jsonResponse.items[i].prefLabel;

console.log("typeof label: " + typeof(label));
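
Two results of typeof tend to surprise: it reports "object" for both null and arrays. A small helper that tells these apart could look like the sketch below. The describeType() name and the sample values are made up for this illustration.

```javascript
// Report a more specific type than typeof does for null and arrays.
function describeType(value) {
  if (value === null) return "null";           // typeof null is "object"
  if (Array.isArray(value)) return "array";    // typeof [] is "object" too
  return typeof value;
}

// A few values as they might turn up in a parsed JSON response (made up here).
var samples = ["testosterone", 288.4, true, null, ["a", "b"], { a: 1 }];

for (var i = 0; i < samples.length; i++) {
  console.log(JSON.stringify(samples[i]) + " is a " + describeType(samples[i]));
}
```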

These tools are very useful to find the location of a bug. And this matters. Yesterday I was trying to use the histogram code in example6.html to visualize a set of values with negative numbers (zeta potentials of nanomaterials, to be precise) and I was debugging the issue, trying to find where my code went wrong. I used the above approaches, and the array of values looked in order, but different from the original example. But still the histogram was not showing up. Well, after hours, and having asked someone else to look at the code too, and having ruled out many alternatives, she pointed out that the problem was not in the JavaScript part of the code, but in the HTML: I was mixing up how default JavaScript and the d3.js library add SVG content to the HTML data model. That is, I was using <div id="chart">, which works with document.getElementById("chart").innerHTML, but needed to use <div class="chart"> to match the ".chart" selector in the d3.js code I was using later.

OK, that bug was on my account. However, it still was not working: I did see a histogram, but it didn't look good. Again debugging, and after again much too long, I found out that this was a bug in the d3.js code that makes it impossible to use their histogram example code for negative values. Again, once I knew where the bug was, I could Google and quickly found the solution for it on StackOverflow.
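
To check whether a set of values with negative numbers is binned as expected, independent of any plotting library, the counting step can be done in plain JavaScript. The sketch below is not the d3.js code and not the StackOverflow fix; binValues() and the zeta potentials are made up for this illustration.

```javascript
// Count how many values fall in each of binCount equally wide bins.
function binValues(values, binCount) {
  var min = Math.min.apply(null, values);
  var max = Math.max.apply(null, values);
  var width = (max - min) / binCount || 1; // avoid zero width if all values are equal
  var bins = [];
  for (var b = 0; b < binCount; b++) {
    bins.push({ from: min + b * width, to: min + (b + 1) * width, count: 0 });
  }
  for (var i = 0; i < values.length; i++) {
    // Offset by the minimum so negative values map to valid bin indices.
    var index = Math.floor((values[i] - min) / width);
    if (index >= binCount) index = binCount - 1; // the maximum goes in the last bin
    bins[index].count++;
  }
  return bins;
}

// Made-up zeta potentials in mV, only to exercise the function.
var zeta = [-40, -35, -30, -10, 0, 5, 10];
var bins = binValues(zeta, 5);
for (var j = 0; j < bins.length; j++) {
  console.log(bins[j].from + " to " + bins[j].to + ": " + bins[j].count);
}
```

Printing the bins like this, before any drawing happens, helps to decide whether a bad-looking histogram is a data problem or a rendering problem.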

So, the workflow of debugging at a top level, looks like:
  1. find where the problem is
  2. try to solve the problem
Happy debugging!

Programming in the Life Sciences #18: Molecular weight distribution of compounds with measured activities against a target (and other examples)

Eating your own dog food is a rather useful concept in anything where a solution or product can change over time. This applies to science as much as programming. Even when we think things are static, they may not really be. This is often because we underestimate or are just ignorant of factors that influence the outcome. By repeatedly dogfooding, the expert will immediately recognize the effect of different influencing factors.

Examples? A politician who actually lives in a neighborhood for which he develops policies. A principal investigator who tries to reproduce an experiment from one of her/his postdocs or PhD students. And, of course, the programmer who should use his own libraries himself.

Dogfooding, however, is not the single solution to development; in fact, it can be easily integrated with other models. But it can serve as an early warning system, as the communication channels between you and yourself are typically much shorter than between you and the customer: citizen, peer reviewer, and user, following the above examples. Besides that, it also helps you better understand the thing that is being developed, because you will see the influencing factors in action and everything becomes more empirical, rather than just theoretical ("making money scarce is a good incentive for people to get off the couch", "but we have been using this experiment for years", "that situation in this source code will never be reached", etc).

And this also applies when teaching. So, you check the purity of the starting materials in your organic synthesis labs, and you check if your code examples still run. And you try things you have not done before, just to test the theory that if X is possible, Y should be possible too, because that is what you tell your students.

As an example, I told the "Programming in the Life Sciences" students that in the literature researchers compare properties of actives and inactives, for example the molecular weight. This ranges from just getting some idea of what data you are looking at, up to uses of things like Lipinski's Rule of Five. Therefore, I developed an HTML+JavaScript page using Ian Dunlop's excellent ops.js and the impressive d3.js library to use the Open PHACTS Application Programming Interface:

And compared to last year, when only the source was available, all these examples can now be tested online on the following GitHub pages (using their brilliant gh-pages system):

  • Example 1: simple example where the Open PHACTS Identity Resolution System (name to identifier) is used
  • Example 4: uses d3.js to show a bar plot of the number of times a particular unit is used to measure activities of paracetamol
  • Example 5: the same as example 3, but then as pie chart
  • Example 6: the above molecular weight example
Of course, what the students last year and probably this year will produce is much more impressive. And, of course, compared to full applications (I recommend browsing this list by the Open PHACTS Foundation), these are just mock ups, and they are. These examples are just like figures in a paper, making a specific point. But that is how these pages are used: as arguments to answer a biological question. In fact, and that is outside the scope of this course, just think of what you can do with this approach in terms of living research papers. Think Sweave!

Thursday, November 06, 2014

Programming in the Life Sciences #17: The Open PHACTS scientific questions

Data needs for answering the scientific questions. From
the paper discussed in this post (Open Access).
I think the authors of the Open PHACTS proposal made the right choice in defining a small set of questions that the solution to be developed could be tested against. Because the questions are specific, it is much easier to understand the needs. In fact, I suspect it may even be a very useful form of requirement analysis, and it makes it hard to keep using vague terms. Open PHACTS has come up with 20 questions (doi:10.1016/j.drudis.2013.05.008; Open Access):

  1. Give me all oxidoreductase inhibitors active <100 nM in human and mouse
  2. Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
  3. Given a target, find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives
  4. For a given interaction profile – give me similar compounds
  5. The current Factor Xa lead series is characterized by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X
  6. A project is considering protein kinase C alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that could modulate the target directly? I.e. return all compounds active in assays where the resolution is at least at the level of the target family (i.e. PKC) from structured assay databases and the literature
  7. Give me all active compounds on a given target with the relevant assay data
  8. Identify all known protein–protein interaction inhibitors
  9. For a given compound, give me the interaction profile with targets
  10. For a given compound, summarize all ‘similar compounds’ and their activities
  11. Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not)
  12. For my given compound, which targets have been patented in the context of Alzheimer's disease?
  13. Which ligands have been described for a particular target associated with transthyretin-related amyloidosis, what is their affinity for that target and how far are they advanced into preclinical/clinical phases, with links to publications/patents describing these interactions?
  14. Target druggability: compounds directed against target X have been tested in which indications? Which new targets have appeared recently in the patent literature for a disease? Has the target been screened against in AZ before? What information on in vitro or in vivo screens has already been performed on a compound?
  15. Which chemical series have been shown to be active against target X? Which new targets have been associated with disease Y? Which companies are working on target X or disease Y?
  16. Which compounds are known to be activators of targets that relate to Parkinson's disease or Alzheimer's disease
  17. For my specific target, which active compounds have been reported in the literature? What is also known about upstream and downstream targets?
  18. Compounds that agonize targets in pathway X assayed in only functional assays with a potency <1 μM
  19. Give me the compound(s) that hit most specifically the multiple targets in a given pathway (disease)
  20. For a given disease/indication, give me all targets in the pathway and all active compounds hitting them
Students in the Programming in the Life Sciences course will this year pick one of these questions as a starting point for their project. The goal is to develop an HTML+JavaScript solution that will answer the selected question. There is freedom to tweak the question to personal interests, of course. By selecting a simpler pharmacological question than last year, more time and effort can be put into visualization and interpretation of the found data.

Saturday, October 25, 2014

The Web - What is the issue?

From Wikipedia.
Last week I gave an invited presentation in the nice library of the Royal Society of Chemistry, at the What's in a Name? The Unsung Heroes of Open Innovation: Nomenclature and Terminology meeting. I was asked to speak about HTML in this context, something I have worked with as a channel for communication of scientific knowledge and data for almost 20 years now. Mostly in the area of small molecules, starting with the Dictionary of Organic Chemistry, which is interesting because I presented the web technologies behind this project also in London, in October 10 years ago!

As a spoiler, the bottom line of my presentation is that we're not even using 10% of what the web technologies have to offer us. Slowly we are getting there, but too slowly in my opinion. By some weird behavioral law, the larger the organization, the less innovation gets done (some pointers).

Anyway, I only had 20 minutes, and in that time you cannot do justice to the web technologies.

Papers that I mention in these slides are given below.
Wiener, H. Structural determination of paraffin boiling points. J. Am. Chem. Soc. 69, 17-20 (1947).
Murray-Rust, P., Rzepa, H. S., Williamson, M. J. & Willighagen, E. L. Chemical markup, XML, and the world wide web. 5. applications of chemical metadata in RSS aggregators. J. Chem. Inf. Comput. Sci. 44, 462-469 (2004).
Rzepa, H. S., Murray-Rust, P. & Whitaker, B. J. The application of chemical multipurpose internet mail extensions (chemical MIME) internet standards to electronic mail and world wide web information exchange. J. Chem. Inf. Comput. Sci. 38, 976-982 (1998).
Willighagen, E. et al. Userscripts for the life sciences. BMC Bioinformatics 8, 487+ (2007).
Willighagen, E. L. & Brändle, M. P. Resource description framework technologies in chemistry. J. Cheminform. 3, 15+ (2011).

The history of the Woordenboek Organische Chemie

Chemistry students at the Radboud University in Nijmegen (then called the Catholic University of Nijmegen) got internet access in spring 1994. BTW, the Catholic part was only reflected in the curriculum in that philosophy was an obligatory course. The internet access part meant a few things:
  1. xblast
  2. HTML and web servers
  3. email
Our university also had a campus-wide IT group that experimented with new technologies. So, many students had internet access via cable early on (though I do not remember when that got introduced).

During these years I was studying organic chemistry, and I started something to help me learn name reactions and trivial names. I realized that the knowledge base I had built up would be useful to others too, and hence I started the Woordenboek Organische Chemie (WOC). This project no longer exists, and is largely redundant with Wikipedia and other resources. The first public version goes back to 1996, but most of the history is lost, sadly.

Here are a few screenshots I have been able to dig up from the Internet Archive. A pretty recent version is from 2003 and this is what it looked like in those days:

The oldest version I have been able to dig up is from January 1998:

Originally, I started with specific HTML pages, but then quickly realized the importance of separating content from display. The first data format was a custom format which looked an awful lot like JSON, but we later moved to XML, which was easier to work with. The sources are still available from SourceForge, where we uploaded the data once we realized the importance of proper data licensing. This screenshot also shows that the website won Ralf Claessen's Chemistry Index award. That was in December 1997.

Unfortunately, I never published the website, which I should have, because I realize each day how nice the technologies we played with were. But at least I got it mentioned in two papers. The first time was in the 2000 JChemPaint paper (doi:10.3390/50100093). JChemPaint at the time had functionality to download 2D chemical diagrams from the WOC using CAS registry numbers. The second time was in the CMLRSS paper, where the WOC was one of the providers of a CMLRSS feed.

In 2004 I gave a presentation about which HTML technologies were being used in the WOC, also in London, almost 10 years ago! Darn, I should have thought of that, so that I could've mentioned that in my presentation this week! Here are the slides of back then:

Krause, S., Willighagen, E. L. & Steinbeck, C. JChemPaint - using the collaborative forces of the internet to develop a free editor for 2D chemical structures. Molecules 5, 93-98 (2000).
Murray-Rust, P., Rzepa, H. S., Williamson, M. J. & Willighagen, E. L. Chemical markup, XML, and the world wide web. 5. applications of chemical metadata in RSS aggregators. J Chem Inf Comput Sci 44, 462-469 (2004).

Friday, October 03, 2014

Jenkins-CI: automating lab processes

Our group organizes public Science Cafes where people from Maastricht University can see the research it is involved in. Yesterday it was my turn again, and I gave a presentation showing the BiGCaT and eNanoMapper Jenkins-CI installations (set up by Nuno) which I have been using for a variety of processes which Jenkins conveniently runs based on input it gets.

For example, I have it compile and run test suites for a variety of software projects (like the CDK, NanoJava), but also have it build R packages, and even daily run Andra Waagmeester's code to create RDF for WikiPathways. And the use of Jenkins-CI is not limited to dry lab processes: Ioannis Moutsatsos recently showed nice work at Novartis that uses Jenkins for high-throughput screening and data/image analysis.

Thursday, September 25, 2014

Slides at the Open PHACTS community workshop (June 26)

First MSP graduates.
It seems I had not yet posted my slides of the presentation at the 6th Open PHACTS community workshop. At this meeting I gave an overview of the Programming in the Life Sciences course we give to 2nd and 3rd year students of the Maastricht Science Programme (MSP; some participants graduated this summer, see the photo on the right side).

This course will again be given this year, starting in about a month from now, and I am looking forward to all the cool apps the students come up with! Given that the Open PHACTS API has been extended with pathways and disease information, they will likely be even cooler than last year.

OpenTox Europe 2014 presentation: "Open PHACTS: solutions and the foundation"

CC-BY 2.0 by Dmitry Valberg.
Where the OpenTox Europe 2013 presentation focused on the technical layers of Open PHACTS, this presentation addressed a key knowledge management solution to scientific questions and the Open PHACTS Foundation. I stress here too, as in the slides, that the presentation is on behalf of the full consortium!

For the knowledge management, I think Open PHACTS did really interesting work in the field of "identity" and I am happy to have been involved in this [Brenninkmeijer2012]. The platform implementation is, furthermore, based on the BridgeDb platform, which originated in our group [VanIersel2010]. The slides outline the scientific issues addressed by this solution:

PS, many more Open PHACTS presentations are found here.

Brenninkmeijer, C. et al. Scientific lenses over linked data: An approach to support task specific views of the data. a vision. In Linked Science 2012 - Tackling Big Data (2012).

van Iersel, M. et al. The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11, 5+ (2010).

Tuesday, September 16, 2014

Do a postdoc with eNanoMapper

CC-BY-SA from Zherebetskyy @ WP.
Details will still have to follow as they are being worked out, but with Cristian Munteanu having accepted an associate professorship, I need a new postdoc to fill his place, and I am reopening the position I had almost a year ago. If you would like to work in a systems biology group (BiGCaT), are pro Open Science, like to work on tools for safe-by-design nanomaterials, and have skills in one or more of bioinformatics, chemoinformatics, statistics, coding, and ontologies, then this position may be something for you.

The primary project for this position is eNanoMapper and you will be working within the large European NanoSafety Cluster network, though interactions are not limited to the EU.

If you have interest and cannot wait until the details of the position come out, please send me an email. first.lastname @ General questions about eNanoMapper and our BiGCaT solutions for nanosafety are also welcome in the comments.

Sunday, September 14, 2014

CDK: Element and Isotope information

When reading files, the format in one way or another carries implicit information you may need for some algorithms. Element and isotope information is a key example. Typically, the element symbol is provided in the file, but not the mass number or the isotope implied. You would need to read the format specification to find out which properties are implicitly meant. The idea here is that information about elements and isotopes is pretty standardized by other organizations such as the IUPAC. Such default element and isotope properties are exposed in the CDK by the classes Elements and Isotopes. I am extending my Groovy Cheminformatics with the CDK with these bits.

The Elements class provides information about the element's atomic number, symbol, periodic table group and period, covalent radius and Van der Waals radius, and Pauling electronegativity (Groovy code):

import org.openscience.cdk.config.Elements

Elements lithium = Elements.Lithium
println "atomic number: " + lithium.number()
println "symbol: " + lithium.symbol()
println "periodic group: " + lithium.group()
println "periodic period: " + lithium.period()
println "covalent radius: " + lithium.covalentRadius()
println "Vanderwaals radius: " + lithium.vdwRadius()
println "electronegativity: " + lithium.electronegativity()

For example, for lithium this gives:

atomic number: 3
symbol: Li
periodic group: 1
periodic period: 2
covalent radius: 1.34
Vanderwaals radius: 2.2
electronegativity: 0.98

Similarly, there is the Isotopes class to help you look up isotope information. For example, you can get all isotopes for an element or just the major isotope:

import org.openscience.cdk.config.Isotopes

isofac = Isotopes.getInstance()
isotopes = isofac.getIsotopes("H")
majorIsotope = isofac.getMajorIsotope("H")
for (isotope in isotopes) {
  print "${isotope.massNumber}${isotope.symbol}: " +
    "${isotope.exactMass} ${isotope.naturalAbundance}%"
  if (majorIsotope.massNumber == isotope.massNumber)
    print " (major isotope)"
  println ""
}

For hydrogen this gives:

1H: 1.007825032 99.9885% (major isotope)
2H: 2.014101778 0.0115%
3H: 3.016049278 0.0%
4H: 4.02781 0.0%
5H: 5.03531 0.0%
6H: 6.04494 0.0%
7H: 7.05275 0.0%

This class is also used by the getMajorIsotopeMass() method in the MolecularFormulaManipulator class to calculate the monoisotopic mass of a molecule:

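molFormula = MolecularFormulaManipulator.getMolecularFormula(
  "C2H6O", SilentChemObjectBuilder.getInstance()) // ethanol
println "Monoisotopic mass: " +
  MolecularFormulaManipulator.getMajorIsotopeMass(molFormula)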

The output for ethanol looks like:

Monoisotopic mass: 46.041864812
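
The arithmetic behind this number is simple enough to check by hand: for each element, multiply the atom count by the exact mass of its major isotope, and add it all up. The sketch below does this in plain JavaScript, not with the CDK. The hydrogen mass is the one printed above and carbon-12 is 12 by definition, but the oxygen-16 mass is a literature value that does not come from this post, and the monoisotopicMass() helper is made up for this illustration.

```javascript
// Exact masses of the most abundant isotopes. The hydrogen value is the one
// printed above; carbon-12 is 12 by definition; the oxygen-16 value is a
// literature value that is not given in this post.
var majorIsotopeMass = { C: 12.0, H: 1.007825032, O: 15.9949146196 };

// Sum count * mass over the elements of a formula given as { symbol: count }.
function monoisotopicMass(formula) {
  var mass = 0.0;
  for (var symbol in formula) {
    mass += formula[symbol] * majorIsotopeMass[symbol];
  }
  return mass;
}

var ethanol = { C: 2, H: 6, O: 1 }; // C2H6O
console.log("Monoisotopic mass: " + monoisotopicMass(ethanol).toFixed(6));
```

The result agrees with the CDK output above to well within the rounding of the isotope masses used here.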

Saturday, September 13, 2014

CDK 1.5.8, Zenodo, GitHub, and DOIs

Screenshot from John's blog post.
John released CDK 1.5.8, which has a few nice goodies, like a new renderer. The full changelog is available. An interesting aspect of this release is that it uses ZENODO to make the release citable with a DOI. And that is relevant because it makes it simpler (and a lot cheaper!) to track its impact, e.g. with #altmetrics. And that matters too, because no one really has a clue how to decide which scientist is better than another, and which scientist should and should not get funding. While we know peer review of literature is severely limited, we happily accept it to determine career futures.

Anyways, so, we have a DOI now for a CDK release. So, everyone using the CDK in research can cite this specific CDK release in their papers with this DOI. Of course, most publishers still don't support providing reference lists as a list of DOIs and often do not show the DOI there, but all this is a step forward. John listed the DOI with a nice ZENODO-provided icon in the release post.

If you follow the DOI you go to the ZENODO website (they effectively act as a publishing platform). It is this page that I want to continue talking about, and in particular about the list of authors. The webpage provides two alternatives. The first is the most prominent one when you first visit the page:

This looks pretty good, I think. It seems to have picked up a list of authors and, looking at the list, not from the standard AUTHORS file but from the commit messages. However, that is unsuited for the CDK, whose repository history runs from CVS, via SVN, to Git: only the Git contributors show up. The list seems sorted by the amount of contributions, but note that Christoph is missing. His work predates the Git era.

The second list of "authors" is given in the bottom right of the page, as "Cite As":

This suggestion is different, though it seems reasonable to assume the et al. (missing dot) refers to the rest of the authors of the first list. In the BibTeX export the full author list shows up again, supporting that idea.

Correct citation?
This makes me wonder: who are the authors of this release? Clearly, this version includes code from all authors in some sort of way. Code from some original authors may have long been replaced with newer code. And we noted the problem of missing authors, caused by the incomplete version control history.

An alternative is to consider this release as a product of those people who have contributed patches since the previous release. In fact, this is something we noted as important in the past and now always report when making a release. For the 1.5.8 release that list looks like:

That is, this approach basically takes an accepted approach in publishing: papers describing updates of running projects involve only the people that contributed to that released work.

Therefore, I think the proper citation for this CDK 1.5.8 release should be:
    John May, Egon Willighagen, Mark Vine, Oliver Stücker, Andy Howlett, Mark Williamson, Sambit Gaan, Alison Choy (2014). cdk: CDK Release 1.5.8. ZENODO. 10.5281/zenodo.11681
Also note the correct spelling of the author names, though one can argue that they should have correctly spelled their names in the Git commit messages. Here are some challenges for GitHub in adopting the ORCID, I guess.

The question is, however, how do we get ZENODO to do this the way we want it to? I think the above citation makes much more sense, but others may have good reasons why the current practice is better. What should ZENODO pick up to get the author provenance from?

Sunday, September 07, 2014

Open knowledge dissemination (with @IFTTT)

An important part of science is communication. That is why we publish. New insights are useless if they sit on some desk. Instead, reuse counts. This communication is not just about the facts, but also a means to establish research networks. Efficient research requires this: you cannot be an expert in everything or at least not be experienced with everything. That is, for most things you do, there is another researcher that can do it faster. This is probably one of the reasons why many Open Science projects actually work, despite limited funding: they are very efficient.

Readers of my blog know a bit about my research, and know how important data exchange is to me. Equally important is letting people know what I do. You can see that in my literature: I strive towards knowledge integration (think UserScripts, think CMLRSS, think Linked Open Drug Data) and efficient methods for data exchange. Just because I need this to get statistically significant patterns. After all, my background is chemometrics primarily. Cheminformatics was my hobby, explaining the mashup.

FriendFeed was a brilliant platform for disseminating research and also for exchange of data. Actually, it is a brilliant platform, but when they sold themselves to FaceBook, it got a lot quieter there. And, as said, communication needs a community, and without listeners it is just not the same. Scientists just moved to different social platforms, and it is no surprise FriendFeed didn't show up in Richard van Noorden's recent analysis. A lot of good things happened on FriendFeed, but one was that it used RSS feeds and users could indicate which information sources they liked to show up there. Try my FriendFeed account. Better even was that listeners could select which of my information sources they do not want to listen to. For example, if you were not interested in Flickr images of Person X, but the other sources were interesting, you just silenced that source. Brilliant!

But this feature of using RSS to aggregate dissemination channels is not repeated by other networks, and If This Then That fills that gap. Unlike FriendFeed it does not aggregate the items, but sends them to external social networks (and many other systems), including FaceBook and Twitter. It does a lot more than RSS feeds (e.g. check out the Android app), but that is an important one for me and the point of this blog post.

Actually, they have taken the idea of sending around events to a next level, allowing you to tune how it shows up. An example of that is what you see in the screenshot: I played with how news items from the RSS feed with changes I made to WikiPathways are shown. After a few iterations, I ended up with this "recipe":

The grey boxes are information from the RSS feed. The first iteration (see bottom tweet in the screenshot) only contained a custom prefix "Wikipathways edit:" followed by the {{EntryTitle}} and {{EntryUrl}}. I realized that without the commit message it would not be fun, and added the {{EntryContent}} (one but last tweet in screenshot). Then I realized that having "Pathway" twice (once from my prefix, once from the {{EntryTitle}}) was not nice to look at either, and I ended up with the above "Wiki{{EntryTitle}} edit:", as visible in the top tweet in the screenshot. Too bad the RSS feed of WikiPathways doesn't have graphics :(
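
To make the recipe idea concrete, here is a minimal sketch of how such a template gets filled in. This is not IFTTT's implementation, just the same substitution idea in a few lines of Groovy. The entry values and the render closure are made up for illustration, and the exact layout of the last template after "edit:" is an assumption.

```groovy
// Hypothetical RSS entry; all field values are made up for illustration.
def entry = [
    EntryTitle  : "ExampleTitle",
    EntryUrl    : "http://example.org/entry",
    EntryContent: "example edit comment"
]

// Fill {{Name}} placeholders from the entry; unknown names are left as-is.
def render = { String template, Map fields ->
    template.replaceAll(/\{\{(\w+)\}\}/) { all, name ->
        fields.containsKey(name) ? fields[name] : all
    }
}

// First iteration: custom prefix, then title and URL.
def first = render("Wikipathways edit: {{EntryTitle}} {{EntryUrl}}", entry)
// Last iteration: the prefix is merged with the title (layout after "edit:" assumed).
def last = render("Wiki{{EntryTitle}} edit: {{EntryContent}} {{EntryUrl}}", entry)

println first
println last
```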

At this moment I am using a few outlets: I use Twitter to send about anything, like I did with FriendFeed. Sadly, Twitter doesn't have the same power to select which tweets you like to listen to. Not interested in the changes I make to WikiPathways? Sorry, you'll have to live with it. Well, you can also try my FaceBook account, where I route fewer things. But there you can like and comment, but I will not respond.

Anyway, my message is, give IFTTT a try!

Saturday, September 06, 2014

First steps in Open Notebook Science

Scheme 2 from this Beilstein Journal of Organic Chemistry paper by Frank Hahn et al.
A few weeks back I blogged about my first Open Notebook Science entry. That post suggests I would look at a few ONS service providers, but, honestly, Open Notebook Science Network serves my needs well.

What I have in mind, and will soon advocate, is that the total synthesis approach from organic chemistry fits chem- and bioinformatics research. It may not be perfect, and perhaps somewhat artificial (no pun intended), but I like the idea.

Compound to Compound
Basically, a lab notebook entry should be a step of something larger. You don't write Bioclipse from scratch. You don't do a metabolomics pathway enrichment analysis in one step, either. It's steps, each one taking you from one state to another. Ah, another nice analogy (see automata theory)! In terms of organic chemistry, from one compound to another. The importance here is that the analogy shows that there is no step you should not report. The same applies to cheminformatics: you cannot report a QSAR model without explaining how you cleaned up that SDF file you got from paper X (which still commonly is practised).

Methods Sections
Organic chemistry literature has well-defined templates on how to report the method for a reaction, including minimal reporting standards for the experimental results. For example, you must report chemical shifts and an elemental composition. In cheminformatics we do not have such templates, but there is no reason not to have them. Another feature that must be reported is the yield.

Reaction yield
The analogy with organic chemistry continues: each step has a yield. We must report this. I am not sure how, and this is one of the things I am exploring and will be part of my argument. In fact, the point of keeping track of variance introduced is something I have been advocating for longer. I think it really matters. We, as a research field, now publish a lot of cheminformatics and chemometrics work, without taking into account the yield of methods, though, for obvious reasons, very much more in chemometrics than in cheminformatics. I won't go into that now, but there is indeed a good part of benchmark work, but the point is, any cheminformatics "reaction" step should be benchmarked.

Total synthesis
The final aspect is that by taking this analogy, there is a clear protocol for how cheminformatics, or bioinformatics, work must be reported: as a sequence of detailed small steps. It also means that intermediate "products" can be continued with in multiple ways: you get a directed graph of methods you applied and results you got.

You get something like this:

Created with Graphviz Workspace.

The EWx codes refer to entries in my lab notebook:
  1. EW4: Finding nodes in Anopheles gambiae pathways with IUPAC names
  2. EW5: Finding nodes in Homo sapiens pathways with IUPAC names
  3. EW6: Finding nodes in Rattus norvegicus pathways with IUPAC names
  4. EW7: converting metabolite Labels into DataNodes in WikiPathways GPML
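
Such a directed graph of notebook entries is easy to write down for Graphviz. The sketch below uses the four entries listed above as nodes. The edges are an assumption, invented only to show the idea; the actual dependencies between the entries are the ones in the original figure.

```groovy
// Notebook entries as listed above (ID -> title).
def entries = [
    EW4: "Finding nodes in Anopheles gambiae pathways with IUPAC names",
    EW5: "Finding nodes in Homo sapiens pathways with IUPAC names",
    EW6: "Finding nodes in Rattus norvegicus pathways with IUPAC names",
    EW7: "converting metabolite Labels into DataNodes in WikiPathways GPML"
]

// ASSUMPTION: these edges are invented for illustration only; the real
// dependencies between the entries are the ones in the original figure.
def edges = [["EW4", "EW7"], ["EW5", "EW7"], ["EW6", "EW7"]]

// Write the directed graph in Graphviz DOT syntax.
def dot = new StringBuilder("digraph notebook {\n")
entries.each { id, title ->
    dot << "  ${id} [label=\"${id}: ${title}\"];\n"
}
edges.each { from, to ->
    dot << "  ${from} -> ${to};\n"
}
dot << "}\n"

print dot
```

Branching the "synthesis" then simply means adding another edge that starts from an existing entry.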

Open Notebook Science
Of course, the above applies also if you do not do Open Notebook Science (ONS). In fact, the above outline is not really different from how I did my research before. However, I see value in using the ONS approach here. By having it Open, it

  1. requires me to be as detailed as possible
  2. allows others to repeat it
Combine this with the advantage of the total synthesis analogy:
  1. "reactions" can be performed in reasonable time
  2. easy branching of the synthesis
  3. clear methodology that can be repeated for other "compounds"
  4. step towards minimal reporting standards for cheminformatics methods
  5. clear reporting structure that is compatible with journal requirements
OK, that is more or less the paper I want to write up and submit to the Jean-Claude Bradley Memorial Issue in the Journal of Cheminformatics and Chemistry Central. It is an idea, something that helps me, and I hope more people find useful bits in this approach.

Saturday, August 30, 2014

On Open Access in The Netherlands

Yesterday, I received a letter from the Association of Universities The Netherlands (VSNU, @deVSNU) about Open Access. The Netherlands is for research a very interesting country: it's small, meaning we have few resources to establish and maintain high profile centers. We also believe strong education benefits from distribution, so we have many good universities, rather than a few excelling universities. Mind you, this clouds that we absolutely do have excelling research institutes and research groups; they just are not concentrated in one university.

Another important aspect is that all those Dutch universities are expected to compete with each other for funding. As a result I have experienced rather interesting collaborations between universities. That's a downside of a small country: everyone knows each other, often in way too much detail. But my point is that the Dutch can be rather conservative. That kills innovation, and is in my opinion a key reason why we are not breaking into the top 50 of rankings, more than concentration. Concentration of funding in Top research institutes has not been extensively evaluated, but I think the efficiency is not proven higher than previous funding approaches.

Anyway, this letter I received is part of their Open Access program. Here too, the Dutch universities are conservative (well, relative to my views, at least). Now, the Open Access debate is not so interesting, because it primarily ends up about who pays who (boring) and whether we should go gold or green (beside the point, see below), and, sadly, here too many people think about who pays who again (still boring).

Therefore, given the outlined importance and impact of Dutch research, I found it relevant to post about the progress of Open Access in my small country. The letter is available in English.

Basically, the letter is an answer to an earlier letter from our government about Open Access, and it warns about actions that will soon be undertaken (so, not really pro-active). However,
    "[they] are also appealing to you to continue to advocate free access to your own scientific publications."
Well, I have, not so actively, and maybe this post can be the start of a change. Because what basically bothers me is that the Open Access discussion, also in The Netherlands, is biased. And indeed, the letter continues with a section about gold and green access. If the VSNU really wants to promote free access to research, it should not even accept green. We all know that it is not about being able to look at (free), but to be able to mix and improve. Reuse. Continue. Stand on shoulders. The fact that this letter focuses on publications only, does not spend a word on reuse, is rather depressing and not giving me even the slightest hint that The Netherlands will break into that Top 50 any time soon.

Overall, the letter is relatively positive for the Open Access movement, though reactive. They still have some explaining to do:
    "The golden route is more complex. However, many believe that in the end it is a
    more sustainable route to Open Access."
(Or maybe readers can explain to me what is complex about the golden route?)

The following is a rather interesting section, but really only when they had focused on Open Access in its pure form that allows research reuse. I think it now leaves you with a low starting point bargaining with resistant publisher lawyers and managers that have long lost the interest of the academics in favor of that of the share holders:
    For the past ten years, publishers have been offering journals in package deals referred to as Big Deals. Shortly negotiations with the major publishers about these Big Deals will take place, including Elsevier, Springer and Wiley. The Dutch universities have expressed their wish to make agreements with these publishers about the transition to Open Access as part of those Big Deals. Universities expect publishers to take serious steps to facilitate that transition.
I hope the VSNU will clarify what they mean by "serious". Because they all came up with "me too" solutions (setting up new OA journals) without seriously changing their model. No large publisher dared making the flagship journals full gold Open Access. That is serious business; all we see now is scribbling in the margin.

Perhaps that is the reason of the wish to be in the top 50. Maybe the VSNU just wants a better bargaining position.

The letter ends with what researchers can do. And with that, they are spot on:
    As a researcher, you can play a vital role in the transition to Open Access. We have 
    mentioned the possibility of depositing articles in the repository of your own
    university. But there is more. It’s important to consider that researchers play a key 
    role in the publishing process: as providers of the scientific content, as reviewers 
    and as members of editorial and advisory boards. We hope that where ever possible, 
    you will ask publishers to convert to an Open Access model.
What any researcher can already do to promote (proper) Open Access:

  1. stop reviewing and publishing closed-access papers (you have way too many review requests already, and some filtering will not hurt you)
  2. stop reviewing and publishing for non-gold Open Access journals (step further than the first item)
  3. submit only to full-gold Open Access journals (plenty of options; importantly, the quality and impact of your paper is not dependent on the journal, but on you. If not, you're just a bad author and researcher and should go back to school or start learning from feedback on your Open Notebook Science, so that you improve your act before you submit; really, it happens to the best of us: multidisciplinary research is hard: you cannot excel in biology and chemistry and statistics and informatics and computer science and data analysis and materials science and be a perfect and creative linguist (well, not all of us, anyway))
  4. put your previous mistakenly closed-access papers in university repositories (most Dutch universities have solutions; not all yet)
  5. make previously published closed-access papers gold Open Access (yes, you can! I am in the process of doing this for the CDK I paper, and other ACS papers will follow)
  6. get an ORCID
  7. use #altmetrics to see that gold Open Access gives you more impact for your papers too (service providers include ImpactStory, Plum Analytics, etc)
Of course, it is not only about publications. Again, the VSNU would do well to learn that research is not the same as publications. Besides sending letters, I think the VSNU can do this to promote Open Science, which is what I hope they are after:
  1. negotiate with the government and major science and funding agencies (KNAW, NWO) to stop focusing on publications as primary output
  2. start focusing on output other than publications (e.g. data sets, software) even if you have not ended negotiations with others, just to set a proper example
  3. make research outcomes machine readable (read this interesting post from our national library)
  4. actively explore business models around Open Science (and not have your universities' spin-off departments only know about patent law, ignore the rest of the world)
  5. adopt the ORCID nationwide, starting Jan 2015
  6. start using #altmetrics to get a better perspective of the performance of your members
Of course, I am more than willing to help the VSNU with this transition. I can be reached at the Department of Bioinformatics - BiGCaT, NUTRIM, FHML, Maastricht University. There are many options I have missed here (like data repositories, data citing, DOIs, and whatever).

PS. my ImpactStory profile will tell you that more than 80% of my publications are Open Access. Not all gold yet, but I am working on changing that for some old papers.

Tuesday, July 22, 2014

Open Notebook Science ONSSP #1:

As promised, I slowly set out to explore ONSSPs (Open Notebook Science Service Providers). I do not have a full overview of solutions yet but found LabTrove and Open Notebook Science Network. The latter is a more clear ONSSP while the first seems to be the software.

So, my first experiment is with Open Notebook Science Network (ONSN). The platform uses WordPress, a proven technology. I am not a huge fan of the set up which has a lot of features making it sometimes hard to find what you need. Indeed, my first write up ended up as a Page rather than a Post. On the upside, there is a huge community around it, with experts in every city (literally!). But my ONS is now online and you can monitor my Open research with this RSS feed.

One of the downsides is that the editor is not oriented at structured data, though there is a feature for Forms which I may need to explore later. My first experiment was a quick, small hack: upgrade Bioclipse with OPSIN 1.6. As discussed in my #jcbms talk, I think it may be good for cheminformatics if we really start writing up step-by-step descriptions of common tasks.

My first observations are that it is an easy platform to work with. Embedding images is easy, and there should be options for chemistry extensions. For example, there is a Jmol plugin for WordPress, there are plugins for Semantic Web support (no clue which one I would recommend), and extensions for bibliographies are available too, if I am not mistaken. And, we also already see my ORCID prominently listed, and I am not sure if I did this, or whether the ONSN people added this as a default feature.

Even better is the GitHub support @ONScience made me aware of, by @benbalter. The instructions were not crystal clear to me (see issues #25 and #26), but after some suggested fixes (pull request #27) it started working, and I now have a backup of my ONS at GitHub!

So, it looks like I am going to play with this ONSSP a lot more.

Friday, July 18, 2014

Open Notebook Science: also for cheminformatics

Last Monday the Jean-Claude Bradley Memorial Symposium was held in Cambridge (slide decks). Jean-Claude was a remarkable man and I spoke at the meeting on several things and also how he made me jealous with his Open Notebook Science work. I had the pleasure to work with him on a RDF representation of solubility data.

It took me a long time to group my thoughts and write the abstract I submitted to the meeting:
    I always believed that with Open Data, Open Source, and Open Standards I was doing the right thing; that it was enough for a better science. However, I have come to the realization that these features are not enough. Surely, they aid Open collaborations, though not even sufficient there, but they fail horribly in the "scientific method." Because while ODOSOS makes work reproducible, it lacks the context needed by scholars to understand what it solved. That is, it details out in much detail how some scientific question is answered, but not what question that was. As such, it fails to follow the established practices in scholarly research. In this presentation I will show how I should have done some of my research, and ponder on reasons why I had not done so.
And it also took me a long time and a lot of stress to get together some slides, but I managed in the end:

During the talk I promised to start doing Open Notebook Science (ONS) for my research, and I am currently exploring ONS platforms.

The meeting itself was great. There was a group of about 40 people in Cambridge and another 15 online, and most of them into Open Science or at least wanting to learn what it is about. I met old friends and new people, including a just-graduated Maastricht Science Programme student (one that I did not have in my class last year). Coverage on Twitter was pretty good (using the #jcbms hashtag, an archive) with some 90 people using the hashtag.
Several initiatives seem to be evolving, including an ONS initiative and a memorial special issue. All these will need help from the community. The time is right.