Sunday, November 16, 2014

Programming in the Life Sciences #20: extracting data from JSON

I previously wrote about the JavaScript Object Notation (JSON) which has become a de facto standard for sharing data by web services. I personally still prefer something using the Resource Description Framework (RDF) because of its clear link to ontologies, but perhaps JSON-LD combines the best of both worlds.

The Open PHACTS API supports various formats, and JSON is the default format used by the ops.js library. However, the information returned by the Open PHACTS cache is complex, and generally includes more than you want to use in the next step. Therefore, we need to extract data from the JSON document, which was not covered in posts #10 or #11.

Let's start with the example JSON given in that post, and let's consider this is the value of a variable with the name jsonData:

{
    "id": 1,
    "name": "Foo",
    "price": 123,
    "tags": [ "Bar", "Eek" ],
    "stock": {
        "warehouse": 300,
        "retail": 20
    }
}

We can see that this JSON value starts with a map-like structure. We can also see that there is a list embedded, and another map. I guess that one of the reasons why JSON has taken off so quickly is how well it integrates with the JavaScript language: selecting content can be done with core language features, unlike, for example, the XPath statements needed for XML or SPARQL for RDF content. This is because the notation just follows the core data types of JavaScript, and data is stored as native data types and objects.

For example, to get the price value from the above JSON code, we use:

var price = jsonData.price;

Or, if we want to get the first value in the Bar-Eek list, we use:

var tag = jsonData.tags[0];

Or, if we want to inspect the warehouse stock:

var inStock = jsonData.stock.warehouse;
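Putting these together, here is a minimal sketch you can run as is. It assumes the JSON arrives as text (as it would from a web service) and is first turned into an object with JSON.parse(); the extra variable names (jsonText, tagsSeen, totalStock) are only there for illustration.

```javascript
// The example JSON as text, as a web service would return it.
var jsonText = '{ "id": 1, "name": "Foo", "price": 123, ' +
  '"tags": [ "Bar", "Eek" ], ' +
  '"stock": { "warehouse": 300, "retail": 20 } }';

// JSON.parse() turns the text into native JavaScript objects and arrays.
var jsonData = JSON.parse(jsonText);

var price = jsonData.price;             // a number
var tag = jsonData.tags[0];             // first item of the list
var inStock = jsonData.stock.warehouse; // value in the nested map

// Lists are arrays, so we can loop over them ...
var tagsSeen = [];
for (var i = 0; i < jsonData.tags.length; i++) {
  tagsSeen.push(jsonData.tags[i]);
}

// ... and maps are objects, so we can sum over their keys.
var totalStock = 0;
Object.keys(jsonData.stock).forEach(function (location) {
  totalStock += jsonData.stock[location];
});

console.log(price, tag, inStock, tagsSeen.join(","), totalStock);
```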

Now, the JSON returned by the Open PHACTS API has a lot more information. This is why the online, interactive documentation is so helpful: it shows the JSON. In fact, given that JSON is so widely used, there are many tools online that help you, such as jsoneditoronline.org (yes, it will show error messages if the syntax is wrong):


BTW, I also recommend installing a JSON viewer extension for Chrome or for Firefox. Once you have installed this plugin, you can not only read the JSON on Open PHACTS' interactive documentation page, but also open the Request URL in a separate browser window. Just copy/paste the URL from this output:


And with a JSON viewing extension, opening this https://beta.openphacts.org/1.3/pathways/... URL in your browser window will look something like:


And because these extensions typically use syntax highlighting, it is easier to understand how to access information from within your JavaScript code. For example, if we want the number of pathways in which the compound testosterone (the link is the ConceptWiki URL in the above example) is found, we can use this code:

var pathwayCount = jsonData.result.primaryTopic.pathway_count;
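A long chain like this throws a TypeError as soon as one level is missing from the response. As a sketch of one way to guard against that, the helper below walks down the keys one by one. Note that getPath() is a hypothetical helper, not part of ops.js, and the response object and its count are made up; only the key names are taken from the line above.

```javascript
// Hypothetical helper: walk down a list of keys and return undefined
// (instead of throwing a TypeError) when a level is missing.
function getPath(data, keys) {
  var current = data;
  for (var i = 0; i < keys.length; i++) {
    if (current === null || current === undefined) {
      return undefined;
    }
    current = current[keys[i]];
  }
  return current;
}

// Made-up stand-in for an API response; the count is not real data.
var exampleResponse = {
  result: { primaryTopic: { pathway_count: 7 } }
};

var keys = ["result", "primaryTopic", "pathway_count"];
var pathwayCount = getPath(exampleResponse, keys);
var missing = getPath({ result: {} }, keys);

console.log("pathway count: " + pathwayCount);
console.log("missing: " + missing);
```

Checking the result against undefined before using it then tells you whether the field was really there.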

Programming in the Life Sciences #19: debugging

Debugging is the process of finding and removing a fault in your code (the etymology goes further back than the moth story, I learned today). Being able to debug is an essential programming skill, and being able to program flawlessly is not enough; the bug can be outside your own code. (... there is much that can be written up about module interactions, APIs, documentation, etc, that lead to malfunctioning code ...)

While there are full debugging tools, finding where the bug is can often be done with simpler means:

  1. take notice of error messages
  2. add debug statements in your code
Error messages
Keeping track of error messages is the first starting point. This skill is almost an art: you need to have seen enough of them to understand how to interpret them. I guess error messages are the worst developed aspect of programming languages, and I do not frequently see programming language tutorials that discuss error messages. The field can certainly improve here.

However, error messages in general at least give an indication of where the problem occurs, often by a line number, though this number is not always accurate. The underlying cause is that when there is a problem in the code, it is not always clear what the problem is. For example, if there is a closing (or opening) bracket missing somewhere, how can the compiler decide what the author of the code meant? Web browsers like Firefox/Iceweasel and Chrome (Ctrl-Shift-J) have a console that displays compiler errors and warnings:


Another issue is that error messages can be cryptic and misleading. For example, the above error message "TypeError: searcher.bytag is not a function example1.html:73" is confusing for a starting programmer. Surely, the source code calls searcher.bytag() which definitely is a function. So, why does the compiler say it is not? The bug here, of course, is that the function called in the source code is not found: it should be byTag().
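The reason is that property names in JavaScript are case-sensitive, so bytag and byTag are two different names. A small sketch shows this; the searcher object here is a stand-in written for this example, not the real ops.js object.

```javascript
// Stand-in object with a byTag() method; not the real ops.js searcher.
var searcher = {
  byTag: function (tag) {
    return "searching for " + tag;
  }
};

// Property names are case-sensitive: only one of these exists.
console.log(typeof searcher.byTag); // "function"
console.log(typeof searcher.bytag); // "undefined"

// Calling the misspelled name throws the TypeError seen in the console.
var errorName = null;
try {
  searcher.bytag("example");
} catch (error) {
  errorName = error.name;
  console.log(error.name + ": " + error.message);
}
```

So when the console says something "is not a function", checking the spelling and the capitalization of the name is a good first step.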

But this bug can at least be detected during interpretation and execution of the code. That is, it is clear to the compiler that it doesn't know how to handle the code. Another common problem is the situation where the code looks fine (to the compiler), but the data it handles makes the code break down. For example, a variable doesn't have the expected value, leading to errors (e.g. null pointer-style). Therefore, understanding the variable values at a particular point in your code can be of great use.

Console output
A simple way to inspect the content of a variable is to use the console visible in the above screenshot. Many programming languages have their own call to send output there. Java has System.out.println() and JavaScript has console.log().


Thus, if you have some complex bit of code with multiple for-loops, if-else statements, etc, this can be used to see if some part of your code that you expect to be called really is:

console.log("Hey, I'm here!");

This can be very useful when using asynchronous web service calls! Similarly, you can see what the value of some variable is:

var label = jsonResponse.items[i].prefLabel;
console.log("label: " + label);

Also, because JavaScript is not a strongly typed programming language, I frequently find myself inspecting the data type of a variable:

var label = jsonResponse.items[i].prefLabel;

console.log("typeof label: " + typeof(label));
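To try these log statements without calling a web service, here is a small sketch that runs on its own. The jsonResponse object and its labels are made up for this example; a real response will look different.

```javascript
// Made-up response for illustration; a real one comes from the web service.
var jsonResponse = {
  items: [
    { prefLabel: "Aspirin" },
    { prefLabel: 42 },
    { }
  ]
};

var seenTypes = [];
for (var i = 0; i < jsonResponse.items.length; i++) {
  var label = jsonResponse.items[i].prefLabel;
  // Log both the value and its type, so unexpected data stands out.
  console.log("item " + i + " label: " + label + " (" + typeof label + ")");
  seenTypes.push(typeof label);
}

// For whole objects, JSON.stringify() gives a readable dump.
var dump = JSON.stringify(jsonResponse.items[0]);
console.log(dump);
```

The second and third items show the kind of surprises this catches: a label that is a number instead of a string, and a label that is missing altogether.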

Conclusion
These tools are very useful to find the location of a bug. And this matters. Yesterday I was trying to use the histogram code in example6.html to visualize a set of values with negative numbers (zeta potentials of nanomaterials, to be precise) and I was debugging the issue, trying to find where my code went wrong. I used the above approaches, and the array of values looked in order, though different from the original example. But still the histogram was not showing up. Well, after hours, and having asked someone else to look at the code too, and having ruled out many alternatives, she pointed out that the problem was not in the JavaScript part of the code, but in the HTML: I was mixing up how default JavaScript and the d3.js library add SVG content to the HTML data model. That is, I was using <div id="chart">, which works with document.getElementById("chart").innerHTML, but needed to use <div class="chart"> with the d3.select(".chart").innerHTML code I was using later.

OK, that bug was on my account. However, it still was not working: I did see a histogram, but it didn't look good. Again debugging, and after again much too long, I found out that this was a bug in the d3.js code that makes it impossible to use their histogram example code for negative values. Again, once I knew where the bug was, I could Google and quickly find the solution for it on StackOverflow.

So, the workflow of debugging, at a top level, looks like:
  1. find where the problem is
  2. try to solve the problem

Happy debugging!

Programming in the Life Sciences #18: Molecular weight distribution of compounds with measured activities against a target (and other examples)

Eating your own dog food is a rather useful concept in anything where a solution or product can change over time. This applies to science as much as programming. Even when we think things are static, they may not really be. This is often because we underestimate or are just ignorant of factors that influence the outcome. By repeatedly dogfooding, the expert will immediately recognize the effect of different influencing factors.

Examples? A politician who actually lives in the neighborhood he develops policies for. A principal investigator who tries to reproduce an experiment from one of her/his postdocs or PhD students. And, of course, the programmer who should use his own libraries himself.

Dogfooding, however, is not the single solution to development; in fact, it can be easily integrated with other models. But it can serve as an early warning system, as the communication channels between you and yourself are typically much shorter than between you and the customer: citizen, peer reviewer, and user, following the above examples. Besides that, it also helps you better understand the thing that is being developed, because you will see the influencing factors in action and everything becomes more empirical, rather than just theoretical ("making money scarce is a good incentive for people to get off the couch", "but we have been using this experiment for years", "that situation in this source code will never be reached", etc).

And this also applies when teaching. So, you check the purity of the starting materials in your organic synthesis labs, and you check if your code examples still run. And you try things you have not done before, just to test the theory that if X is possible, Y should be possible too, because that is what you tell your students.

As an example, I told the "Programming in the Life Sciences" students that in the literature researchers compare properties of actives and inactives, for example the molecular weight, just to get some idea of what data you are looking at, up to uses of things like Lipinski's Rule of Five. Therefore, I developed an HTML+JavaScript page using Ian Dunlop's excellent ops.js and the impressive d3.js library to use the Open PHACTS Application Programming Interface.
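As a rough illustration of what such a comparison boils down to, the sketch below counts molecular weights per bin for two groups. This is plain JavaScript written for this example, not the ops.js or d3.js code used on the page: binValues() is a hypothetical helper, and the weights are invented numbers, not data from Open PHACTS.

```javascript
// Hypothetical helper: count how many values fall in each bin of a given width.
function binValues(values, min, binWidth, binCount) {
  var counts = [];
  for (var b = 0; b < binCount; b++) {
    counts.push(0);
  }
  for (var i = 0; i < values.length; i++) {
    var index = Math.floor((values[i] - min) / binWidth);
    // Skip values outside the binned range.
    if (index >= 0 && index < binCount) {
      counts[index]++;
    }
  }
  return counts;
}

// Invented molecular weights (g/mol), not data from Open PHACTS.
var actives = [151.2, 180.2, 288.4, 310.5, 455.0];
var inactives = [120.1, 505.6, 610.7, 98.0];

// Five bins of 150 g/mol each, starting at 0.
var activeCounts = binValues(actives, 0, 150, 5);
var inactiveCounts = binValues(inactives, 0, 150, 5);

console.log("actives:   " + activeCounts.join(" "));
console.log("inactives: " + inactiveCounts.join(" "));
```

The two arrays of counts are what a histogram would then draw as bars, one series per group.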


And compared to last year, when only the source was available, all these examples can now be tested online on the following GitHub pages (using their brilliant gh-pages system):

  • Example 1: simple example where the Open PHACTS Identity Resolution System (name to identifier) is used
  • Example 4: uses d3.js to show a bar plot of the number of times a particular unit is used to measure activities of paracetamol
  • Example 5: the same as example 4, but then as pie chart
  • Example 6: the above molecular weight example
Of course, what the students produced last year, and probably will produce this year, is much more impressive. And, of course, compared to full applications (I recommend browsing this list by the Open PHACTS Foundation), these are just mock-ups. These examples are just like figures in a paper, making a specific point. But that is how these pages are used: as arguments to answer a biological question. In fact, and that is outside the scope of this course, just think of what you can do with this approach in terms of living research papers. Think Sweave!

Thursday, November 06, 2014

Programming in the Life Sciences #17: The Open PHACTS scientific questions

Data needs for answering the scientific questions. From the paper discussed in this post (Open Access).
I think the authors of the Open PHACTS proposal made the right choice in defining a small set of questions that the solution to be developed could be tested against. The questions being specific, it is much easier to understand the needs. In fact, I suspect it may even be a very useful form of requirements analysis, one that makes it hard to keep using vague terms. Open PHACTS has come up with 20 questions (doi:10.1016/j.drudis.2013.05.008; Open Access):

  1. Give me all oxidoreductase inhibitors active <100 nM in human and mouse
  2. Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
  3. Given a target, find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives
  4. For a given interaction profile – give me similar compounds
  5. The current Factor Xa lead series is characterized by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X
  6. A project is considering protein kinase C alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that could modulate the target directly? I.e. return all compounds active in assays where the resolution is at least at the level of the target family (i.e. PKC) from structured assay databases and the literature
  7. Give me all active compounds on a given target with the relevant assay data
  8. Identify all known protein–protein interaction inhibitors
  9. For a given compound, give me the interaction profile with targets
  10. For a given compound, summarize all ‘similar compounds’ and their activities
  11. Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not)
  12. For my given compound, which targets have been patented in the context of Alzheimer's disease?
  13. Which ligands have been described for a particular target associated with transthyretin-related amyloidosis, what is their affinity for that target and how far are they advanced into preclinical/clinical phases, with links to publications/patents describing these interactions?
  14. Target druggability: compounds directed against target X have been tested in which indications? Which new targets have appeared recently in the patent literature for a disease? Has the target been screened against in AZ before? What information on in vitro or in vivo screens has already been performed on a compound?
  15. Which chemical series have been shown to be active against target X? Which new targets have been associated with disease Y? Which companies are working on target X or disease Y?
  16. Which compounds are known to be activators of targets that relate to Parkinson's disease or Alzheimer's disease
  17. For my specific target, which active compounds have been reported in the literature? What is also known about upstream and downstream targets?
  18. Compounds that agonize targets in pathway X assayed in only functional assays with a potency <1 μM
  19. Give me the compound(s) that hit most specifically the multiple targets in a given pathway (disease)
  20. For a given disease/indication, give me all targets in the pathway and all active compounds hitting them
Students in the Programming in the Life Sciences course will this year pick one of these questions as a starting point in the project. The goal is to develop an HTML+JavaScript solution that will answer the selected question. There is freedom to tweak the question to personal interests, of course. By selecting a simpler pharmacological question than last year, more time and effort can be put into visualization and interpretation of the found data.

Saturday, October 25, 2014

The Web - What is the issue?

From Wikipedia.
Last week I gave an invited presentation in the nice library of the Royal Society of Chemistry, at the What's in a Name? The Unsung Heroes of Open Innovation: Nomenclature and Terminology meeting. I was asked to speak about HTML in this context, something I have worked with as a channel for communication of scientific knowledge and data for almost 20 years now. Mostly in the area of small molecules, starting with the Dictionary of Organic Chemistry, which is interesting because I presented the web technologies behind this project also in London, in October 10 years ago!

As a spoiler, the bottom line of my presentation is that we're not even using 10% of what the web technologies have to offer us. Slowly we are getting there, but too slowly in my opinion. By some weird behavioral law, the larger the organization, the less innovation gets done (some pointers).

Anyway, I only had 20 minutes, and in that time you cannot do justice to the web technologies.

Papers that I mention in these slides are given below.
Wiener, H. Structural determination of paraffin boiling points. Journal of the American Chemical Society 69, 17-20 (1947). URL http://dx.doi.org/10.1021/ja01193a005.
Murray-Rust, P., Rzepa, H. S., Williamson, M. J. & Willighagen, E. L. Chemical markup, XML, and the world wide web. 5. applications of chemical metadata in RSS aggregators. J. Chem. Inf. Comput. Sci. 44, 462-469 (2004). URL http://repository.ubn.ru.nl/bitstream/2066/60101/1/60101.pdf.
Rzepa, H. S., Murray-Rust, P. & Whitaker, B. J. The application of chemical multipurpose internet mail extensions (chemical MIME) internet standards to electronic mail and world wide web information exchange. J. Chem. Inf. Comput. Sci. 38, 976-982 (1998). URL http://dx.doi.org/10.1021/ci9803233.
Willighagen, E. et al. Userscripts for the life sciences. BMC Bioinformatics 8, 487+ (2007). URL http://dx.doi.org/10.1186/1471-2105-8-487.
Willighagen, E. L. & Brändle, M. P. Resource description framework technologies in chemistry. Journal of cheminformatics 3, 15+ (2011). URL http://dx.doi.org/10.1186/1758-2946-3-15.