Sunday, July 30, 2017

Wikidata visualizes SMILES strings with John Mayfield's CDK Depict


SVG depiction of D-ribulose.
Wikidata is building up a curated collection of information about chemicals. A lot of data originates from Wikipedia, but active users are augmenting this information. Of particular interest, in this respect, is Sebastian's PubChem ID curation work (he can use a few helping hands!). Followers of my blog know that I am using Wikidata as a source of compound ID mapping data for BridgeDb.

Each chemical can have one or two associated SMILES strings: a canonical SMILES, which excludes any chirality, and an isomeric SMILES, which does include chirality. Because statement values can be linked to a formatter URL, Wikidata often shows values as links. For example, EPA CompTox Dashboard identifiers link to that database. Kopiersperre used this approach to link SMILES strings to John Mayfield's CDK Depict.

Until two weeks ago, the formatter URL for both the canonical and isomeric SMILES was the same. I changed that, so that when an isomeric SMILES is depicted, it shows the perceived R,S (CIP) annotation as well. That should help further curation of Wikidata and Wikipedia content.
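
To make the formatter URL idea concrete, here is a minimal sketch of how such a template could be expanded. Wikidata formatter URLs use `$1` as the placeholder for the statement value. Everything else below is an assumption for illustration: the `example.org` base URL, the `smi` and `annotate=cip` query parameters, and the `format_url` helper are hypothetical and are not taken from the actual Wikidata configuration or from CDK Depict's documentation. The SMILES string is just an illustrative chiral example.

```python
from urllib.parse import quote

# Hypothetical formatter URL templates (assumptions, not the real ones).
# "$1" is the placeholder that gets replaced by the statement value.
CANONICAL_TEMPLATE = "https://example.org/depict/svg?smi=$1"
ISOMERIC_TEMPLATE = "https://example.org/depict/svg?smi=$1&annotate=cip"


def format_url(template: str, value: str) -> str:
    """Fill the $1 placeholder with a percent-encoded value.

    SMILES strings contain characters such as '[', '@', '(' and '='
    that are not safe inside a query string, so they are encoded here.
    """
    return template.replace("$1", quote(value, safe=""))


if __name__ == "__main__":
    smiles = "N[C@@H](C)C(=O)O"  # illustrative SMILES with one stereocentre
    print(format_url(CANONICAL_TEMPLATE, smiles))
    print(format_url(ISOMERIC_TEMPLATE, smiles))
```

Keeping two templates, one per SMILES property, is what allows only the isomeric depiction to request the stereo annotation while the canonical one stays unannotated.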

Wednesday, July 05, 2017

new paper: "A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury"

Figure from the article. CC-BY.
One of the projects I worked on at Karolinska Institutet with Prof. Grafström was the idea of combining transcriptomics data with dose-response data, because we wanted to know if there was a relation between the structures of chemicals (drugs, toxicants, etc.) and how biological systems react to them. Basically, we were testing the whole idea behind quantitative structure-activity relationship (QSAR) modeling.

Using data from the Connectivity Map (Cmap, doi:10.1126/science.1132939) and NCI60, we set out to do just that. My role in this work was to explore the actual structure-activity relationship. The Chemistry Development Kit (doi:10.1186/s13321-017-0220-4) was used to calculate molecular descriptors, and we used various machine learning approaches to explore possible regression models. The bottom line was that it was not possible to correlate the chemical structures with the biological activities. We explored the reason and ascribe this to the high diversity of the chemical structures in the Cmap data set. In fact, the chemicals in that study were selected based on chemical diversity. All the details can be found in this new paper.

It's important to note that these findings do not invalidate the QSAR concept; they just show that the compounds were selected very unfortunately, making exploration of this idea impossible by design.

However, using the transcriptomics data and a method developed by Juuso Parkkinen, it is possible to find multivariate patterns. In fact, what we saw is more than is presented in this paper, as we have not yet been able to back up further findings with supporting evidence. This paper, however, presents experimental confirmation that predictions based on this component model, coined the Predictive Toxicogenomics Gene Space, actually make sense. Biological interpretation is presented using a variety of bioinformatics analyses, but a full mechanistic description of the components is yet to be developed. My expectation is that we will be able to link these components to key events in biological responses to exposure to toxicants.

 Kohonen, P., Parkkinen, J. A., Willighagen, E. L., Ceder, R., Wennerberg, K., Kaski, S., Grafström, R. C., Jul. 2017. A transcriptomics data-driven gene space accurately predicts liver cytopathology and drug-induced liver injury. Nature Communications 8. 
https://doi.org/10.1038/ncomms15932

Saturday, June 24, 2017

The Elsevier-SciHub story

I blogged earlier today why I try to publish all my work gold Open Access. My ImpactStory profile shows I score 93% and notes that 10% of scientists in general score in that range. But then again, some publishers do make it hard for us to publish gold Open Access. And when the STM industry then spreads FUD for its own good, and only its own good ("Sci-Hub does not add any value to the scholarly community.", doi:10.1038/nature.2017.22196), I get annoyed. Particularly so, as the system makes young scientists believe that transferring copyright to a publisher (for free, in most cases) is a normal thing to do.

As said, I have no doubt that under current copyright law it was to be expected that Sci-Hub was going to be judged to violate that law. I also blogged previously that I believe copyright is not doing our society a favor (mind you, all my literature is copyrighted, and much of it I license to readers, allowing them to read my work, copy it (e.g. share it with colleagues and students), and even modify it, e.g. allowing journals to change their website layout without having to ask me). About copyright, I still highly recommend Free Culture by Prof. Lessig (who unfortunately did not run for the presidency).

To get a better understanding of Sci-Hub and its popularity (I believe gold Open Access is the real solution), I looked at what literature was in Wikidata, using Scholia (wonderful work by Finn Nielsen, see arXiv). I added a few papers and annotated papers with their main subjects. I guess there must be more literature about Sci-Hub, but this is the "co-occurring topics graph" provided by Scholia at the time of writing:


It's a growing story.

As a PhD student, I was often confronted with Closed Access.

It sounds like a problem not so common in western Europe, but it was when I was a fresh student (around 1994). The Radboud University library certainly did not have all journals, and for one journal I had to go to a research department and sit in their coffee room. Not a problem at all. Big Package deals improved access, but created a vendor lock-in. And we're paying Big Time for these deals now, with insane year-over-year inflation of the prices.

But even then, I was repeatedly confronted with not having access to literature I wanted to read. Not just me, btw; for PhD students this was very common too. In fact, they regularly visited other universities, just to make some copies there. An article basically cost a PhD student a train trip and a euro or two in copying costs (besides the package deal cost for the visited university, of course). Nothing much has changed, despite the fact that in this electronic age the cost should have gone down significantly, instead of up.

That Elsevier sues Sci-Hub (about Sci-Hub, see this and this), I can understand. It's good to have a court decide what is more important: Elsevier's profit or the human right of access to literature (doi:10.1038/nature.2017.22196). This is extremely important. How does our society want to continue? Do we want a fact-based society, where dissemination of knowledge is essential, or do we want a society where power and money decide who benefits from knowledge?

But the STM industry claiming that Sci-Hub does not contribute to the scholarly community is plain FUD. In fact, it's outright lies. The fact that Nature does not call out those lies in their write-up is very disappointing, indeed.

I do not know if it is the ultimate solution, but I strongly believe in a knowledge dissemination system where knowledge can be freely read, modified, and redistributed, whether that is Open Science or gold Open Access.

Therefore, I am proud to be one of the 10 Open Access proponents at Maastricht University. And a huge thank you to our library for continuing to push Open Access in Maastricht.


Sunday, June 11, 2017

You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what research it is that I do. Many see my work as engineering, but I vigorously disagree. Some days, though, it's just too easy to give up rather than explain things yet again. The question came up several times again over the past few months, and it has been suggested that I make a choice. That's modern academia for you: you have to excel in something tiny, and a complex, hard-to-explain ambition loses out to a system based on funding, buzzwords, "impact", and such. So, again, I am trying to make up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk with in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books from Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how well we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, I am tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with the experience of others? Are you also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)