Pages

Wednesday, November 22, 2017

Monitoring changes to Wikidata pages of your interest

Source: User:Cmglee, Wikipedia, CC-BY-SA 3.0
Wikidata is awesome! In just 5 years they have bootstrapped one of the most promising platforms for the future of science.Whether you like the tools more, or the CCZero, there is coolness for everyone. I'm proud to have been able to contribute my small 1x1 LEGO bricks to this masterpiece and hope to continue this for many years to come. There are many people doing awesome stuff, and many have way more time, have better skills, etc. Yes, I'm thinking here if Finn, Magnus, Andra, the whole Su team, and many, many more.

The point of this post, is to highlight something this matters and something that comes up over and over again and where there just are solutions, like implemented by Wikidata: provenance. We're talking a lot about FAIR data. Most of FAIR data is not technological, it's social. And most of the technical work going on now, is basically to overcome those social barriers.

We teach our students to cite primarily literature and only that. There is a clear reason for that: the primary literature has the arguments (experiments, reasoning, etc) that back a conclusion. Not any citing is good enough: it has to be the exact right shape (think about that Lego brick). This track record of our experiments is a wonderful and essential idea. It removes the need for faith and even trust. Faith is for the religious, trust is for the lazy. Now, without being lazy, it is hard to make progress. But as I have said before (Trust has no place in science #2), every scholar should realize that "trust" is just a social equivalent of saying you are lazy. There is nothing wrong with being lazy: a side effect of it is innovation.

Ideally, we do not have to trust any data source. If we must, we just check where that source got its data from. That works for scholarly literature, and works for other sources too. Sadly, scholarly literature has a horrible track record here: we only cite stuff we find more trustworthy. For example, we prefer to cite articles from journals with high impact factors. Second, we don't cite data. Nor software. As a scholarly community, we don't care much about that (this is where lazy is evil, btw!).

Wikidata made the effort to make a rich provenance model. It has a rich system of referring to information sources. It has version control. And it keeps track of who made the changes.

Of all the awesomeness of Wikidata, Magnus is one of the people that know how to use that awesomeness. He developed many tools that make doing to right thing a lot easier. I'm a big fan of his SourceMD, QuickStatement, and two newer tools, ORCIDator and SPARQL-RC. This latter tool leverages SPARQL (and thus Wikidata RDF) and the version control system. By passing a query, it will list all changes in a given time period. I am still looking for a tool that can show my all changes for items I originally created, but this already is a great tool to monitor the quality of crowdsourcing for data in Wikidata I care about. No trust, but the ability to verify.

Here's a screenshot for the changes of (some of my) output of scientific output I am author of:


Sunday, November 12, 2017

New paper: "WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research"


Focus on metabolic pathways increases the
number of annotated metabolites, further improving
the usability in metabolomics. Image: CC-BY.
TL;DR: the WikiPathways project (many developers in the USA and Europe, contributors from around the world, and many people curating content, etc) has published a new paper (doi:10.1093/nar/gkx1064/4612963), with a slight focus on metabolism. 

Full story
Almost six years ago my family and I moved back to The Netherlands for personal reasons. Workwise, I had a great time in Stockholm and Uppsala (two wonderful universities; thanks to Ola Spjuth, Bengt Fadeel, and Roland Grafström), but being immigrant in another country is not easy, not even for a western immigrant in a western country. ("There is evil among us.")

We had decided to return to our home country, The Netherlands. By sheer coincidence, I spoke with Chris Evelo in the week directly following that weekend. I had visited his group in March that year, while attending a COST-action about NanoQSAR in Maastricht. I had never been to Maastricht University yet, and this group, with their Open Source and Open Data projects, particularly WikiPathways, would give us enough to talk about. Chris had a position on the Open PHACTS project open. I was interested, applied, and ended up in the European WikiPathways group led by Martina Kutmon (the USA node is the group of Alex Pico).

Fast forward to now. It was clear to me that biological text book knowledge was unusable for any kind of computation or machine learning. It was hidden, wrongly represented, and horribly badly annotated. In fact, it still is a total mess. WikiPathways offered machine readable text book knowledge. Just what I needed to link the chemical and biological worlds. The more accurate biological annotation we put in these pathways, or semantically link to these pathways, the more precise our knowledge becomes and the better computational approaches can find and learn patterns not obvious to the human eye (it goes both ways, of course! Just read my PhD thesis.)

Over the past 5-6 years I got more and more involved in the project. Our Open PHACTS tasks did involve WikiPathways RDF (doi:10.1371/journal.pcbi.1004989), but Andra Waagmeester (now Micelio) was the lead on that. I focused on the Identifier Mapping Service, based on BridgeDb (together with great work from Carole Goble's lab, e.g. Alasdair and Christian). I focused on metabolomics.

Metabolomics
Indeed, there was plenty to be done in terms of metabolic pathways in WikiPathways. The current database had a strong focus on the genetics and proteins aspects of the pathways. In fact, many metabolites were not datanodes and therefore did not have identifiers. And without identifiers, we cannot map metabolomics data to these pathways. I started working on improving these pathways, and we did some projects using it for metabolomics data (e.g. a DTL Hotel Call project led by Lars Eijssen).

The point of this long introductions is, I am standing on the shoulders of giants. The top right figure shows, besides WikiPathways itself, and the people I just mentioned, more giants. This includes Wikidata, which we previously envisioned as hub of metabolite information (see our Enabling Open Science: Wikidata for Research (Wiki4R) proposal). Wikidata allows me to solve the problem that CAS registry numbers are hard to link to chemical structures (SMILES): it has some 70 thousand CAS numbers.


SPARQL query that lists all CAS registry numbers in Wikidata, along with the matching
SMILES (canonical and isomeric), database entry, and name of the compound. Try it.
A lot more about CAS registry numbers is found in my blog.
Finally, but certainly not least, is Denise Slenter, who started this spring in our group. She picked up things I and others were doing very quickly (for example this great work from Maastricht Science Programme students), gave those her own twist, and is now leading the practical work in taking this to the next level. This new WikiPathways paper shows the fruits of her work.

Metabolism
Of course, there are plenty of other pathways database. KEGG is still the gold standard for many. And there is the great work of Reactome, RECON, and many others (see references in the NAR article). Not to mention the important resources that integrate pathways resources. To me, unique strengths of WikiPathways include the community approach, very liberal licence (CCZero), many collaborations (do we have a slide on that?), and, importantly, its expressiveness. The latter allows our group to do the systems biology work that we do, analyzing microRNA/RNASeq data, studying diseases at a molecular interaction level, see the effects of personal genetics (SNPs, GWAS), and visually integrate and summarize the combination of experimental data and text book knowledge.

OK, this post is now already long enough. And seeing from the length, you can see how much I am impressed with WikiPathways and where it goes. Clearly, there is still a lot left to do. And I am just another person contributing to the project and honored that we could give this WikiPathways paper a metabolomics spin. HT to Alex, Tina, and Chris for that!

Slenter, D. N., Kutmon, M., Hanspers, K., Riutta, A., Windsor, J., Nunes, N., Mélius, J., Cirillo, E., Coort, S. L., Digles, D., Ehrhart, F., Giesbertz, P., Kalafati, M., Martens, M., Miller, R., Nishida, K., Rieswijk, L., Waagmeester, A., Eijssen, L. M. T., Evelo, C. T., Pico, A. R., Willighagen, E. L., Nov. 2017. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Research. http://dx.doi.org/10.1093/nar/gkx1064