Sunday, February 14, 2016

Aggregating data on nanomaterials: eNanoMapper is getting closer to critical mass

Nanosafety data for silver nanoparticles
in data.enanomapper.net visualized
with ambit.js and d3.js.
The last three weeks featured two meetings around data infrastructures for the NanoSafety Cluster. The first meeting was on January 25-26 in Brussels, and last week the eNanoMapper project held its second year meeting with a subsequent workshop in Basel (see the program with links to course material). Here are some personal reflections on these meetings, and some source code updates based on the Basel workshop in particular.

For the workshop in Basel I extended previous work on JavaScript and R client code for the eNanoMapper API (which I wrote about previously; see also doi:10.3762/bjnano.6.165).

JavaScript
Not much changed for ambit.js (see these two posts): I only added a method to search nanomaterials by chemistry rather than by name, as was possible before (release 0.0.3 is pending). That is, given a compound URI, you can now list all substances with this URI, using the listForCompound() function:

var searcher = new Ambit.Substance(
  "https://apps.ideaconsult.net/enanomapper"
);

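// URI of the compound to look up (hardcoded here for illustration)
var compound =
  "https://apps.ideaconsult.net/enanomapper/compound/71/conformer/71";
// processList is the callback that receives the matching substances
searcher.listForCompound(compound, processList);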

You may wonder how to get the URI used in this code. Indeed, one would rather not hardcode it, as it may be different in other eNanoMapper data warehouse instances. This is where another corner of the eNanoMapper API comes in, which is wrapped by the Compound.search() method. However, I have to play more with this method before I encourage you to use it. For example, this method returns a list of compounds matching the search. So, how do we search for fullerene particles?

R package
The renm package for R was demonstrated in Basel too, and some 25% of the audience uses R in their research. The README.md has some updated examples, including this one to list all nanomaterials from a PNAS paper (doi:10.1073/pnas.0802878105):

library(renm)
substances <- listSubstances(
    service="http://data.enanomapper.net/",
    search="10.1073/pnas.0802878105", type="citation"
)

The 0.0.3 release, made just in time for the workshop, fixed a few minor issues. The above JavaScript example cannot be repeated in R yet, but this is scheduled for the next release.

Data quality
For a few materials I have now created summary pages. These should really be considered demonstrations of what a database with an API has to offer, but it seems that for some materials we are slowly going towards critical mass. Better, it shows nicely what advantages data integration has: the data for silver materials comes from three different data sources, aggregated in the data.enanomapper.net instance. However, if you look at the above code, it is easy to see how it could pull in data from multiple instances. For example, here are LDH release assay results for two of the JRC Representative Materials:


This, of course, takes advantage of the common language that the eNanoMapper ontology provides (doi:10.1186/s13326-015-0005-5). This ontology is now available from BioPortal, Aber-OWL, and the Ontology Lookup Service (via their great beta). Huge thanks to these projects for their work on making ontologies accessible!
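To make the point about pulling in data from multiple instances a bit more concrete, here is a minimal sketch of combining substance lists that were already retrieved from several instances. The record shape (a URI and a name field), the function name mergeSubstanceLists and the example records are assumptions for illustration only; the real eNanoMapper JSON may look different.

```javascript
// Assumption: each instance returned an array of records with at least
// a URI and a name; the real eNanoMapper JSON may use other field names.
function mergeSubstanceLists(listsByInstance) {
  var merged = [];
  Object.keys(listsByInstance).forEach(function (instance) {
    listsByInstance[instance].forEach(function (substance) {
      // keep track of which instance each record came from
      merged.push({
        instance: instance,
        uri: substance.URI,
        name: substance.name
      });
    });
  });
  return merged;
}

// Made-up example records for the two instances mentioned in this post.
var exampleLists = {
  "https://data.enanomapper.net": [
    { URI: "https://data.enanomapper.net/substance/example-1", name: "Ag NP (example)" }
  ],
  "https://apps.ideaconsult.net/enanomapper": [
    { URI: "https://apps.ideaconsult.net/enanomapper/substance/example-2", name: "Ag NP (example)" }
  ]
};
var mergedSubstances = mergeSubstanceLists(exampleLists);
console.log(mergedSubstances.length + " substances in the merged list");
```

Keeping the instance with each record means the origin of the data stays visible after merging, which matters once the sources differ in quality.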

But there is a long way to go. Many people in Europe and the U.S.A. are working on the many aspects of data quality. I would not say that all data we aggregated so far is of high quality; that is, it somewhat depends on the use case. The NanoWiki data that I have been aggregating (see release 2 on Figshare, doi:m9.figshare.2075347.v1) serves several goals, and its quality varies with the goal. For example, one goal is to index nanosafety research (e.g. give me all bioassays for TiO2), in which case it is left to the user to read the discovered literature. Another goal is to serve NanoQSAR work, where I focused on accurately describing the chemistry, but recorded varying amounts of detail on the bioassays (e.g. is there a size dependency for cytotoxicity).

There is a lot of discussion on data quality, as there was two years ago. I am personally of the opinion that eNanoMapper cannot solve the question of data quality. That ultimately depends on the projects recording and disseminating the data. Instead, eNanoMapper (like any other database) is just the messenger. In fact, the more people complain about the data quality, the better the system managed to communicate the lack of detail. Of course, it is critical to compare this to the current situation, publications in journals, and it seems to me we are well on our way to improving on the dissemination of data via journal articles.

Basel
Oh, and the view from my room in the Merian Hotel was brilliant!


Wednesday, January 27, 2016

Adding chemical compounds to Wikidata

(write up in progress)

Adding chemical compounds to Wikidata is not difficult. You can store the chemical formula (P274), (canonical) SMILES (P233), InChIKey (P235) (and InChI (P234), of course), as well as various database identifiers (see what I wrote about that here). It also allows storing the provenance, and has predicates for that too.

So, to enter a new structure for a compound, you should enter the compound information into Wikidata. Of course, make sure to create the needed accounts, particularly one for Wikidata (create account) (not sure if the next steps need a more general Wikimedia account too).

Entering the research paper
Magnus Manske pointed me to this helper tool. If you have the DOI of the paper, it is easy to add a new paper. This is what the tool shows for doi:10.1128/AAC.01148-08 (but no longer when you try!):


You need permission to run this script; the tool will alert you about that and give instructions on how to get permission. After I clicked Open in QuickStatements I got this output, showing me that an entry in Wikidata was created for this paper:


Later, I can use the new Q-code (Q22309806) as the source for statements about the compound (formula, etc).

Draw your compound and get an InChIKey
The next step is to draw a compound and get an InChIKey. This can be done with many tools, including Bioclipse. Rajarshi opted for alternatives:

Then check whether the compound is already in Wikidata. You can use this SPARQL query for that, using the InChIKey of the compound (it's for acetic acid, so it will be found):


For convenience, here is the copy/pastable SPARQL:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound WHERE {
      ?compound wdt:P235 "QTBSBXVTEAMEQO-UHFFFAOYSA-N" .
    }
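
If you want to run this check from code rather than from the query editor, the query can be generated for any InChIKey. The following is a minimal sketch, not part of any existing tool: the function names are made up for illustration, and it assumes the public Wikidata Query Service endpoint at query.wikidata.org. It only builds the query and the request URL; actually fetching the results is left out.

```javascript
// Assumption: the public Wikidata Query Service endpoint.
var WDQS_ENDPOINT = "https://query.wikidata.org/sparql";

// InChIKeys consist of 14 + 10 + 1 uppercase letters, separated by hyphens.
var INCHIKEY_PATTERN = /^[A-Z]{14}-[A-Z]{10}-[A-Z]$/;

// Build the SPARQL query that looks up a compound by InChIKey (P235).
function buildInChIKeyQuery(inchikey) {
  if (!INCHIKEY_PATTERN.test(inchikey)) {
    throw new Error("Not a valid InChIKey: " + inchikey);
  }
  return [
    "PREFIX wdt: <http://www.wikidata.org/prop/direct/>",
    "SELECT ?compound WHERE {",
    "  ?compound wdt:P235 \"" + inchikey + "\" .",
    "}"
  ].join("\n");
}

// Build a GET URL that asks the endpoint for JSON results.
function buildQueryUrl(inchikey) {
  return WDQS_ENDPOINT + "?format=json&query=" +
    encodeURIComponent(buildInChIKeyQuery(inchikey));
}

// The acetic acid InChIKey from the query above.
console.log(buildQueryUrl("QTBSBXVTEAMEQO-UHFFFAOYSA-N"));
```

Validating the InChIKey first avoids sending a query with a typo in the key and then wrongly concluding that the compound is missing.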
    
Entering the compound
Assuming the compound is not yet in Wikidata, it is time to add it. The minimal information you should provide is the following:
  • mark the new entry as 'instance of' (P) 'chemical compound' (Q)
  • the chemical formula and SMILES (use as reference the paper)
    • add the reference to the paper you entered above
  • add the InChIKey and/or InChI
The first step is to create a new Wikidata entry. The Create new item menu in the left side panel can be used, showing a page like this:


As a label you can use the name used in the paper for the compound, even if a code, and as description 'chemical compound' will do for now; it can be changed later.
Feel free to add as much information about the compound as you can find. There are some chemically rich entries in Wikidata, such as that for acetic acid (Q47512).
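
The minimal statements listed in this section can also be written down as QuickStatements commands instead of being entered by hand. Below is a rough sketch, assuming the tab-separated QuickStatements syntax (CREATE, LAST, Len/Den for label and description, and S248 for a 'stated in' source). The function name and the shape of the compound object are made up for illustration, and the use of P31 with Q11173 for 'instance of' 'chemical compound' is my reading of the first bullet. The acetic acid values are only there to show the format, since that compound is already in Wikidata and is not from the paper entered above.

```javascript
// Sketch: turn minimal compound information into QuickStatements commands.
// Assumptions: P31/Q11173 = instance of/chemical compound, S248 = stated in.
function toQuickStatements(compound, paperQid) {
  var source = "\tS248\t" + paperQid;
  var quote = function (text) { return '"' + text + '"'; };
  return [
    "CREATE",
    "LAST\tLen\t" + quote(compound.label),
    "LAST\tDen\t" + quote("chemical compound"),
    "LAST\tP31\tQ11173",
    // formula and SMILES carry the paper as reference
    "LAST\tP274\t" + quote(compound.formula) + source,
    "LAST\tP233\t" + quote(compound.smiles) + source,
    "LAST\tP235\t" + quote(compound.inchikey)
  ].join("\n");
}

// Illustrative values only; acetic acid already exists in Wikidata.
var commands = toQuickStatements({
  label: "acetic acid",
  formula: "C2H4O2",
  smiles: "CC(=O)O",
  inchikey: "QTBSBXVTEAMEQO-UHFFFAOYSA-N"
}, "Q22309806");
console.log(commands);
```

The output is meant to be reviewed before pasting it into QuickStatements; in particular, check that the paper Q-code really is the source of the formula and SMILES.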

Wednesday, January 13, 2016

Publishing H2020 Proposals

Figure from the RIO paper.
Over a year ago Daniel Mietchen invited me to join in writing an H2020 proposal around Open Science. Well, that combines two of my current worlds, so interesting indeed. But there was more: Daniel wanted to do the writing openly, and that was certainly new to me. But since I see piles of benefits in open science, this is sort of the next step. Not obvious, perhaps, but certainly a step I wanted to try.

The proposal that resulted from this was "Enabling Open Science: Wikidata for Research (Wiki4R)", as said, led by Daniel Mietchen. It was drafted fully in the open, and we got a lot of feedback from people not involved in the anticipated consortium. Of course, we did not get it; you would have heard from me about it earlier if we had.

An open license is, of course, part of the open writing, to ensure everyone who participates has equal IP on the proposal. (Some seem to forget that an Open Access license does not give away your IP; you're just licensing it!) The final proposal was posted on ZENODO (see below) just after submission. Some weeks back, Daniel also submitted it to the Research Ideas and Outcomes journal (ISSN 2367-7163), which, of course, the open license allows too! This new journal covers not just the end product of some research (a research paper), but also other things, including project proposals (full reference below). Mind you, not everything in this "journal" is peer-reviewed before publication, and the proposal was indeed not reviewed. Post-publication review is most welcome, BTW! Just head off to PubPeer or Publons and start ranting about the proposal ;)

Now, the journal seems to have blogged about this H2020 proposal publication - Daniel is involved in setting up the journal - and sent it out as a press release-like thing, which is actually being picked up by news outlets :) That's new to me too.

All in all, it's an interesting experiment, and I am grateful to Daniel for having been able to be part of this. Writing H2020 proposals openly is a new phenomenon, and I cannot commit myself to use this approach for all my proposals, but I think I may do this more often in the future.

Mietchen, D., Hagedorn, G., Willighagen, E., Rico, M., Gomez-Perez, A., Aibar, E., Rafes, K., Germain, C., Dunning, A., Pintscher, L., Kinzler, D., Anonymous, Jan. 2015. Enabling open science: Wikidata for research. http://dx.doi.org/10.5281/zenodo.13906
Mietchen, D., Hagedorn, G., Willighagen, E., Rico, M., Gómez-Pérez, A., Aibar, E., Rafes, K., Germain, C., Dunning, A., Pintscher, L., Kinzler, D., Dec. 2015. Enabling open science: Wikidata for research (Wiki4R). Research Ideas and Outcomes 1, e7573+. http://dx.doi.org/10.3897/rio.1.e7573

Sunday, January 03, 2016

ELIXIR is setting up a Tools and Data Services Registry

ELIXIR is setting up a Tools and Data Services Registry. Recently, they organized a workshop in Amsterdam that I attended and where I learned how to add tools and services to their database. I played with the entry for WikiPathways, and one of the nice things is that it inherits from past European registry projects and allows the encoding of the input and output formats, for tools and services alike. Here's what it gives for WikiPathways now:


The record editing facility is pretty straightforward and uses a number of tabs where you can add information.

A summary:

The publications:

  


Where documentation is found:


And information that is not really supplementary, such as the license terms:


Here, the collections are of particular interest. During the meeting, a few people from the Dutch Techcenter for Life Sciences decided to use an ELIXIR-NL group for all Dutch services that benefit the full ELIXIR network. Furthermore, the BIGCAT-UM collection was set up to indicate all services by our research group, which may eventually serve as a folder towards supporting the Dutch ZonMW Enabling Technologies Hotels calls.

Mind you, the registry can distinguish various services. The above entry is for the web interface, not for the web services. That entry in the registry is not that well populated yet, and that's for a reason. (Actually, more than one, one being that I did not create that entry and cannot change it).

But the WikiPathways Webservices are nicely exposed via a Swagger configuration file. Moreover, the registry supports JSON too, for both export and import. The format is pretty simple and we only need to create a Swagger 2.0 config file converter. I just need to find a bit of time to finish my draft implementation.
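
To give an idea of what such a converter involves, here is a minimal sketch; it is not the draft implementation mentioned above. The input side uses standard Swagger 2.0 fields (info, host, basePath, schemes, paths). The output field names (name, description, version, homepage, operations) are purely hypothetical and would have to be replaced by whatever the registry JSON format actually expects. The example Swagger document is also made up.

```javascript
// HTTP methods that Swagger 2.0 allows as keys of a path item.
var HTTP_METHODS = ["get", "put", "post", "delete", "options", "head", "patch"];

// Sketch: summarize a Swagger 2.0 document as a draft registry record.
// Assumption: the output field names are hypothetical, not the registry schema.
function swaggerToRegistryDraft(swagger) {
  var scheme = (swagger.schemes && swagger.schemes[0]) || "http";
  var basePath = (swagger.basePath || "").replace(/\/$/, "");
  var baseUrl = scheme + "://" + swagger.host + basePath;
  var operations = [];
  Object.keys(swagger.paths || {}).forEach(function (path) {
    Object.keys(swagger.paths[path]).forEach(function (key) {
      // skip non-method keys such as "parameters"
      if (HTTP_METHODS.indexOf(key) === -1) { return; }
      operations.push({
        method: key.toUpperCase(),
        url: baseUrl + path,
        summary: swagger.paths[path][key].summary || ""
      });
    });
  });
  return {
    name: swagger.info.title,
    description: swagger.info.description || "",
    version: swagger.info.version,
    homepage: baseUrl,
    operations: operations
  };
}

// Made-up Swagger 2.0 document, for illustration only.
var draft = swaggerToRegistryDraft({
  swagger: "2.0",
  info: { title: "Example web service", version: "1.0" },
  host: "example.org",
  basePath: "/api",
  schemes: ["https"],
  paths: {
    "/listItems": { get: { summary: "List all items" }, parameters: [] }
  }
});
console.log(JSON.stringify(draft, null, 2));
```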

Open Spectral Database

Stuart Chalk wrote on the CHMINF-L mailing list about the Open Spectral Database (OSDB). This new database is more of an idea than something with critical mass yet. But the idea seems right: it has a CCZero waiver for the data, is Open Source (see github.com/stuchalk/OSDB), and has an API. The web interface looks good too:


It supports various spectral types and maybe it can be seeded with data from one of the Massbank instances. That said, it does seem popular enough to already attract some spamming in the collections corner; that also means it needs curators that keep an eye on what enters. Perhaps registration via ORCID may be an option to fight spam, but I do not have experience with setting that up. Another feature request I can think of is links out to Wikidata, in addition to the existing three databases.

Now I really have a good reason to dig out my past NMRShiftDB contributions and submit them here (see also these past blog posts about NMRShiftDB).