Tuesday, May 21, 2019

Scholia: an open source platform around open data

Some 2.5 years ago Finn Nielsen started Scholia. I have been blogging about it a few times, and thanks to Finn, Lane Rasberry, and Daniel Mietchen, we were awarded a grant by the Alfred P. Sloan Foundation to continue working on it (grant: G-2019-11458). I'll tweet more about how it fits the infrastructure to support our core research lines, but for now just want to mention that we published the full proposal in RIO Journal.

Oh, just as a teaser and clickbait, here's one of the use cases. dissemination of knowledge of metabolites and chemicals in general (full poster):

Saturday, May 18, 2019

LIPID MAPS: mass spectra and species annotation from Wikidata

 Part of the LIPID MAPS classificationscheme in Wikidata (try it).
A bit over a week I attended LIPID MAPS Working Group meeting in Cambridge, as I have become member of the Working Group 2: Tools and Technical Committee in autumn. That followed a fruitful effort by Eoin Fahy to make several LIPID MAPS pathways available in WikiPathways (see this Lipids Portal), e.g. the Omega-3/Omega-6 FA synthesis pathway. It was a great pleasure to attend the meeting, meet everyone, and I learned a lot about the internals of the LIPID MAPS project.

I showed them how we contribute to WikiPathways, particularly in the area of lipids. Denise Slenter and I have been working on having more identifier mappings in Wikidata, among which the lipids. Some results of that work was part of this presentation. One of the nice things about Wikidata is that you can make live Venn diagrams, e.g. compounds in LIPID MAPS for which Wikidata also has a statement about which species it is found in (try it):

SELECT ?lipid ?lipidLabel ?lmid ?species ?speciesLabel
?source ?sourceLabel ?doi
WHERE {
?lipid wdt:P2063 ?lmid ;
p:P703 ?speciesStatement .
OPTIONAL {
?speciesStatement prov:wasDerivedFrom/pr:P248 ?source ;
ps:P703 ?species .
OPTIONAL { ?source wdt:P356 ?doi }
}
SERVICE wikibase:label {
bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}
}


A second query searches lipids for which also mass spectra are found in MassBank (try it):

SELECT
?lipid ?lipidLabel ?lmid
(GROUP_CONCAT(DISTINCT ?massbanks) as ?massbank)
WHERE {
?lipid wdt:P2063 ?lmid ;
wdt:P6689 ?massbanks .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".
}
} GROUP BY ?lipid ?lipidLabel ?lmid


Enjoy!

Saturday, May 04, 2019

Wikidata, CompTox Chemistry Dashboard, and the DSSTOX substance identifier

The US EPA published a paper recently about the CompTox Chemistry Dashboard (doi:10.1186/s13321-017-0247-6). Some time ago I worked with Antony Williams and we proposed a matching Wikidata identifier. When it was accepted, I used a InChIKey-DSSTOX identifier mapping data sets by Antony (doi:10.6084/M9.FIGSHARE.3578313.V1) to populate Wikidata with links. Overtime, when more InChIKeys were found in Wikidata, I would use this script to add additional mappings. That resulted in this growth graph:

 Source: Wikidata.
Now, about a week ago Antony informed me he worked with someone of Wikipedia to have the DSSTOX automatically show up in the ChemBox, which I find awesome. It's cool to see your work on about 38 thousand (!) Wikipedia pages :)
 Part of the ChemBox of perfluorooctanoic acid.
(I'm making the assumption that all 38 thousand Wikidata pages for chemicals have Wikipedia equivalents, which may be a false assumption.)

Wednesday, April 24, 2019

Open Notebook Science: the version control approach

Jean-Claude Bradley pitched the idea of Open Notebook Science, or Open-notebook science as the proper spelling seems to be. I have used notebooks a lot, but ever since I went digital, the use went down. During my PhD studies I still extensively used them. But in the process, I changed my approach. Influenced by open source practices.

After all, open source has had a long history of version control, where commit messages explain the reason why some change was made. And people that ever looked at my commits, know that my commits tend to be small. And know that my messages describe the purpose of some commit.

That is my open notebook. It is essential to record why a certain change was made and what exactly that change was. Trivial with version control. Mind you, version control is not limited to source code. Using the right approaches, data and writing can easily be tracked with version control too. Just check, for example, my GitHub profile. You will find journal articles been written, data collected, just as if they were equal research outputs (they are).

Another great example of version control for writing and data is provided by Wikipedia and Wikidata. Now, some changes I found hard to track there: when I asked the SourceMD tool (great work by Magnus Manske) to create items for books, I want to see the changes made. The tool did link to the revisions made at some point, but this service integration seems to break down now and then. Then I realized that I could use the EditGroups tool directly (HT to who wrote that), and found this specific page for my edits, which includes not just those via SourceMD but also all edits I made via QuickStatements (also by Magnus):

If only I could give a "commit message" which each QuickStatements job I run. Can I?

Saturday, April 13, 2019

Bioclipse on the command line

 Screenshot of Bioclipse 2.
Over the past seven years there has been a lot of chemistry interoperability work I have done using Bioclipse (doi:10.1186/1471-2105-8-59, doi:10.1186/1471-2105-10-397). The code is based on Eclipse, which gives a great GUI experience, but also turned out hard to maintain. Possibly, that was because of a second neat feature, that you could plugin libraries into the Python, JavaScript, and Groovy scripting environment which allows people to automate things in Bioclipse. Over the course of time, so many libraries have been integrated, making so many scientific toolkit available at the tip of your fingers. Of the three programming languages, I have used Groovy the most, being close to the Java language, but with with a lot of syntactic goodies.

In fact, I have blogged about the scripts I wrote on my occasions and in 2015 I wrote up a few blog posts on how to install new extensions:

But publishing and installing new Bioclipse 2.6.2 extension remained complicated (installing Bioclipse itself it quite trivial). And that while the scripts are so useful, and I need others to start using them. I do not scale. Second, when I cite these scripts, they were too hard to use by reviewers and readers. To get some idea of a small subset of the functionality, read our book A lot of Bioclipse Scripting Language examples.

So, last x-mas I set out with the wish to be able to have others much more easily run my scripts and, second, be able to run them from the command line. To achieve that, installing and particularly publishing Bioclipse extensions had to become much easier. Maybe as easy of Groovy just Grab-bing the dependencies from the script itself. So, Bioclipse available from Maven Central, or so.

Of course, this approach would likely loose a lot of wonderful functionality, like the graphical UX, the plugin system, the language injection, and likely more. So, one important requirements was that any script using the command line must be identical to the script in Bioclipse itself. Well, with a few permissible exceptions: we are allowed to inject the Bioclipse managers manually.

Well, of course, I would not have been blogging this had I not succeeded to reach these goals in some way. Indeed, following up from a wonderful metaRbolomics meeting organized by de.NBI (~ ELIXIR Germany), and the powerful plans discussed with Emma Schymanski (and some ongoing work of persistent toxicants), and, fairly, actually not drowning in failed deadlines, just regularly way behind deadlines, and since I have a research line to run, I dived into hackmode. In some 14 hours, mostly in the evening hours of the past two days, I got a proof of principle up and running. The name is a reference to all the wonderful linguistic fun we had when I worked in Uppsala, thanks to Carl Mäsak, e.g. discussing the term Bioclipse Scripting Language and Perl 6.

It is not available yet from Maven Central, so there is a manual mvn clean install involved at this moment, but after that (the command installs it in your local Maven repository which will be recognized by Groovy), you can get started with something like (I marked in blue to extra sugar needed on the command line; the black code runs as is in Bioclipse 2.6.2):

@Grab(
group='net.bioclipse.managers',
module='bioclipse-cdk',
version='0.0.1-SNAPSHOT'
)

workspaceRoot = "."
def cdk = new net.bioclipse.managers.CDKManager(workspaceRoot);

list = cdk.createMoleculeList()
println list
println cdk.fromSMILES("COC")

What now?
In the future, once it is available on Maven Central, you will be able to skip the local install command, and @Grab will just fetch things from that online repository. I will be tagging version 0.0.1 today, as I got my important script running that takes one or more SMILES strings, checks Wikidata, and makes QuickStatements to add missing chemicals. The first time you've (maybe) seen that, was three years ago, in this blog post.

You may wonder: why?? I asked myself the same thing, but there are a few things over the past 24 hours that I could answer and which may sketch where this is going:

1. that BSL book can actually show running the code and show the output in the book, just like with my CDK book;
2. maybe we can use Bioclipse managers in Nextflow;
3. Bioclipse offers interoperability layers, allowing me to pass a chemical structure from one Java library to another (e.g. from the CDK to Jmol to JOELib);
4. it allows me to update library versions without having to rebuild a full new Bioclipse stack (I'm already technically unable, let alone timewise unable);
5. I can start sharing Bioclipse scripts with articles that people can actually run; and,
6. all scripts are compatible, and all extensions I make can be easily copied into the main Bioclipse repository, if there ever will be a next major Bioclipse version (seems unlike now).

Now, it's just being patient and migrating manager by manager. It may be possible to use the the existing manager code, but that comes with so much language injection, that I decided to just take advantage of Open Science and just copy/paste the code. Most of the code is the same, minus progress monitors, and replacing Eclipse IFile code with regular Java code. But there are tons of managers, and reaching even 50% coverage will take, at the speed I can offer, months. Therefore, I'll focus on scripts I share with others, focus on reuse and reproducibility.

More soon!