Thursday, July 02, 2020

Bioclipse git experiences #2: Create patches for individual plugins/features

Carrying around git patches is hard work.
Source: Auckland War Memorial Museum, CC-SA.
This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring version control repositories while keeping the history. For example, you may want to copy or move code from one repository to another. A second use case is a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but I regularly apply the approach in other situations.

This second post talks about how to migrate code from one repository to another.

Create patches for individual plugins/features

While stripping away unwanted plugins (the approach of the first post) works pretty well, patch sets are a good alternative when you only need a repository-with-history for a few plugins.
  • first, initialize a new git repository, e.g. bioclipse.rdf:
 mkdir bioclipse.rdf
 cd bioclipse.rdf
 git init
 nano README
 git add README
 git commit -m "Added README with some basic info about the new repository" README
  • then, for each plugin you need, discover the commit in which the plugin was first committed, using the git-svn repository created earlier:
 cd your.gitsvn.checkout
 git log --pretty=oneline externals/com.hp.hpl.jena/ | tail -1
  • then create patches for everything since the tree just before that first commit, by appending '^1' to the commit hash. For example, the first commit of the Jena libraries was 06d0eb0542377f958d06892860ea3363e3316389, so I type:
 rm 00*.patch
 git format-patch 06d0eb0542377f958d06892860ea3363e3316389^1 -- externals/com.hp.hpl.jena
(tune the filter when removing old patches if there are more than 99!)
The previous two steps can be combined into a Perl script:
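# gfp: create patches for all commits that touched one plugin or feature
use diagnostics;
use strict;

my $plugin = $ARGV[0];

if (!$plugin) {
  print "Syntax: gfp <plugin|feature>\n";
  exit 1;
}

die "Cannot find plugin or feature $plugin !" if (!(-e $plugin));

# remove patches left over from a previous run
`rm -f *.patch`;
# the last line of the log is the oldest commit, i.e. the one that added the plugin
my $hash = `git log --follow --pretty=oneline $plugin | tail -1 | cut -d' ' -f1`;
$hash =~ s/\n|\r//g;

print "Plugin: $plugin \n";
print "Hash: $hash \n";
# the parent of that commit is the starting point, so the first commit is included
`git format-patch $hash^1 -- $plugin`;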
  • move these patches into your new repository:
 mv 00*.patch ../bioclipse.rdf
(tune the filter when moving the patches if there are more than 99! Also customize the target folder name to match your situation)
  • apply the new patches in your new git repository:
 cd ../bioclipse.rdf
 git am 00*.patch
(You're on your own if that fails... and you may have to default to the other alternative then)
  • repeat those two steps for all plugins you want in your new repository
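When several plugins have to be migrated, the patch creation step above can be wrapped in a small shell helper. The following is only a minimal sketch: the function names first_commit_hash and export_plugin_patches, and the idea of writing each plugin's patches into its own directory, are my assumptions and were not part of the original workflow. Like the Perl script, it assumes the plugin was not added in the very first commit of the repository, because that commit has no parent for '^1' to point at.

```shell
#!/bin/sh
# Sketch only: the function names and the per-plugin output directory
# are assumptions made for illustration.

# Read `git log --pretty=oneline` output on stdin and print the hash on the
# last line, which is the oldest commit that touched the path.
first_commit_hash() {
  tail -n 1 | cut -d' ' -f1
}

# Write all patches touching one plugin into its own directory, so that
# patches of different plugins do not overwrite each other.
# Usage: export_plugin_patches <plugin-path> <output-dir>
export_plugin_patches() {
  plugin=$1
  outdir=$2
  hash=$(git log --follow --pretty=oneline -- "$plugin" | first_commit_hash)
  [ -n "$hash" ] || { echo "No history found for $plugin" >&2; return 1; }
  mkdir -p "$outdir"
  # "$hash^1" is the parent of the first commit, so that commit is included
  git format-patch -o "$outdir" "$hash^1" -- "$plugin"
}

# Example call, run from inside the git-svn checkout:
#   export_plugin_patches externals/com.hp.hpl.jena ../patches/jena
# and then, in the new repository:
#   git am ../patches/jena/*.patch
```

Keeping one directory per plugin also avoids having to tune the 00*.patch filter by hand, since every patch in that directory belongs to the same plugin.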

Bioclipse git experiences #1: Strip away unwanted plugins

This is a series of two posts repeating some content I wrote up back in the Bioclipse days (see also this Scholia page). They both deal with something we were facing: restructuring version control repositories while keeping the history. For example, you may want to copy or move code from one repository to another. A second use case is a file that must be removed (there are valid reasons for that). Because these posts are based on Bioclipse work, there will be some specific terminology, but I regularly apply the approach in other situations.

For this first post, think of a plugin as a subfolder, though it also applies to individual files.

Strip away unwanted plugins

  • then you remove everything you do not want in your new git repository. Do:
 git clone --bare --no-hardlinks old.local.clone/ new.local.clone/
then use:
 git filter-branch --index-filter 'git rm -r -q --cached --ignore-unmatch plugins/net.bioclipse.actionHistory plugins/net.bioclipse.analysis' HEAD
It often happens that you need to run the above command several times, in cases when there are many subdirectories to be removed.
Once you have removed everything you need removed, you can clean up the repository and reduce its size considerably with:
 git repack -ad; git prune
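With many subdirectories to remove, typing the filter by hand gets error-prone. Here is a minimal sketch of a helper that builds the index-filter command from a list of paths. The function name build_index_filter is hypothetical, and the sketch assumes paths without spaces or quotes.

```shell
#!/bin/sh
# Sketch only: build_index_filter is a made-up helper name.

# Print a `git rm` command, suitable for --index-filter, that removes every
# path given as an argument. Paths with spaces or quotes are not supported.
build_index_filter() {
  printf 'git rm -r -q --cached --ignore-unmatch'
  for dir in "$@"; do
    printf ' %s' "$dir"
  done
  printf '\n'
}

# Example call, run inside new.local.clone:
#   git filter-branch --index-filter \
#     "$(build_index_filter plugins/net.bioclipse.actionHistory plugins/net.bioclipse.analysis)" HEAD
```

As far as I know, a second run of git filter-branch in the same repository refuses to overwrite the backup it left in refs/original unless you pass -f, so keep that in mind when you run the command several times.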

Thursday, May 07, 2020

new project: "COVID-19 Disease Maps"

Project logo by Marek Ostaszewski.
Already started a few weeks ago, but the COVID-19 Disease Maps project now has a sketch published, outlining the ambitions: COVID-19 Disease Map, building a computational repository of SARS-CoV-2 virus-host interaction mechanisms (doi:10.1038/s41597-020-0477-8).

I've been focusing on the experimental knowledge we have about the components of the SARS-CoV-2 virion and how they interact with the human cell. I'm at least two weeks behind on reading literature, but hope to catch up a bit this week. The following diagram shows one of the pathways on the WikiPathways COVID-19 Portal:

wikipathways:WP4846, CC0
This has led to collaborations with Andra Waagmeester, Jasper Koehorst and others, resulting in this preprint that needs some tweaking before submission; to an awesome collaboration with Birgit Meldal around the Complex Portal (preprint pending); and to a Japanese translation of a book around a number of search queries against Wikidata (CC-BY/CC0). The latter two were started at the recent online BioHackathon.

Oh, boy, do I love Open Science.

Monday, April 27, 2020

new paper: "NanoSolveIT Project: Driving nanoinformatics research to develop innovative and integrated tools for in silico nanosafety assessment"

Fig. 1. Schematic overview of the workflow for toxicogenomics
modelling and how these models feed into the subsequent
materials modelling and IATA. Open Access.
NanoSolveIT is an H2020 project that started last year. Our BiGCaT group is involved in the data integration that supports the systems biology part of the Integrated Approaches to Testing and Assessment (IATA) for engineered nanomaterials in Work Package 1. This paper gives an overview of the project, the work, and the goals.

Of course, doing this is not trivial at all. And we have to bridge a lot of different research data, concepts, etc. As such, it is clear how it relates to the other nanosafety projects we have been involved in, such as eNanoMapper, NanoCommons, and RiskGONE.

Sunday, March 29, 2020

Tackling SARS-CoV-2 with big data

This blog post contains a translation I made of the short "our story" piece "Coronavirus te lijf met big data" on the MUMC+ website, written by André Leblanc. The Maastricht University Medical Center+ (MUMC+) is a collaboration of our Maastricht University Faculty of Health, Medicine and Life Sciences, of which our BiGCaT research group is part.

Wikidata is a community project and I only use and contribute to it. Scholia is a project started by Finn Nielsen (Technical University of Denmark - DTU), and now has funding from the Alfred P. Sloan Foundation, coordinated by Daniel Mietchen and Lane Rasberry (University of Virginia). Further acknowledgements to Andra Waagmeester (Micelio) and Jasper Koehorst (Wageningen University) for a great collaboration on corona virus information (see also Wikidata:WikiProject_COVID-19). Thanks also go to WikiPathways colleagues including, of course, Prof. Chris Evelo and Dr. Martina Kutmon in Maastricht, but also Dr. Alex Pico and others in San Francisco. For me it was one of the selling points of the research group when I joined in 2012.

Tackling the corona virus with big data

Scholars around the world are working relentlessly on the development of a vaccine against the new SARS-CoV-2 coronavirus. Chemist and assistant professor Egon Willighagen contributes in collaboration with colleagues at the BiGCaT Department of Bioinformatics in Maastricht to make data and knowledge easier to find for other scholars. How does that work?

Big data is the new buzzword in the scholarly community. For example, collecting worldwide data around the treatment of cancer, and extracting from it the best personal, unique treatment. In the case of the new coronavirus there is a more general need to just have access to data. Since the virus outbreak in Wuhan, China, there has been an explosion of new research articles on COVID19 and the SARS-CoV-2 virus that causes it. The total number of scientific publications about corona viruses alone has reached some 29 thousand. These are not only about the new virus, but also about the corona viruses that roamed the world before, like SARS and MERS. Either way, this makes it practically impossible to read all these articles. Instead, access to this literature has to be provided in a different way, allowing researchers to find the knowledge and data they need for their research.

Willighagen does this by organizing scientific literature, linking information, and filtering the collection of data and publications, making it searchable for scholars. He annotates publications with search terms and author names, and uses unique, global identifiers (like personal identification numbers) to support this. This is not unlike the use of phone numbers or dictionaries.

Various tools

Wikidata is the database used by Willighagen to link the information resources, along with Scholia to visualize the results. For example, Wikidata organizes data around the new virus with the entry. Willighagen uses these two tools to visualize what this database knows about specific topics.

Research can take advantage of a new open access resource edited by Willighagen: Also social media are used: Twitter is used to increase awareness and mobilize people. Willighagen: "That is from a personal motivation. I tweet articles that show important changes. Or if they emphasize aspects that show how unique and urgent the situation is." And finally there is WikiPathways, a project initiated by colleagues of Willighagen, to collect even more specific knowledge about the COVID19 virus. Here's the pathway about the SARS-CoV-2 virion:

Thursday, March 19, 2020

new paper: "Wikidata as a knowledge graph for the life sciences"

A figure from the article, outlining the idea
of using SPARQL queries to extract data
from the open knowledge base.
As a reader of my blog, you know I have been doing quite some research where Wikidata has some role. I am preparing a paper on the work I have done around chemicals in Wikidata, based on what I presented at the ICCS with a poster. So, I was delighted when Andra and Andrew asked me to contribute to a paper outlining the importance of Wikidata to the life sciences. The paper was published in eLife, which I'm excited about too, as they do a significant amount of publishing innovation.

I'll keep this post brief, as I have plenty of work to do, among which is SARS-CoV-2 data in Wikidata. Join this project, after you read the paper: Wikidata as a knowledge graph for the life sciences (doi:10.7554/eLife.52614, or in Scholia):

I'll write up some more queries for this eBook now: Wikidata Queries around the SARS-CoV-2 virus and pandemic.

Sunday, March 15, 2020

SARS-CoV-2, stuck at home, flu, and snowstorms

Scholia linking articles about the COVID19 disease.
Okay, okay, the snowstorm was ten years ago, when we were living in Sweden. We had two snowstorms, each time stuck at home, unable to leave our house. That was okay. We knew that within days the streets would be cleared, and we could continue living our lives.

Now it's different. I've been in 'social distancing' mode since the evening of Friday the 6th, so a bit over a week now. Because I have a flu. Presumably. Testing for SARS-CoV-2 is not routinely done and is reserved for risk groups and patients with severe COVID19 symptoms.

But the current situation is once in a lifetime. In the bad way. My generation has not had a situation like this yet. A real national emergency. But The Netherlands is coping. The data is scary. The situation in North Italy shows that humans are humans, and the virus doesn't care where it is surviving. It is how each country deals with it. And let me make clear, we must be learning from the countries that have been in the fire line already.

(North) Italy has a health care system in the top 5% according to OECD guidelines. Still, they were taken by surprise. But even the warned countries have been hesitant. The discussion is complex. A smaller economy (a 1% shrink is estimated right now) also means (as a Dutch professor pointed out 2, 3 days ago) there is less tax money to spend on the health care system.

Sad fact is, we are no longer talking about how to stop SARS-CoV-2. We are now talking about minimizing the number of casualties. A storm it is.

Keep safe, keep electronically in contact with the people around you (mental health), and foremost, wash your hands and practice social distancing. Let the storm not grow much further. This storm is not over the next morning. We're in for a rough ride.

Saturday, January 25, 2020

MetaboEU2020 in Toulouse and the ELIXIR Metabolomics Community assemblies

This week I attended the European RFMF Metabomeeting 2020, aka #MetaboEU2020, held in Toulouse. Originally, I had hoped to do this by train, but that turned out unfeasible. Co-located with this meeting were ELIXIR Metabolomics Community meetings. We're involved in two implementation studies that together amount to less than a month of work. But both this community and the conference are great places to talk about WikiPathways, BridgeDb (our website is still disconnected from the internet), and cheminformatics.

Toulouse was generally great. It comes with its big city issues, like fairly expensive hotels, and a very frequent public transport system. It also had a great food market where we had our "gala dinner". Toulouse is also home to Airbus, so it was hard to miss the Beluga:

The MetaboEU2020 conference itself had some 400 participants, of course, with a lot of wet lab metabolomics. As a chemist, with a good pile of training in analytical chemistry, it's great to see the progress. From a data analysis perspective, the community has a long way to go. We're still talking about known knowns, unknown knowns, and unknown unknowns. The posters were often cryptic, e.g. stating they found 35 interesting metabolites, without actually listing them. The talks were also really interesting.

Now, if you read this, there is a good chance you were not at the meeting. You can check the above linked hashtag for coverage on Twitter, but we can do better. I loved Lanyrd, but their business model was not scalable and the service no longer exists. But Scholia (see doi:10.3897/rio.5.e35820) could fill the gap (it uses the Wikidata RDF and SPARQL queries). I followed Finn's steps and created a page for the meeting and started associating speakers with it (I've done this in the past for other meetings too):

Finn also created proceedings pages in the past, and I followed that example too. So, I asked people on Twitter to post their slide decks and posters on Figshare or Zenodo, and so far we ended up with 10 "proceedings" (thanks to everyone who did!!!):

As you can see, there is an RSS feed which you can follow (e.g. with Feedly) to get updates when more material appears online! I wish all conferences did this!

Thursday, January 16, 2020

Help! Digital Object Identifiers: Usability reduced if given at the bottom of the page

The (for J. Cheminform.) new SpringerNature article template has the Digital Object Identifier (DOI) at the bottom of the article page. So, every time I want to use the DOI I have to scroll all the way down the page. That could be fine for abstracts, but it is totally unusable for Open Access articles.

So, after our J. Cheminform. editors telcon this Monday, I started a Twitter poll:

Where I want the DOI? At the top, with the other metadata:
Recent article in the Journal of Cheminformatics.
If you agree, please vote. With enough votes, we can engage with upper SpringerNature management to have journals choose where they want the DOI to be shown.

(Of course, the DOI as semantic data in the HTML is also important, but there is quite good annotation of that in the HTML <head>. A link out to RDF about the article is still missing, I think.)