chem-bla-ics

Please update your RSS URL for this blog

2024-07-30T16:11:00.000+02:00

Hi all, as posted about a year ago, I moved this blog to a different domain and a different platform. I notice that I still have many followers on this domain (and not on my new domain), including over 300 on Feedly.com alone. So, please update your RSS/Atom reader with the following info:

blog website: https://chem-bla-ics.linkedchemistry.info/
RSS URL: https://chem-bla-ics.linkedchemistry.info/feed.xml

Last post here / the Freebie model online

2023-08-18T08:39:00.001+02:00

This is my last post on blogger.com. At least, that is the plan. It has been a great 18 years. I would like to thank the owners of blogger.com, and later Google, for providing this service. I am continuing chem-bla-ics on a new domain: https://chem-bla-ics.linkedchemistry.info/

I, like so many others, struggle with choosing open infrastructure versus the freebie model. Of course, we know these things come and go: Google Reader, FriendFeed, Twitter/X (see doi:10.1038/d41586-023-02554-0). My new blog is still using the freebie model: I am hosting it on GitHub. But following the advice from a fellow cheminformatician, I now front this with a domain name I own.

See you at linkedchemistry.info!

Boiling points in Wikidata

2023-08-12T12:40:00.002+02:00

Some days ago, I started adding boiling points to Wikidata, referenced from Basic Laboratory and Industrial Chemicals (wikidata:Q22236188), David R. Lide's 'a CRC quick reference handbook' from 1993 (well, the edition I have). But Wikidata wants the pressure (wikidata:P2077) at which the boiling point (wikidata:P2102) was measured. Rightfully so. But I had not added those yet, because it slows me down and can be automated with QuickStatements.

I just need a few SPARQL queries to list the statements to which the qualifiers need to be added. Basically, all boiling points that have the book as a reference and do not have the pressure info. First, there are values with 'unknown value', which result in blank nodes (by the time you read this, they likely are already fixed):

SELECT ?cmp ?bp ?pressure WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
               ps:P2102 ?bp .
  ?bpStatement pq:P2077 ?pressure .
  FILTER (contains(str(?pressure), "http://"))
}

So, to get the list of statements that do not have any P2077 qualifier yet, and for which I want to write the QuickStatements, I use this query:

SELECT ?cmp WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
               ps:P2102 ?bp .
  MINUS { ?bpStatement pq:P2077 ?pressure }
}

At the time of writing, this lists 54 boiling points.

I can have the WDQS create CSV-style QuickStatements with:

SELECT (SUBSTR(STR(?cmp),32) AS ?qid) ?P2102 ?qal2077 WHERE {
  ?cmp p:P2102 ?bpStatement .
  ?bpStatement prov:wasDerivedFrom/pr:P248 wd:Q22236188 ;
               ps:P2102 ?P2102 .
  MINUS { ?bpStatement pq:P2077 ?pressure }
  BIND ("101.325U21064807" AS ?qal2077)
}

Here, the SPARQL variables double as QuickStatements instructions. Finally, note the use of "U21064807", which refers to the Wikidata item for kilopascal (wikidata:Q21064807).

I also need to "add" the boiling point again, to make sure QuickStatements knows which statement to add the qualifier to. I think this can be done better, but I am not sure how to target statements directly. This is not foolproof: I noted that this approach ignores the situation where there are two statements with the (exact) same boiling point but different error margins. But I will monitor that and correct manually where needed.
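As a minimal sketch of what the last query produces, the Java snippet below mirrors the SUBSTR(STR(?cmp),32) trick and assembles one CSV-style QuickStatements row. The class and method names, the item Q999999999, and the value 100 are hypothetical placeholders and not data from Wikidata; only the column names (qid, P2102, qal2077) and the kilopascal-suffixed pressure come from the query above.

```java
class QuickStatementsCsv {
    // The 31-character prefix that SUBSTR(STR(?cmp),32) skips in SPARQL
    static final String ENTITY_PREFIX = "http://www.wikidata.org/entity/";

    // Header row: the same names the query uses as variable names
    static String header() {
        return "qid,P2102,qal2077";
    }

    // Strip the entity prefix from a full item IRI, leaving the QID
    static String toQid(String entityIri) {
        if (!entityIri.startsWith(ENTITY_PREFIX)) {
            throw new IllegalArgumentException("Not a Wikidata entity IRI: " + entityIri);
        }
        return entityIri.substring(ENTITY_PREFIX.length());
    }

    // One row: item, boiling point value (repeated so the statement can be
    // matched), and the pressure qualifier of 101.325 kilopascal
    static String row(String entityIri, String boilingPoint) {
        return String.join(",", toQid(entityIri), boilingPoint, "101.325U21064807");
    }

    public static void main(String[] args) {
        // Hypothetical item and value, for illustration only
        System.out.println(header());
        System.out.println(row(ENTITY_PREFIX + "Q999999999", "100"));
    }
}
```

Letting the WDQS do this directly, as in the query above, saves the extra step; the sketch is only meant to make the string handling explicit.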

History, provenance, detail

2023-08-08T08:12:00.006+02:00

Just a quick note: I just love the level of detail Wikidata allows us to use. One of the marvels is the practice of using 'named as', which can be used in statements for subjects and objects. The notion and importance here is that things are referred to in different ways, and these properties allow us to link the interpretation with the source. For example, Max Born's seminal work Zur Quantenmechanik (doi:10.1007/BF01328531) uses a very short notation to cite other literature, as footnotes, and DOIs did not exist yet.

So, in Wikidata, you can capture this like this:

Blog planets: blogging about Debian, GNOME, Wikimedia, FSFE, and many more

2023-08-04T09:36:00.006+02:00

I am still an avid user of RSS/Atom feeds. I use Feedly daily, partly because of their easy-to-use app. My blog is part of Planet RDF, a blog planet. Blog planets aggregate blogs from many people around a certain topic. It's like a forum, but open, free, community-driven. It's exactly what the web should be.

It turns out that planets do still exist, so I started a small corner on Wikidata: Q121134938, plus items for a number of existing blog planets:

The software used to run these planets is ancient, though. We need a new generation of software, replacing things like Planet. And I want something people can easily host on GitHub or GitLab Pages or the likes.

I created a minimal shape expression, but the Wikidata items for the planets still lack a lot of information that can be added. First, we can think of them as venues, perhaps, where people "publish" their work. Second, we can annotate the blog planets with 'main subject' for the topics they cover. Or we can list the people that are "author" on the planet; most planets are very transparent about which blogs they aggregate.

Love to see where this is going. Who knows? Maybe we will see Postgenomic (see doi:10.1186/1471-2105-8-487) and Chemical blogspace resurface :)

Archiving and updating my blog

2023-07-27T11:24:00.000+02:00

This blog is almost 18 years old now. I have long wanted to migrate it to a version control system and at the same time have more control over things. Markdown would be awesome. In the past year, I learned a lot about the power of Jekyll and needed to get more experienced with it to use it for more databases, like we now do for WikiPathways.

So, time to migrate this blog :) This is probably a multiyear project, so feel free to continue reading it here. Why? Because I start with the old posts :) Along the way, I am fixing things and improving it. I still have plenty on my todo list, but I am already happy with having learned Font Awesome, which makes it easy to annotate how I fixed broken links (or not). I now use three icons: a box for when I use the Internet Archive (they can use your donation); a 'recycle' icon when I found a new URL for the same page; and a broken URL link for other situations.

This is what it looks like:

Universities and open infrastructures

2023-07-07T20:00:00.004+02:00

The role of a university is manifold. Being a place where people can find knowledge, and the track record of how that knowledge was reached, is often seen as part of that. Over the past decades universities have outsourced this role, for example to publishers. This is seeing a lot of discussion and I am happy to see that the Dutch Universities are taking back control fast now. For example, Radboud University (>1k followers) already joined the Fediverse (Mastodon etc), making them independent from non-EU law and commercial interests. Scientific journals, Nobel Prize winners, etc already joined too, btw.

This effort is calling for more universities to move in the direction of open infrastructures. I am looking forward to seeing all Dutch Universities post news on Mastodon, post videos on PeerTube, etc.

Would it not be awesome if the Fediverse would become the new multidimensional knowledge dissemination and peer review system we have all been waiting for?

Update: universities with a Mastodon account listed in Wikidata, on the world map: https://w.wiki/6zR3

Journal Rankings

2023-07-06T11:56:00.005+02:00

I am pleased to learn that the Dutch Universities are starting to look at rankings in a more scientific way. It is long overdue that we take scientific peer review of the indicators used in those rankings seriously, instead of hiding behind FUD around the decline of quality of research.

So, what defines the quality of a journal? Or better, of any scholarly dissemination channel? After all, some databases do better peer review than some journals. Sadly, I am not aware of literature that compares the quality of peer review in databases with that in scientific journals. Also long overdue, in my opinion.

I hope the Open Science community will help shape these scholarly dissemination channels, journals included. Some ideas. The outlet:

encourages post-publication peer review
communicates the post-publication peer review
allows small fixes and clarifications to be made easily (no hiding behind the version-of-record)
ensures supp info / additional files undergo the same level of peer review
uses modern solutions for communication (like semantic web technologies)
has clear licenses for all aspects of the research output
actively fights against visual-representation-only output and provides all data
guarantees that supp info / additional files are archived, as the output itself
adopts, promotes, requires community standards (including global, unique identifiers)

Okay, these items are pretty broad. Many of them are part of FAIR, but that should not surprise you, because FAIR is just applying traditional scholarly approaches, like properly keeping notebooks. It's just a bit more "digital" than we have been taught.

Do we know how to do this? Yes, pretty much. This is not a technical exercise, but one of social change and particularly willingness. Basically, if you want to keep the current way of doing things, then you declare you want unreproducible, low quality research reporting. That's your academic freedom, of course. If I were a funder or a university, I would also expect a bit more in return for my money.

Let me stress, glossy articles are fine! You do not have to stop that. Media appearances, keynotes, these are all also fine. They are, however, complementary. We should not continue the habit of fancy narratives as replacement for quality research dissemination. Do both, if you must.

Qeios, an open dissemination platform for research output

2023-07-02T11:02:00.001+02:00

A bit over a year ago I got introduced to Qeios when I was asked to review an article by Michie, West, and Hastings: "Creating ontological definitions for use in science" (doi:10.32388/YGIF9B.2). I wrote up my thoughts after reading the paper, and the review was posted openly online and got a DOI. Not the first platform to do this (think F1000), but it is always nice to see some publishers taking publishing seriously. Since then, I reviewed two more papers.

One of these latter two was not a traditional paper, but a different kind of research output: a definition, about "Drive-by Curation" (doi:10.32388/KBX9VO). Now about this output type: collaboratively working on definitions is something core to ontology development (e.g. see doi:10.1186/s13326-015-0005-5), but there is a clear need to discuss terminology. The GRACIOUS project in the EU NanoSafety Cluster also recognized this and set up a tool for this, their Terminology Harmonizer (doi:10.1016/j.impact.2021.100366).

This GRACIOUS tool helps users much more than Qeios does. Unfortunately, it sits behind a login, and this is where these topics nicely come together: writing definitions, thinking about when some zeta potential is different from another zeta potential, and (drive-by) community curation all need transparency. I understand the choice, but landing on a login page is for me a recipe for a silent death, as it keeps people from learning without making a (time) investment first. That is what Qeios does differently: it is more FAIR.

So, that brings me to my last point in this post. Jente Houweling and I wrote up a definition for "Research Output Management" (doi:10.32388/ZNWI7T), based on our discussions about her research insights. See the screenshot below.

It has been reviewed internally, and by one independent peer (doi:10.32388/C3SJTN). But we would love to hear your review too. Just follow the instructions online. We are looking forward to reading your thoughts and to refining our definition.

Twitter exits FAIR and is no longer a dissemination solution

2023-07-01T08:51:00.008+02:00

Update: Musk said this was a temporary measure. The problem was scraping of content, you know, the content we openly share on Twitter. Maybe they could have done this with APIs. Oh wait, they closed those behind a very expensive paywall. Update 2: Another rumor is that they forgot to make a deal with a cloud provider and suddenly were left with a fraction of the computing power.

And just like that, without a warning, Twitter changed policies again, and you now need a Twitter account and to be logged in to see public tweets: Twitter has started blocking unregistered users (The Verge). Though I learned it first via Mastodon, of course.

For example, this is what happens when you go to twitter.com/wikipathways:

Fortunately, WikiPathways does have a Mastodon account that anyone can see without having a Mastodon account. You can even follow WikiPathways's account with its RSS feed. Dissemination should not be paywalled.

Maybe Musk has been talking to Elsevier and Springer Nature.

Tip: Finding Mastodon accounts with Wikidata (a few SPARQL queries)

Community activity #2: FAIRsharing

2023-06-11T21:04:00.005+02:00

Some years ago we started the ELIXIR Toxicology Community. It has been an interesting journey (partly covered in this whitepaper). We started from the interactions we already had in several projects, but particularly from the potential I see in this. This series of posts covers a number of things toxicology projects can do to benefit from ELIXIR solutions ("services"). The posts were first sent to the ELIXIR Toxicology Community mailing list (please join!).

History

In this post, let's look at FAIRsharing. It is "A curated, informative and educational resource on data and metadata standards, inter-related to databases and data policies" [0,1].

The ELIXIR Toxicology Community (we) maintains the toxicology corner of this database, and members of our community have been adding toxicology-related databases and relevant standards. On the side of the policies we are falling a bit short: fairsharing.org/Toxicology.

Why adopt FAIRsharing

FAIRsharing is one place where metadata can be shared about your databases. It helps make your resources and research more FAIR and explains to people how your work relates to other work (fairsharing.org/graph/3496):

What you can do

Get an account (with your ORCID or GitHub account) and add resources important to your research, your projects, your work generally. Particularly, (data) policies and standards you are expected to comply with are useful. Also, links between various resources. For example, if some (project) database complies with an important policy or standard, this is worth seeing show up.

Alternatively, join the ELIXIR Toxicology Community mailing list and post the missing resource there, or use our issue tracker at github.com/elixir-europe/toxicology-community/issues/.

Let's make toxicology more FAIR.

0. https://www.nature.com/articles/s41587-019-0080-8

1. https://scholia.toolforge.org/work/Q64084285

Information Retrieval versus ChatGPT

2023-05-31T07:53:00.001+02:00

Last week, at a large (and relevant) Dutch research event, ChatGPT came up, along with the claim that it was going to change the world. Even the critiques came up, but were effectively disregarded with "these methods get better very quickly". This is not untrue, but not really true either. I murmur "not even wrong". I know how hard it is to get computers to find meaningful patterns; I did a PhD in this in the early 21st century.

What strikes me is that ChatGPT is now pitched as an information retrieval (IR) system. This is a system that tries to find information, that is, it "retrieves" information from a knowledge base. Like SQL or SPARQL. Or like Google Maps. IR is about reproducing existing knowledge.

Now, deep learning starts with a different premise: we can find the patterns and in this way compress an unlimited number of facts into a mathematical equation, a physical law. That way, you do not have to record if the sun comes up every day. We predict it does. We do not have to record that rain drops will fall (that they do; when they do, however, actually is something to record). At best, we would record when rain drops start "falling" to the sky. That is, we have the laws of gravitation.

But here lies the problem with systems like ChatGPT: they are only as good as the predictive patterns they learned. But they do not retrieve information. They predict information. This is why it doesn't know about references. It lost the link between predictions and the shelf on which the book was stored.

So, when last week the research event mentioned that lawyers were starting to use it, citing existing work, I was skeptical: that would actually mean they moved ChatGPT into IR. And I already had learned (*) that ChatGPT would predict references, rather than look them up. It's a prediction method, not an IR method. So, how come it would accurately give citations to court cases?

It didn't. It's all over the news now. It "hallucinated" legal citations.

Does this matter? I think it does. This is why I moved my research focus after my PhD back to IR, away from the machine learning. Deep learning can only generalize the facts, so we better start accurately recording facts. This is why I study interoperable and reusable knowledge bases, like WikiPathways, Wikidata, technologies like RDF in science, etc. Actually, this realization predates my machine learning. I guess I already had this notion when I started the Woordenboek Organische Chemie back in the nineties.

Someone has to. I just hope the funding for this fundamental aspect of research doesn't run out. Information Retrieval will remain essential to science for a few decades more.

Paper: "The FAIR Cookbook - the essential resource for and by FAIR doers"

2023-05-22T07:23:00.000+02:00

I think that if you want to make your knowledge FAIR, you should use an open license and RDF. Simple. Now, not everything is knowledge. A lot of data is, but a lot more is not, think raw data. Using RDF to explain a protein sequence is still something that makes me feel uneasy.

However, first, you need to make RDF, you need to make assumptions explicit, you need to decide on meaning. Making RDF is not easy. It's not hard, just a lot of administration and scientific thinking. What did I measure? What model do I use to describe the chemistry? You know, my research job.

Moreover, not only data should be FAIR. All research output (worth communicating) should be FAIR.

In the past, Andra Waagmeester invited me to co-author a recipe that explains the general steps of creating RDF. This was during the Open PHACTS project and with Carina Haupt. Writing recipes is gaining traction. They are a bit like vignettes from the R world.

In the past few years the FAIRplus project created a FAIR Cookbook with recipes and I wrote a few. Actually, I still have a few to finish, for which I cannot find the time. In retrospect, I spent too much time on perfecting the recipes to finish them earlier. The FAIR Cookbook is now a professional venue with an editorial board. It is fully open source and welcomes your recipes. Oh, and it is now hosted as an ELIXIR service, which is great to see!

Finally, the paper "The FAIR Cookbook - the essential resource for and by FAIR doers" is out. Go read it :)

Figure 2 from the article: "Citability of recipes and identification of and credit for authors; an example is provided."

Paper: "Extending inherited metabolic disorder diagnostics with biomarker interaction visualizations"

2023-05-12T21:56:00.004+02:00

When I joined the BiGCaT research group in 2012 I was particularly interested in the open science approach of WikiPathways. As a chemist by training and researcher in cheminformatics, metabolites and their metabolic reactions took my particular interest. I am happy that I have been able to fund Denise's research project. And thanks Denise for this very exciting research. I know it's just a first step, and far more translational steps are needed, but I for one am very excited to bridge molecular info to clinical outcomes.

In this study, Denise explored how we can take advantage of molecular pathway databases to link biomarker information: "Our framework integrates literature and expert knowledge into machine-readable pathway models, including relevant urine biomarkers and their interactions. The clinical data of 16 previously diagnosed patients with various pyrimidine and urea cycle disorders were visualized on the top 3 relevant pathways. Two expert laboratory scientists evaluated the resulting visualizations to derive a diagnosis" (doi:10.1186/s13023-023-02683-9).

Figure 4 shows what such a visualization of those biomarker interactions can look like:

And I am hugely proud of the open science approach, from GitHub repo, open source R code, SPARQL queries. Thank you, Denise! And thanks to Dr Laura Steinbusch for this nice collaboration! Further acks in the article.

Community activity #1: Bioschemas annotation

2023-05-12T21:34:00.002+02:00

History

In this post, let's look at Bioschemas annotation. Our community has been collaborating with the Bioschemas project for a long time. The development of the ChemicalSubstance profile was started by our community at one of the earlier ELIXIR BioHackathon Europe events. The "ChemicalSubstance" (ChemicalSubstance/0.4-RELEASE) is the material equivalent of "MolecularEntity" that was already being developed. ChemicalSubstance is used in various places, see bioschemas.org/developer/liveDeploys and select the ChemicalSubstance profile.

We have also been using Bioschemas annotation of training material, see https://www.dtls.nl/2018/07/19/toxicology-data-management-tutorials. This is still being used in some of the projects I am involved in.

Third, multiple NanoSafety Cluster projects are very actively using "Dataset" annotation to make their datasets findable in Google's Dataset Search. For example, notice the mention of nanocommons.github.io when searching for "toxicity nanomaterial": datasetsearch.research.google.com/search?query=toxicology%20nanomaterial. That shows up because of the Bioschemas annotation on this NanoCommons overview of citable (i.e. with DOI) and openly licensed datasets: nanocommons.github.io/datasets/. We're also working with data platform developers in the NanoSafety Cluster to use such annotation, supporting the F in FAIR.

Why adopt Bioschemas

Bioschemas is a life sciences-oriented extension of schema.org, a platform originally set up by the major search engines, as is clear from the Google Dataset Search engine.

What can you do

Joining this effort helps make the research output of your toxicology project more FAIR. Bioschemas (and schema.org) annotation can be added to any HTML page. The Bioschemas project can assist. We want to organize workshops this year to work with toxicology projects for wider adoption. But please feel encouraged to ask questions on this mailing list prior to such activities. One thing I am particularly interested in, is people interested in setting up web sites with information and data about specific chemicals or nanomaterials.

Please also feel encouraged to reply if you already used schema.org or Bioschemas in any of your projects to share your experiences.

Let's make toxicology more FAIR.

CiTO updates #4: annotations in datasets

2023-04-02T09:11:00.006+02:00

Okay, the pilot is over, ending with 17 papers, 16 of which have CiTO annotations (and so far 4 J.Cheminform. papers after the pilot), but my interest in the Citation Typing Ontology continues and we just need more adoption.

Datasets as source of annotations

So, here's a quick Wikidata update. I have been using Wikidata as infrastructure to collect and share CiTO annotations (see also the below "Scholia patch" posts). Some time ago I recovered my CiteULike CiTO annotations and made them available on Zenodo (doi:10.5281/zenodo.7368209).

And while thinking about datasets with CiTO annotations, I found two other datasets. One was from an article in Portuguese and one from an article by Peroni et al. with this data file. That data file is actually a zip, but inside the zip file is a CSV file with three interesting columns: cited_doi, citing_doi, and intext_citation.intent. There are many more columns and I can highly recommend browsing them. But these are the three I need to add data to Wikidata. The third column is free text, but uses the CiTO labels, making it relatively easy to convert to the citation intentions from Wikidata (PS, thanks to Fvtvr3r for adding more!).

So, I had a cleaned file and started writing a Groovy Bioclipse script using Bacting. It basically does a few things: extract all DOIs, check which ones are in Wikidata, analyze the intext_citation.intent column content, and then generate QuickStatements (see this gist). Out of the 600 lines from the input, it creates some 200 new CiTO-annotated citations in Wikidata between some 150 article pairs.
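The first and third of those steps can be sketched in plain Java. This is not the Groovy script from the gist, just a minimal illustration: the class and method names are made up, the DOIs and intent labels in the example are placeholders, and the naive comma split assumes the cleaned file has no quoted fields containing commas. Checking Wikidata and generating the QuickStatements are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CitoColumns {
    // Pick {citing DOI, cited DOI, intent} out of each data row, looking the
    // columns up by header name because the file has many more columns.
    static List<String[]> citations(List<String> csvLines) {
        List<String> header = Arrays.asList(csvLines.get(0).split(",", -1));
        int citing = header.indexOf("citing_doi");
        int cited = header.indexOf("cited_doi");
        int intent = header.indexOf("intext_citation.intent");
        if (citing < 0 || cited < 0 || intent < 0) {
            throw new IllegalArgumentException("Expected column missing from header");
        }
        List<String[]> result = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cells = line.split(",", -1); // naive: assumes no quoted commas
            result.add(new String[] {
                cells[citing].trim(), cells[cited].trim(), cells[intent].trim()
            });
        }
        return result;
    }

    // All distinct DOIs, i.e. the ones that would be looked up in Wikidata.
    static Set<String> distinctDois(List<String[]> citations) {
        Set<String> dois = new LinkedHashSet<>();
        for (String[] citation : citations) {
            dois.add(citation[0]);
            dois.add(citation[1]);
        }
        return dois;
    }

    public static void main(String[] args) {
        // Placeholder rows, not taken from the actual dataset
        List<String> lines = List.of(
            "citing_doi,some_other_column,cited_doi,intext_citation.intent",
            "10.1234/a,x,10.1234/b,example_intent_one",
            "10.1234/a,y,10.1234/c,example_intent_two");
        List<String[]> cites = citations(lines);
        System.out.println(cites.size() + " citations, "
            + distinctDois(cites).size() + " distinct DOIs");
    }
}
```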

The ability to include CiTO annotations from datasets is another welcome boost for the CiTO statistics in Wikidata. This SPARQL query shows an overview of sources that support the CiTO intention annotation, but note that a claim with a CiTO intention may also have CrossRef, PubMed, and COCI as reference. In those cases, they are primarily for the citations and not the intention.

There are now (the latest stats are here) 1202 citation intention annotations in Wikidata for 992 citations from 405 articles in 199 venues. Of these, 27 articles have explicit annotations in the article itself and are found in 4 venues (two journals and two preprint servers). These annotated citations are to 510 articles in 190 different venues. This Scholia patch will add a new statistic, the number of datasets providing citation intentions, of which there are (as discussed) currently two in Wikidata. The latter two provide intentions for the majority of articles and are depicted in yellow in the below overview.

With an annotation in a 1938 article by Alan Turing! I ran into this article in November 2011 when noticing an apparent duplicate title in his article list. It turned out an earlier article had a correction with the same name. I added this clarification:

This is very trivial citation intention data that publishers could provide as open data.

Okay, that will do for today. There are actually some really interesting things in the pipeline, but I will have to write about that later. I have some deadlines I should start looking at. Below is some extra reading.

Some more history

BridgeDb NWO grant update #7: wrapping up the project

2023-03-12T10:56:00.002+01:00

I have received the request to write up the final reporting and the paid practical work has been completed (we already said goodbye to Helena almost a month ago). After the hackathon last month, we released BridgeDb Webservice 2.1.0 and actually had this online for about a week.

Unfortunately, this week we ran into a few regressions and I restored the previous version, solving those issues. Issues were created and solved this week(-end), resulting in the 2.1.1 release.

While fixing, tests were created for the problems, along with tests for other API methods. This was interesting in itself, because it requires firing up a BridgeDb Webservice in the background and actually loading a Derby file (we need some data to test for). Firing up the webservice is one thing (I'm just hoping the port is open when the test runs), but we also need to create two temporary files. One is the gdb.config, which points to the Derby file, and the other is the Derby file itself. But both are distributed in Java archive files (JARs), so they need to be saved to a temporary file first. That was doable :)

    public static void startServer() throws IOException {
        // set up a test Derby file
        File derbyFile = File.createTempFile("bdb", "bridge");
        derbyFile.deleteOnExit();
        InputStream stream = RestletServerTest.class.getClassLoader().getResourceAsStream("humancorona-2021-11-27.bridge");
        FileOutputStream derbyStream = new FileOutputStream(derbyFile);
        stream.transferTo(derbyStream);
        derbyStream.close();
        stream.close();

        // set up the GDB config file
        File configFile = File.createTempFile("gdb", "config");
        configFile.deleteOnExit();
        FileOutputStream outputStream = new FileOutputStream(configFile);
        BufferedOutputStream bufferStream = new BufferedOutputStream(outputStream);
        String configFileContent = "*\t" +  derbyFile.getAbsolutePath();
        bufferStream.write(configFileContent.getBytes());
        bufferStream.close();
        outputStream.close();

        // set up the REST service
        RestletServerTest.server = new RestletServer();
        RestletServerTest.server.run(port, configFile, false, false);
    }
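On the "hoping the port is open" worry: one common way to avoid a clash is to ask the operating system for a free port just before starting the service. The sketch below uses only the Java standard library; the FreePort class name is made up for this example, and whether the resulting number can be handed to the server in place of the port field used above is an assumption. A small window remains in which another process could grab the port between closing the probe socket and starting the server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;

class FreePort {
    // Binding to port 0 lets the OS pick a currently unused TCP port;
    // the probe socket is closed again by the try-with-resources block.
    static int find() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Free port: " + find());
    }
}
```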

During the grant, one task was to set up better testing. We did, but for the new webservice, this was not put in place yet. Particularly, the code coverage of the testing was not set up. That I did this week using the CodeCov services which are free for open source projects. That gives these results:

There clearly is work left to be done. The current testing focuses on the common functionality and the new (alpha) JSON functionality is mostly not tested yet. This will change in the next few weeks.

So, that leaves the reporting and writing the journal article. And cleaning up the lab, of course.

Previous updates

Paper: "PSnpBind-ML: predicting the effect of binding site mutations on protein-ligand binding affinity"

2023-03-12T09:19:00.003+01:00

Ammar Ammar in my group just published the second half of his cheminformatics study into what happens with binding affinities when the proteins show amino acid changes, selected based on world-wide population statistics. His idea was that drugs should be designed to not be selective for a particular genotype. The first paper (see this post) tells the story about how to automate running thousands of docking experiments and explains how to put this knowledge base online, while the paper this month explains how machine learning can learn the patterns found in those docking experiments:

The idea in the PSnpBind-ML paper is simple. We can calculate binding affinities for many ligand-protein complexes. If we calculate enough, one can create a QSAR model that includes ligand and protein information (to capture the SNP uniqueness) to predict that affinity with the QSAR model. That will be a lot more scalable. The 2023 article has the full details, and everything is open.

One thing that fascinated me when Ammar proposed this study is the notion that in this way, for each ligand, we can see how stable the binding affinity is over the various protein variants. Are there proteins which are harder to target because of the variants? Do certain classes of chemical structures show a lot of binding differences over the world-wide protein diversity?

Also, I was interested in whether it would work in the first place. I remember from my own PhD days (some 20 years ago now) that docking experiments had a fairly high prediction error. So, when I see this plot on the independent test set, I am intrigued:

So, what about the binding affinity variation? Ammar did not put this figure in the paper, but sent me a copy to put online here.

We here see a boxplot (yeah, there are better alternatives, I know...) showing quite a bit of variation in the binding affinities for the various variants of human Pim-1 kinase (with crystal structures 2C3I, 3BGZ, etc). These plots show that the variation is mostly high, but sometimes quite small indeed. I don't really see a pattern here.

And totally in line with open science, each combination of docked ligand and mutated protein can be looked at online with Jmol, e.g. this one:

In this case, the amino acid change is right next to the ligand. Ammar selected them as such. Of course, biology in reality is much more complex. And maybe the differences we see here are not even significant compared to other effects.

But one thing keeps me wondering, and I hope someone can explain this to me: in the past I would see experimental data on ligand-protein binding referring to the protein, but not so much the protein variant. We would need a lot of experimental measurements of ligands binding to protein variants to validate this.

But despite all this uncertainty about the biological and drug discovery implications, there is another reason why I am really happy about this story. First, the openness and the ability to share it FAIR-ly online (check his use of w3id, e.g. https://w3id.org/psnpbind/protein/2c3i), and, second, the notion that we can do things now at this scale. With all the deep learning discussions ongoing, the ability to inspect in detail what these models do, how they behave, the "explainable AI" if you like, is essential and Ammar showed here how to do that.

Thinking back about the study about pKa's of warfarin tautomers, being all over the place from very basic to very acidic, it is nice to see some data on the effect of the SNPs on the binding affinities.

I am sure you have some thoughts on this work. We did ask someone about the idea before we started, and we were told it had limited use. Use the comment section, or better even, write a reply blog post on your own platform, or send us an email. Looking forward to hearing from you.

Why I free up time to give lectures (and about ChatGPT)

2023-02-19T09:19:00.001+01:00

This week a colleague whom I highly respect asked me why, if I was already so busy (regularly close to overworked), I gave talks and often freed up my time for that. A valid question. The Drew-reaction here is to say "it is part of scientific communication and dissemination". But does that hold when writing deliverables (also communication and dissemination) should take priority?

So, here's my Gun-reaction. I think there are two aspects I take into account on top of the "this is what scholars do" and "I learned it like this": the need for debate, the need for human collaboration. Arguably, these are the same thing, but intuitively I think the first is actually more about deepening our understanding, while the second is more about gratification. Interestingly, the first is more about Gun while the second is more about Drew. The second is why so many people like ChatGPT, the immediate gratification: it fills our immediate needs for facts. ChatGPT is however Drew, not Gun: it associates and does not reason.

So, how about the debate? Reading science books, watching Veritasium, these are communication and an attempt at dissemination. But without the sparring, without the debate. And we know from theory the importance of saying out loud what you think you know. Think Feynman's claims about teaching.

Interestingly, this is why I enjoy data curation: it requires me to teach others what I think I know. Annoyingly, it also makes me very aware of the tiniest mistakes people make. This has helped me (somewhat) as editor, but at the same time I find this very tiresome and frequently depressing.

That brings me back to the giving of lectures and presentations. If I do my job well, I will get questions. I will be challenged, and that demands that I activate my knowledge. This, of course, is the scientific debate. This is why there is so much to say for open peer review of journal articles. It does wonders with peer reviewing open source (we have been employing peer review in the Chemistry Development Kit for almost two decades now).

A lecture, a talk, it is for me an essential part of ensuring the quality of my research. Explaining to others what I know is part of my research. Absolutely worth making time for.

BridgeDb NWO grant update #6: second hackathon

2023-02-11T09:24:00.001+01:00

This week the 2nd NWO Open Science BridgeDb grant hackathon took place. In all honesty, I had hoped we could open it up to a much larger community, but in our defense, the grant team is small, and we were flooded with various viruses in The Netherlands. Second, we need to get a lot of community feedback on additional identifier mapping needs, beyond support for the Simple Standard for Sharing Ontological Mappings (SSSOM). This needs, however, more coding and we do not have the resources for that right now. Nevertheless, we had a great hackathon with people involved in the grant, including several people from other projects (aka "matching").

Projects

Before the meeting, several project ideas were written down, mostly related to remaining open tasks of the grant proposal (see Update #5). On the first day, the BridgeDb 3 Docker and BridgeDb Webservice JSON support were merged, which actually made sense. The work of Helena, Marvin, Ozan, and Javi paid off. The new docker is on DockerHub, automatically made with GitHub Actions (see top right screenshot). They overcame multiple small issues, like CORS support, port matching, dynamic configuration, etc. But this Docker should be easily deployable and allow projects like VHP4Safety and EOSC to automagically keep up with the latest BridgeDb software and data.

Other projects focused on ID mapping databases. Myself, I worked on the first nanomaterial ID mapping database, an idea that was first pitched back in 2013 in the eNanoMapper proposal. So far, there was so little data and so few databases around that the ID mapping was never really needed. This is, fortunately, changing. For this, new releases of BridgeDb Datasources and BridgeDb Java were needed. Along the way, Tooba and Ammar worked out a short recipe for how to inspect the content of BridgeDb ID mapping databases, which technically are Apache Derby files.

At the end of the meeting, I updated the BridgeDbR package (2.9.1) with the latest Java libraries and looked into the technical possibility of a PathVisio3 release with the latest BridgeDb. But we really need an NWO Open Science or eScience Center grant for PathVisio to continue the work started by the COVID19 ZonMw grant. Funders that want to support important life sciences research are strongly encouraged to contact us and help us write the grant proposal that they want to fund.

Next

The grant funding is about to run out and its contribution to our research software position is too. As such, our focus is now going to be on writing the final report. This hackathon greatly contributed to the results and it was a wise decision to include those in the grant proposal.

Previous updates

Citation Typing: progress but we need more uptake

2023-02-05T11:01:00.004+01:00

It is now almost thirteen years ago that Prof. Shotton wrote their article about CiTO, the Citation Typing Ontology (doi:10.1186/2041-1480-1-S1-S6). For a long time it was the only article with CiTO annotations in the article itself, explaining why the authors cited those articles, here reference 8 from Shotton's article:

I wanted this. I was collecting reasons why people were citing the Chemistry Development Kit articles. I started using it, CiteULike added support. Sadly, CiteULike got shut down at some point.

Fast forward to 2020, we started a Pilot in the Journal of Cheminformatics to allow authors to annotate their citations as in the above reference 8 with a compact notation (doi:10.1186/s13321-020-00448-1). I have been collecting these explicit CiTO annotations (unlike the post-publication annotations I collected in CiteULike) in Wikidata and summarized them in Scholia, and this is what it looks like in Wikidata for an article:

This two-year Pilot has now been concluded (doi:10.1186/s13321-023-00684-1) and I wrote a commentary on how authors used it during these two years: "Two years of explicit CiTO annotations" (doi:10.1186/s13321-023-00683-2). I am happy to see authors continue to annotate their articles! The histogram below shows the number of articles per year with explicit annotation; besides the Journal of Cheminformatics, you can find additional articles on two preprint repositories!

Mind you, I know there are already BioHackrXiv preprints with CiTO annotation in 2023, but I am not keen on putting preprints in Wikidata. I would know, because one is the preprint describing CiTO support in BioHackrXiv (doi:10.37044/osf.io/6rjvc):

So, we are making progress, but a lot needs to happen. We need more journal editors to support CiTO annotation in submissions. For Springer Nature journals this is technically easy, but the (publisher) editors need to monitor the typesetting to ensure the pubnotes do not get lost.

What else? Well, we need databases like PubMed, EuropePMC to support this too. We need some FAIR formats to support sharing post-publication CiTO annotation, like I used CiteULike for, but also done in literature studies, e.g. like this paper by Duca et al.

And we need support in tools like Zotero and EndNote. This is actually non-trivial, because the CiTO annotation is linked to the citation not to the bibliographic information in the tool. So, it needs to be integrated at the level of the Word/Google Docs plugin.
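A small data-model sketch shows why this is non-trivial. It is only an illustration and not how any reference manager actually stores things: the class names and the two citing DOIs are hypothetical, and the pairing of intentions with the cited paper is made up. The CiTO terms themselves (cito:usesMethodIn, cito:citesAsAuthority) are real properties of the ontology.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Reference:
    """A bibliographic record, as a reference manager library stores it."""
    doi: str
    title: str


@dataclass
class Citation:
    """One link from a citing document to a reference, with its own intentions."""
    citing_doi: str
    reference: Reference
    intentions: List[str] = field(default_factory=list)


def intentions_by_citing_article(citations, cited_doi) -> Dict[str, List[str]]:
    """Collect, per citing article, why it cites the given reference."""
    return {
        c.citing_doi: c.intentions
        for c in citations
        if c.reference.doi == cited_doi
    }


# One library entry, reused by two (hypothetical) manuscripts for different reasons.
cito_paper = Reference("10.1186/2041-1480-1-S1-S6", "CiTO, the Citation Typing Ontology")
citations = [
    Citation("10.1234/example-a", cito_paper, ["cito:usesMethodIn"]),
    Citation("10.1234/example-b", cito_paper, ["cito:citesAsAuthority"]),
]

reasons = intentions_by_citing_article(citations, cito_paper.doi)
```

The same Reference object appears twice with different intentions, so the annotation cannot live on the library entry. It has to be stored with each citation in the manuscript.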

I was also thinking that what I miss is an overview of datasets that use CiTO. Just the list of articles citing the original CiTO paper does not seem to do justice to the use in databases.

I have good hopes the story will continue. The wide adoption of Open Science has already taken more than two decades. I can wait a bit longer for wide adoption of CiTO.

Scholia timeline

2023-01-27T16:06:00.002+01:00

Source.

Sometimes I think back about how Scholia started, and then I think I remember a Twitter discussion. Twitter was a social platform that was unable to fight hate speech. I left it last year in favor of Mastodon.

Anyway, I did some digging today and found this thread from October 8-9 2016. A few days earlier, Finn had created a profile based on data in Wikidata on his homepage, which I was very happy about. You can see how Dario suggests to put that webpage up on Toolforge. For completeness, this is the first commit, October 9.

This chat was after @fnielsen's blog post about the idea of the needed open infrastructure and a possible Wikidata solution from September 2016. Finally, it was also only half a year before Scholia got mentioned in Nature.

BTW, at the time there still was a focus on bibliographic information. We learned since that the Wikidata platform cannot technically meet the needs, at least not at this moment. Instead, the focus is now much more on the literature that supports the knowledge in Wikidata and Wikipedia and on making that as interoperable as possible.

Doing the "Open Science Challenge"

2023-01-15T12:47:00.010+01:00

Screenshot of the sign up page.

Triggered by the "reflections on your career" in the announcement I decided to give the Open Science Challenge by Heidi Seibold a try: "12 emails over the course of a month that are designed to help you on your Open Science journey."

I will post here my replies to the various challenges, by linking to the first Mastodon post, allowing you to follow the replies:

Day 1: Why am I participating
Day 2: Your Open Science peers
Day 3: Write down all of your projects and put them in a (im)portant/(un)passionate matrix
Day 4: Stop working on your CV
Day 5: Open Materials
Day 6: Open Code
Day 7: Mindsets that hold you back
Day 8: Science Communication
Day 9: Social Change
Day 10: Open Access
Day 11: Ethics and Research
Day 12: Wrap up

Paper: "Guiding the choice of informatics software and tools for lipidomics research applications"

2022-12-25T10:51:00.003+01:00

Screenshot of this LIPID MAPS webpage.

One of the outcomes of the EpiLipidNET COST action is a paper about the data analysis of experimental lipidomics data: Guiding the choice of informatics software and tools for lipidomics research applications (doi:10.1038/s41592-022-01710-0).

Our BiGCaT team wrote up BridgeDb for identifier mapping and WikiPathways for pathways/enrichment analysis. See also the WikiPathways Lipids Portal.

But I also wanted to map the tools from the article to ELIXIR databases, particularly FAIRsharing, bio.tools, and TeSS. I wish journals would just require this as part of the wish to make science more FAIR. While at it, I realized I could also add Wikidata item annotations and link to Scholia (see also these blog posts). And while at it, I improved the links between the items on the software with the journal articles describing the software and tools, including the citation networks.

I know it takes time and I would have loved to have this curation done before the publication. But I couldn't. So, I just started adding the annotation in this GitHub repository:

BridgeDb NWO grant update #5: BioHackathon, Webservice, Bioregistry

2022-12-11T08:43:00.003+01:00

Our new Mastodon account.

So, I had a lot of teaching, and that, besides project deliverables, final reports, and a few project meetings, left me with little time to blog my monthly BridgeDb NWO grant update. But here goes, as a lot did happen in the background.

First, some outreach:

2022-09-15, Open Science in Practice Webinar Series: BridgeDb and Wikidata: a powerful combination generating interoperable open research (video)
2022-10-20, UM Data Science Research Seminar: Making research output FAIR with Wikidata

Work Package 1

In WP1 we continued working on the core BridgeDb library (after we split out the BridgeDb Webservice into a separate repository). Last time we reported about Bioregistry support (used on the new WikiPathways website). I am happy this paper now got published. BridgeDb Java 3.0.16, 3.0.17, and 3.0.18 have been released. No big changes, but mostly additional features for WikiPathways and the upcoming libGPML and PathVisio 4.0. The latest release also comes with an updated BridgeDb Datasources.

Work Package 2

This is where the most work happened in the last few months. Helena has been working on the BridgeDb Webservice code. This code was 10 years old and desperately needed an upgrade. In the proposal we mention JSON, compact identifiers (or the CURIEs from Bioregistry), and more FAIRness. After a few weeks of learning the used REST library and hacking, Helena got content negotiation working and we are happy to report that the upcoming release will support JSON. Even better, it also solves the problem we had with the Docker image.
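For readers who have not met content negotiation before: the client says in the HTTP Accept header which formats it prefers, and the server picks the best one it can produce. The sketch below is a generic illustration in Python and not the BridgeDb Webservice code. The function names and the list of supported formats are assumptions made for this example.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0  # the default weight when no q parameter is given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        entries.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def negotiate(header, supported, default):
    """Pick a response format; None means nothing acceptable (HTTP 406)."""
    if not header.strip():
        return default
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return default
    # Partial wildcards such as text/* are deliberately not handled here.
    return None


# Assumed formats for this sketch only.
SUPPORTED = ("text/plain", "application/json")

as_json = negotiate("application/json", SUPPORTED, "text/plain")
browser = negotiate("text/html;q=0.9, application/json;q=0.8, */*;q=0.1", SUPPORTED, "text/plain")
no_header = negotiate("", SUPPORTED, "text/plain")
```

The design choice here is that existing clients that send no Accept header keep getting the old default format, while new clients can opt in to JSON.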

Work Package 3

Mapping databases continue to be updated. For the metabolite identifier mapping database, Denise has been looking into wrapping this in a Docker image and all future mapping databases will use schema 4 of the BridgeDb database schema (which supports primary/secondary identifier annotation). Denise, Martina, Tooba, and I participated in the ELIXIR BioHackathon Europe where we worked on several projects. Project 26 looked into more identifier mapping with Wikidata and PubChem and improving interoperability of the Bioschemas MolecularEntity profile. We also spoke with the TogoID team from Japan on interoperability and possibilities of collaboration.

Next

With a few months left until the end of the project, we are going to focus on wrapping up the progress. The webservice needs updated OpenAPI documentation, a proper release, and documentation on how anyone can run a local BridgeDb Webservice easily (and we can update the EOSC instance). We also have a (ELIXIR) stakeholder-oriented workshop to organize.

Oh, and we created a Mastodon account: @bridgedb@fosstodon.org!

Previous updates