Saturday, June 16, 2018

Representation of chemistry and machine learning: what do X1, X2, and X3 mean?

 Modelling doesn't always go well, and the model is lousy at predicting the experimental value (yellow).
Machine learning in chemistry, or multivariate statistics, or chemometrics, is a field that uses computational and mathematical methods to find patterns in data. If you use them right, you can correlate those patterns (features) to a dependent variable, allowing you to predict that variable from the features. Example: if you know a molecule has a carboxylic acid group, then it is more acidic.

The patterns (features) and the correlation need to be established. An overfitted model will say: if it is this molecule then the pKa is that, but if it is that molecule then the pKa is such. An underfitted model will say: if there is an oxygen, then the compound is more acidic. The fields of chemometrics and cheminformatics have a few decades of experience in hitting the right level of fit. But that's a lot of literature. It literally took me a four-year PhD project to get some grip on it (want a print copy for your library?).
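To make the overfitted/underfitted contrast concrete, here is a minimal sketch. The three "models" are hypothetical toy functions, not real chemometrics methods, and the pKa values are approximate textbook values used only for illustration.

```python
# Toy training set; pKa values are approximate and for illustration only.
TRAINING = {
    "acetic acid": {"has_oxygen": True, "has_carboxylic_acid": True, "pKa": 4.8},
    "ethanol": {"has_oxygen": True, "has_carboxylic_acid": False, "pKa": 15.9},
}


def overfitted(name):
    """Memorises molecules: exact on training data, silent on anything new."""
    entry = TRAINING.get(name)
    return None if entry is None else entry["pKa"]


def underfitted(features):
    """One coarse rule: any oxygen means 'more acidic'."""
    return "more acidic" if features["has_oxygen"] else "less acidic"


def reasonable(features):
    """A rule at the level of the functional group."""
    return "more acidic" if features["has_carboxylic_acid"] else "less acidic"


# A molecule that was not in the training set.
propanoic_acid = {"has_oxygen": True, "has_carboxylic_acid": True}

print(overfitted("propanoic acid"))      # the lookup has nothing to say
print(underfitted(TRAINING["ethanol"]))  # the oxygen rule cannot tell ethanol from an acid
print(reasonable(propanoic_acid))        # the functional-group rule generalises
```

The lookup table cannot say anything about a molecule it has not seen, while the oxygen rule cannot separate an alcohol from an acid; the functional-group rule sits in between.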

But basically all methods work like this: if X is present, then... Whether X is numeric or categorical, X is used to make decisions. Also, X is rarely the chemical itself, which is a cloud of nuclei and electrons. Instead, it is a representation of it. And that's where one of the difficulties comes in:
1. one single but real molecular aspect can be represented by X1 and X2
2. two real molecular aspects can be represented by X3
Ideally, every unique aspect has a unique X to represent it, but this is sometimes hard with our cheminformatics toolboxes. As studied in my thesis, this can be overcome by the statistical modelling, but there is some interplay between the representation and modelling.
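The two difficulties can be sketched with a toy encoding. Everything here is an assumption made for illustration: the `featurise` and `contains_oxygen` helpers are hypothetical, and the R/S labels and functional groups simply stand in for "real molecular aspects".

```python
def featurise(labels, vocabulary):
    """One-hot encode a set of labels against a fixed, ordered vocabulary."""
    return [1 if label in vocabulary_item_set(labels) else 0
            for label in vocabulary] if False else [
        1 if label in labels else 0 for label in vocabulary
    ]


def vocabulary_item_set(labels):
    """Unused helper kept trivial; the encoding only needs set membership."""
    return set(labels)


# Difficulty #1: one real aspect, two features (X1, X2).
# Assume both molecules have the same 3D arrangement around the stereocentre,
# but the labelling scheme gives them different labels.
vocab = ["R", "S"]                 # X1, X2
mol_a = featurise({"R"}, vocab)    # [1, 0]
mol_b = featurise({"S"}, vocab)    # [0, 1]


# Difficulty #2: two real aspects, one feature (X3).
def contains_oxygen(groups):
    """A single 'has oxygen' flag for two chemically different groups."""
    return [1 if groups & {"hydroxyl", "carboxylic acid"} else 0]


alcohol = contains_oxygen({"hydroxyl"})
acid = contains_oxygen({"carboxylic acid"})

print(mol_a, mol_b)    # same aspect, different feature vectors
print(alcohol, acid)   # different aspects, same feature vector
```

In the first case the model sees a difference that is not there; in the second it cannot see a difference that is.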

So, how common are difficulties #1 and #2? Well, I was discussing #1 with a former collaborator at AstraZeneca in Sweden last Monday: we were building QSAR models including features that capture chirality (I think it was a cool project) and we wanted to use R, S chirality annotation for atoms. However, it turned out this CIP model suffers from difficulty #1: even if the 3D distribution of atoms around a chiral atom (yes, I saw the discussion about using such words on Twitter, but you know what I mean) does not change, in the CIP model, a remote change in the structure can flip the R to an S label. So, we have the exact same single 3D fragment, but an X1 and X2.

 Source: Wikipedia, public domain.
Noel seems to have found in canonical SMILES another example of this. I had some trouble understanding the exact combination of representation and deep neural networks (DNN), but the above is likely to apply. A neural network has a certain number of input neurons (green, in image) and each neuron studies one X. So, think of them as neuron X1, X2, X3, etc. Each neuron has weighted links (black arrows) to intermediate neurons (blueish) that propagate knowledge about the modeled system, and those are linked to the output layer (purple), which, for example, reflects the predicted pKa. By tuning the weights the neural network learns what features are important for what output value: if X1 is unimportant, it will propagate less information (low weight).

So, it immediately visualizes what happens if we have difficulty #1: the DNN needs to learn more weights without getting more informative data (with a higher chance of overfitting). Similarly, if we have difficulty #2, we still have only one set of paths from a single green input neuron to that single output neuron; one path to determine the outcome of the purple neuron. If trained properly, it will reduce the weights for such nodes, and focus on other input nodes. But the problem is clear too: the two original real molecular aspects cannot be seriously taken into account.
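A minimal sketch of both effects, assuming a network with one hidden layer of three neurons and a single output. The weights are arbitrary numbers picked for illustration, not trained values, and `forward` and `n_weights` are hypothetical helpers.

```python
import math


def forward(x, w_hidden, w_out):
    """One hidden layer of tanh neurons feeding one linear output neuron."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))


def n_weights(n_inputs, n_hidden):
    """Number of weights: input-to-hidden links plus hidden-to-output links."""
    return n_inputs * n_hidden + n_hidden


# Difficulty #1: splitting one aspect over X1 and X2 adds weights to learn,
# but no new information about the molecules.
extra = n_weights(2, 3) - n_weights(1, 3)
print(extra)

# Difficulty #2: two different molecules share the same single input X3,
# so any set of weights gives them the same prediction.
w_hidden = [[0.5], [-1.0], [2.0]]   # arbitrary, untrained weights
w_out = [1.0, 0.3, -0.7]
alcohol = [1]
acid = [1]
print(forward(alcohol, w_hidden, w_out) == forward(acid, w_hidden, w_out))
```

No amount of training changes the second result: identical inputs always give identical outputs.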

What does that mean for Noel's canonical SMILES questions? I am not entirely sure, as I would need to know more about how the SMILES is translated (fed into) the green input layer. But I'm reasonably sure that it involves the two aforementioned difficulties; sure enough to write up this reply... Back to you, Noel!

Saturday, June 02, 2018

Supplementary files: just an extra, or essential information?

The Narrative
Journal articles have evolved into an elaborate dissemination channel focusing on the narrative of the new finding. Some journals focus on recording all the details to reproduce the work, while others focus just on the narrative, overviews, and impact. Sadly, there are no databases that tell you which journal does what.

One particularly interesting example is Nature Methods, a journal dedicated to scientific methods... one would assume the method to be an important part of the article, right? No, think again, as seen in the PDF of this article (doi:10.1038/s41592-018-0009-z):

Supplementary information
An intermediate solution is the supplementary information part of publications. Supplementary information, also called Additional files, is a way for authors to provide more detail. In a proper Open Science world, it would not exist: all the details would be available in data repositories, databases, electronic notebooks, source code repositories, etc, etc. The community is moving in that direction, and publishers are slowly picking this up (slowly? Yes, more than 7 years ago I wrote about the same topic), but supplementary information remains.

There are some issues with supplementary information (SI), which I will discuss below. But let me first say that I do not like this idea of supplementary information at all: content is either important to the paper, or it is not. If it is just relevant, then just cite it.

 I am not the only one who finds it inconvenient that SI is not an integral part of a publication.
Problem #1: peer review
A first problem is that SI is not an integral part of the publication. Those journals that still do print, and there likely lies the origin of the concept, do not want to print all the details, because it would make the journal too thick. But when an article is peer-reviewed, content in the main article is better reviewed than the SI. Does this cause problems? Sure! Remember this SI content analysis of Excel content (doi:10.1186/s13059-016-1044-7)?

One of the issues here is that publishers do not generally have computer-assisted peer-review / authoring tools in place; tools that help authors and reviewers get a faster and better overview of the material they are writing and/or reviewing. Excel gene ID/data issues can be prevented, if only we wanted to. Same for SMILES strings for chemical compounds, see e.g. How to test SMILES strings in Supplementary Information, but there are much older examples of this, e.g. by the Murray-Rust lab (see doi:10.1039/B411033A).

Problem #2: archiving
A similar problem is that (national/university) libraries do not routinely archive the supplementary information. I cannot go to my library and request the SI of an article that I can read via my library. It gets worse: a few years ago I was told (by someone with inside info) that a big publisher does not guarantee that SI is archived at all:

I'm sure Elsevier has something in place now (which I assume to be based on Mendeley Data), as competitors have too, using Figshare, such as BioMed Central:

But mind you, Figshare promises archival for a longer period of time, but nothing that comes close to what libraries provide. So, I still think the Dutch universities must handle such data repositories (and databases) as an essential component of what to archive long term, just like books.

Problem #3: reuse
A final problem I would like to touch upon very briefly is reuse. In cheminformatics it is quite common to reuse SI from other articles. I assume mostly because the information in that dataset is not available from other resources. With SI increasingly available from Figshare, this may change. However, reuse of SI by databases is rare and also not always easy.

But this needs a totally different story, because there are so many aspects to reuse of SI...

Friday, May 25, 2018

Silverbacks and scientific progress: no more co-authorship for just supervision

 A silverback gorilla. CC-BY-SA Raul654.
Barend Mons (GO FAIR) frequently uses the term silverback to refer to more senior scientists effectively (intentionally or not) blocking progress. When Bjoern Brembs posted on G+ today that Stevan Harnad proposed to publish all research online, I was reminded of Mons' gorillas.

My conclusion is that every senior scholar (after PhD) is basically a silverback. And the older we get the more back we become, and the less silver. That includes me, I'm fully aware of that. I'm trying to give the PhD candidates I am supervising (Ryan, Denise, Marvin) as much space as I can and focus only on what I can teach them. To be fair, I am limited in that too: grants put pressure on what the candidates must deliver on top of the thesis.

The problem is the human bias that we prefer to listen to more senior people. Most of us fail that way. It takes great effort to overcome that bias. Off topic, that is one thing which I really like about the International Conference on Chemical Structures that starts this Sunday: no invited speakers, no distinction between PhD candidates and award winners (well, we get pretty close to that); also, organizers and SAB members never get an oral presentation: the silverbacks take a step back.

But 80% of the innovation and discovery we do is progress that is hanging in the air. Serendipity and the availability of and access to the right tools (which explains a lot of why "top" universities stay "top") introduce some bias as to who is lucky enough to find it. It's privilege.

No more co-authorship for just supervision
Besides the many other things that need serious revision in journal publishing (really, we're trying that at J.Cheminform!), one thing is that we must stop being co-author on papers just for being the supervisor: if we did not contribute practical research, we should not be co-author.

Of course, the research world does not work like that. Because people count articles (rather than seeing what research someone does); we value grant acquisition more than doing research (only 20% of my research time is still research, and even that small amount takes great effort). And full professors are judged on the number of papers they publish, rather than the amount of research done by people in his group. Of course, the supervision is essential, but that makes you a great teacher, not an active researcher.

BTW, did you notice that Nobel prizes always seem to be awarded to the last authors of the papers describing the work, and the award never seems to mention the first author?

Disclaimer
BTW, noticed how sneakingly the gender-bias sneaked in? Just to make clear, female scholars can be academic silverbacks just as well!

Sunday, March 25, 2018

SPLASHes in Wikidata

A bit over a year ago I added EPA CompTox Dashboard IDs to Wikidata. Considering that an entry in that database means that there is likely something known about the adverse properties of that compound, the identifier can be used as a proxy for that. Better, once the EPA team starts supporting RDF with a SPARQL end point, we will be able to do some cool federated queries.

For metabolomics the availability of mass spectra is of interest for metabolite identification. A while ago the SPLASH was introduced (doi:10.1038/nbt.3689), and adopted by several databases around the world. After the recent metabolomics winterschool it became apparent that it is now sufficiently adopted to be used in Wikidata. So, I proposed a new SPLASH Wikidata property, which was approved last week (see P4964). The MassBank of North America (MoNA; Fiehn's lab) team made available, as CCZero, a mapping that links the InChI of each compound with the SPLASH identifiers of the spectra for that compound.

So, over the weekend I pushed some 37 thousand SPLASHes into Wikidata :)

This is for about 4800 compounds.

Yes, technically, I used the same Bioclipse script approach as with the CompTox identifiers, resulting in QuickStatements. Next up are SPLASHes from Chalk's aforementioned OSDB.

Wednesday, February 21, 2018

When were articles cited by WikiPathways published?

 Number of articles cited by curated WikiPathways, using data in Wikidata (see text).
One of the consequences of the high publication pressure is that we cannot keep up with converting all those facts into knowledge bases. Indeed, publishers, and journals more specifically, do not care so much about migrating new knowledge into such bases. Probably this has to do with the business: they give the impression they are more interested in disseminating PDFs than disseminating knowledge. Yes, sure there are projects around this, but they are missing the point, IMHO. But that's the situation and text mining and data curation will be around for the next decade at the very least.

That makes the up-to-dateness of any database pretty volatile. Our knowledge extends every 15 seconds [0,1] and extracting machine readable facts accurately (i.e. as the author intended) is not trivial. Thankfully we have projects like ContentMine! Keeping database content up to date is still a massive task. Indeed, I have an (electronic) pile of 50 recent papers of which I want to put facts into WikiPathways.

That made me wonder how WikiPathways is doing. That is, in which years were the articles published that are cited by pathways from the "approved" collection (the collection of pathways suitable for pathway analysis)? After all, if it does not include the latest knowledge, people will be less eager to use it to analyse their excellent new data.

Now, the WikiPathways RDF only provides the PubMed identifiers of cited articles, but Andra Waagmeester (Micelio) put a lot of information in Wikidata (mind you, several pathways were already in Wikidata, because they were in Wikipedia). That data is currently not complete. The current number of cited PubMed identifiers (~4200) can be counted on the WikiPathways SPARQL end point with:
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX cur: <http://vocabularies.wikipathways.org/wp#Curation:>

SELECT (COUNT(DISTINCT ?pubmed) AS ?count)
WHERE {
  ?pubmed a wp:PublicationReference ;
          dcterms:isPartOf ?pathway .
  ?pathway wp:ontologyTag cur:AnalysisCollection .
}
Wikidata, however, lists at this moment about 1200:
SELECT (COUNT(DISTINCT ?citedArticle) AS ?count) WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?citedArticle .
}
Taking advantage of the Wikidata Query Service visualization options, we can generate a graphical overview with this query:
#defaultView:AreaChart
SELECT (STR(SAMPLE(?year)) AS ?year)
       (COUNT(DISTINCT ?citedArticle) AS ?count)
WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?citedArticle .
  ?citedArticle wdt:P577 ?pubDate .
  BIND (year(?pubDate) AS ?year)
} GROUP BY ?year
The result is the figure given at the start (right) of this post.
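If you want the same per-year counts outside the query service's web interface, something like the following Python sketch could work. It only builds the request URL and parses a response in the standard SPARQL JSON results format; it does not call the network. The query is my own simplified variant of the one above, the helper names `build_url` and `counts_per_year` are hypothetical, and the sample response contains made-up numbers purely to show the parsing.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"

# Simplified variant of the per-year query (no chart directive).
QUERY = """
SELECT ?year (COUNT(DISTINCT ?article) AS ?count) WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?article .
  ?article wdt:P577 ?pubDate .
  BIND (YEAR(?pubDate) AS ?year)
} GROUP BY ?year ORDER BY ?year
"""


def build_url(query):
    """Build a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def counts_per_year(payload):
    """Turn a SPARQL JSON results document into a {year: count} dict."""
    rows = json.loads(payload)["results"]["bindings"]
    return {int(row["year"]["value"]): int(row["count"]["value"]) for row in rows}


# Made-up response, only to demonstrate the parsing step.
SAMPLE = json.dumps({
    "head": {"vars": ["year", "count"]},
    "results": {"bindings": [
        {"year": {"type": "literal", "value": "2001"},
         "count": {"type": "literal", "value": "3"}},
        {"year": {"type": "literal", "value": "2002"},
         "count": {"type": "literal", "value": "5"}},
    ]},
})

print(build_url(QUERY)[:60])
print(counts_per_year(SAMPLE))
```

Fetching `build_url(QUERY)` with any HTTP client and passing the response text to `counts_per_year` should then give the numbers behind the area chart.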