Wednesday, February 21, 2018

When were articles cited by WikiPathways published?

Number of articles cited by curated WikiPathways, using
data in Wikidata (see text).
One of the consequences of the high publication pressure is that we cannot keep up with converting all those facts into knowledge bases. Indeed, publishers, and journals more specifically, do not care much about migrating new knowledge into such bases. This probably has to do with the business: they give the impression they are more interested in disseminating PDFs than in disseminating knowledge. Yes, sure, there are projects around this, but they are missing the point, IMHO. But that's the situation, and text mining and data curation will be around for the next decade at the very least.

That makes the up-to-dateness of any database pretty volatile. Our knowledge extends every 15 seconds [0,1], and extracting machine-readable facts accurately (i.e. as the author intended) is not trivial. Thankfully we have projects like ContentMine! Keeping database content up to date is still a massive task. Indeed, I have an (electronic) pile of 50 recent papers from which I want to put facts into WikiPathways.

That made me wonder how WikiPathways is doing. That is, in which years were the articles published that are cited by pathways from the "approved" collection (the collection of pathways suitable for pathway analysis)? After all, if it does not include the latest knowledge, people will be less eager to use it to analyse their excellent new data.

Now, the WikiPathways RDF only provides the PubMed identifiers of cited articles, but Andra Waagmeester (Micelio) put a lot of information in Wikidata (mind you, several pathways were already in Wikidata, because they were in Wikipedia). That data is currently not complete. The current number of cited PubMed identifiers (~4200) can be counted on the WikiPathways SPARQL endpoint with:
    PREFIX cur: <>
    SELECT (COUNT(DISTINCT ?pubmed) AS ?count)
    WHERE {
      ?pubmed a wp:PublicationReference ;
        dcterms:isPartOf ?pathway .
      ?pathway wp:ontologyTag cur:AnalysisCollection .
    }
Wikidata, however, lists at this moment about 1200:
    SELECT (COUNT(DISTINCT ?citedArtice) AS ?count) WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArtice .
    }
Taking advantage of the Wikidata Query Service visualization options, we can generate a graphical overview with this query:
    SELECT (STR(?pubYear) AS ?year)
           (COUNT(DISTINCT ?citedArtice) AS ?count)
    WHERE {
      ?pathway wdt:P2410 ?wpid ;
               wdt:P2860 ?citedArtice .
      ?citedArtice wdt:P577 ?pubDate .
      # bind under a separate name so ?year is only introduced in the SELECT
      BIND (year(?pubDate) AS ?pubYear)
    } GROUP BY ?pubYear
The result is the figure given at the start (right) of this post.
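
If you would rather have the numbers than the chart, the same per-year count can be pulled from the Wikidata Query Service over HTTP. The following is only a minimal sketch and not how the figure was made: the function names (`build_url`, `counts_per_year`, `fetch_counts`) and the User-Agent string are made up for illustration, the query is a lightly adapted variant of the one above, and it assumes the service returns the standard SPARQL JSON results format when asked with `format=json`.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Variant of the per-year query from this post; the wdt: prefix is
# predefined on the Wikidata Query Service.
QUERY = """
SELECT ?year (COUNT(DISTINCT ?citedArticle) AS ?count) WHERE {
  ?pathway wdt:P2410 ?wpid ;
           wdt:P2860 ?citedArticle .
  ?citedArticle wdt:P577 ?pubDate .
  BIND (YEAR(?pubDate) AS ?year)
} GROUP BY ?year
"""


def build_url(query, endpoint=WDQS_ENDPOINT):
    """Build a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return endpoint + "?" + params


def counts_per_year(sparql_json):
    """Convert a SPARQL JSON results document into a {year: count} dict.

    Rows without a bound ?year (e.g. an unusable publication date)
    are skipped.
    """
    counts = {}
    for row in sparql_json["results"]["bindings"]:
        if "year" not in row:
            continue
        counts[int(row["year"]["value"])] = int(row["count"]["value"])
    return counts


def fetch_counts(query=QUERY, endpoint=WDQS_ENDPOINT):
    """Run the query against the endpoint (needs network access)."""
    request = urllib.request.Request(
        build_url(query, endpoint),
        # Hypothetical identifier; use one that describes your own script.
        headers={"User-Agent": "wikipathways-citation-years-example/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return counts_per_year(json.load(response))
```

Calling `sorted(fetch_counts().items())` should then give (year, count) pairs in chronological order, which is handy for checking how many of the cited articles are recent.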