Tuesday, December 22, 2015

New Edition! Getting CAS registry numbers out of WikiData

Source: Wikipedia. CC-BY-SA

In April this year I blogged about an important SPARQL query for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:
  1. CAS works together with Wikimedia on a large, free CAS-to-structure database
  2. Wikidata is CCZero
The original effort validated about eight thousand registry numbers, made available via Wikipedia and the Common Chemistry website. However, the effort did not stop there, and Wikipedia now contains many more CAS registry numbers. In fact, Wikidata picked up many of these and now lists almost twenty thousand CAS numbers. That well exceeds what databases are allowed to aggregate and make available.

Since the post in April, Wikidata put online a new SPARQL endpoint and created "direct" property links. This way, you lose the provenance information, but the query becomes simpler:
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?id WHERE {
      ?compound wdt:P231 ?id .
    }
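For those who want to run this from a script rather than the web interface, here is a minimal Python sketch using only the standard library. It assumes the public Wikidata Query Service endpoint at https://query.wikidata.org/sparql and the standard SPARQL JSON results layout; the helper names (`build_url`, `parse_bindings`, `fetch_cas_numbers`) and the User-Agent string are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?id WHERE {
  ?compound wdt:P231 ?id .
}
"""


def build_url(query, limit=None):
    """Return the GET URL for a SPARQL query, optionally capped with LIMIT."""
    if limit is not None:
        query = f"{query.rstrip()}\nLIMIT {int(limit)}"
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return f"{ENDPOINT}?{params}"


def parse_bindings(payload):
    """Turn a SPARQL JSON result into (compound IRI, CAS number) pairs."""
    rows = payload["results"]["bindings"]
    return [(row["compound"]["value"], row["id"]["value"]) for row in rows]


def fetch_cas_numbers(limit=10):
    """Run the query over HTTP (needs network access)."""
    request = urllib.request.Request(
        build_url(QUERY, limit),
        headers={
            "Accept": "application/sparql-results+json",
            # Hypothetical identifier; replace with something describing your script.
            "User-Agent": "cas-example/0.1 (hypothetical script)",
        },
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Calling `fetch_cas_numbers(limit=5)` should return a handful of pairs of Wikidata entity IRIs and CAS registry numbers. The `LIMIT` is only there to keep a first test run small; drop it to get the full list.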
The other thing that changed since April is that others and I requested the creation of more compound identifiers, and here's an overview along with the current number of such identifiers in Wikidata:
Clearly, some identifiers are not well populated yet. This is what bots are for, like those used by the Andrew Su team.

Because there is also a predicate for SMILES, we can also create a query that puts the CAS registry number alongside the SMILES (or any other identifier):
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?compound ?id ?smiles WHERE {
      ?compound wdt:P231 ?id ;
                wdt:P233 ?smiles .
    }
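Whatever comes back from a query like this is community-edited data, so a cheap sanity check on the CAS side can be useful. CAS registry numbers end in a check digit: weight the other digits 1, 2, 3, ... from right to left, sum them, and take the result modulo 10. The sketch below implements that check in Python; the function names and the `(compound, cas, smiles)` row layout are assumptions for illustration. Passing the check only means the number is well-formed, not that it is assigned to the right compound.

```python
import re

# CAS registry numbers: 2-7 digits, 2 digits, 1 check digit, hyphen-separated.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def is_valid_cas(cas):
    """Return True if the string is well-formed and its check digit matches."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


def suspicious_rows(rows):
    """Keep only the (compound, cas, smiles) rows whose CAS number fails the check."""
    return [row for row in rows if not is_valid_cas(row[1])]
```

For example, `is_valid_cas("7732-18-5")` (water) holds because 8·1 + 1·2 + 2·3 + 3·4 + 7·5 + 7·6 = 105, and 105 mod 10 is 5.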
Of course, the question then is: are these SMILES strings valid? And, importantly, this is nothing compared to the number of chemical compounds we know about, which is currently on the order of 100 million, of which a quarter can be readily purchased:

Willighagen, E., 2015. Getting CAS registry numbers out of WikiData. The Winnower.