One thing that machine readability adds is all sorts of machine processing. Validation of data consistency is one example. For SMILES strings, one of the things you can do is test whether the string parses at all. Wikidata is machine readable and, in fact, easier to parse than Wikipedia, for which the SMILES strings were validated recently in a J. Cheminformatics paper by Ertl et al. (doi:10.1186/s13321-015-0061-y).

Because I was wondering about the quality of the SMILES strings (and because people ask me about these things), I made some time today to run a test:

1. SPARQL for all SMILES strings
2. process each one of them with the CDK SMILES parser

I can do both easily in Bioclipse with an integrated script:

identifier = "P233" // SMILES

type = "smiles"

sparql = """

PREFIX wdt: <http://www.wikidata.

doi:10.15200/winn.145228.82018
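The two steps described above (query Wikidata for every SMILES string, then check each one) could be sketched in Python roughly as follows. This is not the Bioclipse script from the post, which is cut off here. The function names are made up for illustration, the `wdt:` prefix is spelled out as the standard Wikidata direct-property namespace, and the check is only a crude bracket-balance test standing in for a real parser such as the CDK one.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(prop="P233", limit=None):
    """Build a SPARQL query for all items carrying the given property (P233 = SMILES)."""
    query = (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?compound ?smiles WHERE {\n"
        f"  ?compound wdt:{prop} ?smiles .\n"
        "}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def fetch_bindings(query):
    """Run the query over HTTP and return the result rows (needs network access)."""
    url = WIKIDATA_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "smiles-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def has_balanced_brackets(smiles):
    """Crude stand-in for a parser: only checks that () and [] are balanced."""
    depth = 0        # nesting depth of '(' branches
    in_atom = False  # True while inside a '[...]' bracket atom
    for ch in smiles:
        if ch == "[":
            if in_atom:
                return False
            in_atom = True
        elif ch == "]":
            if not in_atom:
                return False
            in_atom = False
        elif ch == "(" and not in_atom:
            depth += 1
        elif ch == ")" and not in_atom:
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0 and not in_atom


def find_suspect(bindings):
    """Return (item URI, SMILES) pairs that fail the crude check."""
    suspects = []
    for row in bindings:
        smiles = row["smiles"]["value"]
        if not has_balanced_brackets(smiles):
            suspects.append((row["compound"]["value"], smiles))
    return suspects
```

Calling `find_suspect(fetch_bindings(build_query(limit=100)))` would give a short list to inspect by hand. A string that passes this check can still be invalid SMILES, so a real validation run should hand each string to an actual parser.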

In April this year I blogged about an important SPARQL query for many chemists: getting CAS registry numbers from Wikidata. This is relevant for two reasons:

1. CAS works together with Wikimedia on a large, free CAS-to-structure database
2. Wikidata is CCZero

The original effort validated about eight thousand registry numbers, made available via Wikipedia and the Common Chemistry website.
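As a rough illustration of what such a query and some light post-processing might look like in Python: the sketch below assumes the Wikidata property P231 for CAS registry numbers and the usual SPARQL JSON result layout. The function names are hypothetical. The check-digit test is the standard CAS rule and only catches typos; it is not the curation that the original validation effort performed.

```python
import re

# Query for all items with a CAS registry number (Wikidata property P231).
CAS_QUERY = """\
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?compound ?cas WHERE {
  ?compound wdt:P231 ?cas .
}
"""

# CAS numbers look like 7732-18-5: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas):
    """Return True if the string has CAS format and a correct check digit."""
    match = CAS_PATTERN.match(cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... counting from the right-most one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))


def cas_to_items(bindings):
    """Group SPARQL result rows into a {CAS number: [item URIs]} dictionary."""
    mapping = {}
    for row in bindings:
        mapping.setdefault(row["cas"]["value"], []).append(row["compound"]["value"])
    return mapping
```

A registry number that maps to more than one item in `cas_to_items`, or that fails `cas_check_digit_ok`, is a candidate for a closer look.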

Earlier this week there was a question on the WikiPathways mailing list about the webservices. There are older SOAP webservices and newer REST-like webservices, which come with this nice Swagger webfront set up by Nuno. Of course, both approaches are pretty standard and you can use them from basically any environment. Still, some personas prefer not to see technical issues: "why should I know how a car engine works". I do not think any scholar is allowed to use this argument, but alas...

Last week the BiGCaT team was present with three people (Linda, Ryan, and me) at the Semantic Web Applications and Tools 4 Life Sciences meeting in Cambridge (#swat4ls). It's a great meeting, particularly because of the workshops and hackathon. Previously, I attended the meeting in Amsterdam (gave this presentation) and Paris (which I apparently did not blog about).

Nanomaterials are quite interesting from a science perspective: first, they are materials and not so well-defined as such. They can best be described as a distribution of similar nanoparticles. That is unlike small compounds, which we commonly describe as pure materials: nanomaterials have a size distribution, surface differences, etc. But akin to the QSAR paradigm, because they are similar enough, we can expect similar interaction effects, and thus treat them as the same.

Got access to literature? Only yesterday I discovered that resolving some Nature Publishing Group DOIs does not necessarily lead to useful information. High quality metadata about literature is critical for the future of science. Elsevier just showed how creative publishers can be in interpreting laws and licenses (doi:10.1038/527413f).

So, it may be interesting to regularly check your machine readable Open Access metadata. ImpactStory helps here with their Open Access Badge.

Biology is a complex matter. The biological matter indeed involves many different chemicals in very many temporospatial forms: small compounds may be present in different charge states (proteins too, of course), tautomers, etc. Proteins may exhibit isoforms, various post-translational modifications, etc. Genes show structures we are only now starting to see: the complex structures in the nucleus have been invisible to mankind until some time ago.

Machine learning is a field of science that focusses on mathematically describing patterns in data. Chemometrics does this for chemical data. Examples are (nano)QSAR, where structural information is related to biological activity. During my PhD I studied how statistics and machine learning interact with how you computationally (numerically) represent the question.

So, you validated your list of SMILES in the paper you were planning to use (or about to submit), and you found a shortlist of SMILES strings that do not look right. Well, let's visualize them.

We all used to use the Daylight Depict tool, but this is no longer online. I previously blogged about using AMBIT for SMILES depiction (which uses various tools for depiction; doi:10.1186/1758-2946-3-18), but now John May has released a CDK-only tool, called CDK Depict.