Friday, April 17, 2009

Downloading Domoic Acid from PubChem

The identity of domoic acid has been under discussion (see here, here and here). (And I very much like the ChemSpider service that makes it easy to copy data from ChemSpider into Wikipedia ChemBoxes; cheers!)

Now, my practical in next week's CDK Workshop will use Groovy (please install it on your laptop!), and I am hacking up example scripts for the course material. I came up with this script to download the structure of domoic acid from PubChem (CID:5282253):
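
For readers who want to try the same idea without CDK, here is a minimal Python sketch. It is not the Groovy/CDK script from the course material. It assumes PubChem's PUG REST service and its SDF output rather than the XML download discussed in this post, and the function names (`sdf_url`, `download_sdf`, `count_atoms`) are made up for illustration.

```python
from urllib.request import urlopen

# Assumption: PubChem's PUG REST service, not the URL used by the CDK script.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sdf_url(cid: int) -> str:
    """Build the PUG REST URL for the SDF record of a PubChem compound ID."""
    return f"{PUG_REST}/compound/cid/{cid}/SDF"


def download_sdf(cid: int) -> str:
    """Fetch the SDF text for a CID (needs network access)."""
    with urlopen(sdf_url(cid)) as response:
        return response.read().decode("utf-8")


def count_atoms(molfile: str) -> int:
    """Read the atom count from the counts line of a V2000 molfile/SDF record."""
    lines = molfile.splitlines()
    counts_line = lines[3]        # three header lines, then the counts line
    return int(counts_line[0:3])  # the first three columns hold the atom count
```

Calling `count_atoms(download_sdf(5282253))` should then give the atom count of the downloaded record; whether hydrogens are included depends on how the record stores them.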


  1. Egon, in order to download the correct structure of domoic acid, do you have to know the exact record on PubChem? A text-based search for domoic acid gives 5 structures on PubChem, so without knowing the exact record you'd be stuck. How would you figure out what the appropriate record is on PubChem in general?

  2. Bioclipse has a free text search option, which returns the first 15 hits for the text you search on. Regarding what the real true one is... that depends on curation indeed. I will try to write up similar material for interaction with ChemSpider. Something like a Bioclipse ChemSpider plugin will be easier to do when a good programming API comes online, but something like what is shown in this blog should not be hard. Still, I would be very interested in drawing a substructure in Bioclipse and searching in ChemSpider using that.

  3. That's indeed an unsatisfying effect of large deposition databases, and I would love to see any suggestion on how to perform text searches via interfaces. Whenever it comes to searching for names in PubChem or ChemSpider you will probably get more than one hit, which needs a human to decide which one you wanted and which not - or you evaluate a second parameter of your entity afterwards.

  4. IMHO this example demonstrates serious problems with the CDK methodology. In order to set this up, you need precise and specific knowledge about:

    a) 3 import packages
    b) 1 specific reader object and its methods
    c) 1 molecule object and its attributes
    d) 1 download URL (and it reads the XML data, which is slow and not always kosher; ASN.1 data is the gold standard)

    Compare this with the much shorter, more robust and equivalent Cactvs script:

    echo "Atom count: [ens get [ens create 5282253] E_NATOMS]"


    echo "Atom count: [ens get [ens create {domoic acid}] E_NATOMS]"

  5. Some information regarding name lookups in PubChem: The presented names are not in random order. Rather, a reliability score is computed, and the name identified as probably the most trustworthy is listed first. So using the first name is usually a reasonable choice (but of course not foolproof; this is not hand-curated data).
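
The name-lookup problem raised in these comments (several hits for one name, then a second parameter to pick the right one) can be sketched in Python as follows. This is only an illustration under assumptions: it uses PubChem's PUG REST name-to-CID URL, the helper names are hypothetical, and the hit list contains placeholder entries, not real search results. Only CID 5282253 with formula C15H21NO6 refers to domoic acid.

```python
from urllib.parse import quote

# Assumption: PubChem's PUG REST service is used for the name lookup.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def name_to_cids_url(name: str) -> str:
    """Build the PUG REST URL that lists the CIDs matching a compound name."""
    return f"{PUG_REST}/compound/name/{quote(name)}/cids/JSON"


def pick_by_formula(candidates, formula):
    """Keep the CIDs of the candidates whose molecular formula matches."""
    return [cid for cid, candidate_formula in candidates
            if candidate_formula == formula]


# Placeholder hit list: (CID, molecular formula). The negative CIDs are
# invented stand-ins for the other hits a text search might return.
hits = [(5282253, "C15H21NO6"), (-1, "C2H6O"), (-2, "C6H12O6")]

# Use the molecular formula as the "second parameter" to narrow the hits.
matching = pick_by_formula(hits, "C15H21NO6")
print(matching)
```

A formula filter alone cannot separate stereoisomers, which share a formula, so a human check or a stricter identifier would still be needed in such cases.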