Monday, May 11, 2009


PubChem-CDK is a project that runs CDK code on the PubChem data. As we speak, a Groovy script is reading about 100 PubChem Compounds XML entries per second into the database. Mind you, it reads the XML and not the SDF they distribute, which uses a custom extension to overcome the limits of the real MDL SDF format.

Right now, it has run the atom type perception algorithm on about 1M compounds, and has pretty good coverage of the organic chemistry domain. I will analyze the results statistically soon, but will likely use this data first to add some missing atom types to CDK 1.2.x. BTW, did you know only three carbon atoms failed? A C4- (CID:156031), a C3+ (CID:161072), and a C2+ (CID:161073). Would your cheminformatics library know what their properties are?
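To give an idea of the kind of tally involved, here is a minimal Python sketch that counts atoms whose (element, formal charge) pair is missing from a list of known types. This is not the CDK implementation: the `KNOWN_TYPES` set, the tuple representation, and the function name are assumptions made up for this illustration, and real atom type perception looks at much more (neighbours, bond orders, hybridization).

```python
from collections import Counter

# Hypothetical, deliberately tiny list of recognised atom types,
# keyed by (element symbol, formal charge). NOT the CDK atom type list.
KNOWN_TYPES = {
    ("C", 0), ("C", 1), ("C", -1),
    ("N", 0), ("N", 1),
    ("O", 0), ("O", -1),
}

def tally_unknown_types(compounds):
    """Count atoms whose (element, charge) pair is not in KNOWN_TYPES.

    `compounds` maps a compound id to a list of (element, charge) tuples.
    Returns a Counter of failing pairs and, per failing pair, the first
    compound id in which it was seen.
    """
    failures = Counter()
    examples = {}
    for cid, atoms in compounds.items():
        for atom in atoms:
            if atom not in KNOWN_TYPES:
                failures[atom] += 1
                examples.setdefault(atom, cid)
    return failures, examples

# Toy input with made-up ids, not actual PubChem records.
toy = {
    "toy-1": [("C", 0), ("C", -4)],
    "toy-2": [("C", 3), ("N", 1)],
}
failures, examples = tally_unknown_types(toy)
for atom, count in failures.items():
    print(atom, count, "first seen in", examples[atom])
```

Keeping one example id per failing type is what makes it possible to go back and look at the offending entry afterwards.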

It is a really nice way of browsing PubChem, BTW. For example, did you know there are several boron compounds which have the substructure [N+]-[B+]-[N+]? Yes, three positive charges, right next to each other. For example (CID:3612285):

Well, neither did I. How was it synthesised? What are the spectral properties? How do they stabilise it? What magic counter ion? PubChem, unfortunately, does not have links to the primary literature, and no free source for that is available. A failure in chemistry. The source points to ChemDB, but the entry in that database does not shed light on this either.
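Spotting a pattern like [N+]-[B+]-[N+] does not need a full substructure search. A rough Python sketch, assuming a molecule is given simply as a map of atom index to formal charge plus a list of bonded index pairs (both the representation and the function names are assumptions for this illustration, not CDK or PubChem API):

```python
def adjacent_like_charges(charges, bonds):
    """Return the bonds whose two atoms carry formal charges of the same sign.

    `charges` maps atom index -> formal charge (missing means 0);
    `bonds` is a list of (i, j) index pairs.
    """
    hits = []
    for i, j in bonds:
        qi, qj = charges.get(i, 0), charges.get(j, 0)
        if qi * qj > 0:  # both positive or both negative
            hits.append((i, j))
    return hits

def has_charged_triple(charges, bonds):
    """True if some atom has at least two neighbours sharing its charge sign."""
    like_neighbours = {}
    for i, j in adjacent_like_charges(charges, bonds):
        like_neighbours[i] = like_neighbours.get(i, 0) + 1
        like_neighbours[j] = like_neighbours.get(j, 0) + 1
    return any(n >= 2 for n in like_neighbours.values())

# Toy N-B-N skeleton with +1 on every atom (not the full CID record).
charges = {0: 1, 1: 1, 2: 1}
bonds = [(0, 1), (1, 2)]
print(adjacent_like_charges(charges, bonds))
print(has_charged_triple(charges, bonds))
```

A check like this only flags suspicious entries; whether a flagged structure is a depositor error or real chemistry still has to be decided by looking at it.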

Anyway, more on this later. Much more, as I plan to run many CDK algorithms on this data.


  1. I don't believe it.

    If anything, the B should be negative, as in BH4-. That would give it a single positive charge.

  2. This is a simple failing of the dative bond representation in PubChem... as funny as it looks... :-)

  3. Use of the XML data from PubChem is actually discouraged. There are data type mapping and element name scope problems in that format, and the XML-ASN.1 mapping is not guaranteed to be stable and final. If you want to do it right, parse the binary ASN.1 encoding (which is also much more compact, and faster to parse).

    That is entirely doable (the Cactvs toolkit successfully parses binary ASN.1 both for structures and assays).

    Please, no half-baked "solutions" which stop working after 6 months with the next update.

  4. Dear Anonymous, how would you explain the charges?

    Anyway, dative bonds are indeed special in PubChem, and they had to customize the SDF output because of them, making it non-standard. It is better to use the (XML) ASN.1 format.

  5. If you look closely, no dative bonds are assigned in PubChem (but supported?)... the positive charges are simple formal charge adjustments due to the bonding (4 bonds to nitrogen, requires positive nitrogen).

    It is likely how PMR suggests it was intended... BH4+ (or maybe BH2+) with two lone pair donations by NH3.

    Clearly, this is a failure of the PubChem standardization procedures to "correct" the deposited structure. A more sophisticated scheme to detect such nonsense should correct things... but in the end, you can see that this came from a single depositor and PubChem allowed it through:

    The other examples you show detail how difficult it is to represent metal alloys... where the bonding situation cannot be adequately described by the SDF format. C4- is there with Tl+ (while the synonym refers to Ti+). This was also provided by a single depositor (well ChemSpider too... but they absorb all PubChem content)...

    The situation is similar with the C3+ case... a so-called methylium ion... also from ChemIDplus.

    One can imagine that there must be a database of allowed valences being used... maybe they could (should?) edit it down further to avoid such high energy cases... I guess it is a matter of focus.

  6. Dear Wolfgang,

    not sure what you are implying here, and I am sure Cactvs is superior anyway (actually, Cactvs was the toolkit we installed in 1994/95 on our Solaris systems for drawing molecules), but do you suggest that PubChem is using half-baked solutions by putting up half-finished file formats on their website?

    I really hate to hear that PubChem is going to change their formats so often. Also, I do not understand why those changes in the ASN.1 tree would not apply to the binary format too...

  7. Hi Egon,

    I am not sure what Wolf is referring to here. The PubChem file format for structures is stable. Yes, PubChem has extensions for some aspects that do not translate (well) into SDF... but even those can be ignored (unless you want that information, e.g., ionic bond vs. covalent bond, etc.) for typical organic molecules.

    ASN.1 is a binary format. NCBI has a text-based flavor that allows it to be easily viewed... but there is no standard text format of ASN.1. The PubChem XML format is a direct (wordy IMHO) translation of the PubChem ASN.1 specification (i.e., they are equivalent in content). XML by definition is a text format. The SDF format cannot encode all the information in the PubChem ASN.1 schema, as you may well already know.

    I modified the PubChem valence dictionary (and made a couple of other minor changes) as suggested here, which will help prevent additional high energy forms of structures (e.g., two positive charges or two negative charges next to each other... {further} disallow particular charges on certain elements like C+4). The number of compounds that will be affected by these changes is tiny (less than 100 out of over 42M), most with only a single (and common :-) depositor. After a routine reprocessing of the PubChem archive, the CIDs you mentioned will become deprecated (i.e., no longer have any SID associated with them... and made non-visible by the inability to search for them). The SIDs tied to these CIDs will no longer have any CID associated with them (i.e., they will fail standardization).

    In future, when noticing things like this... please consider contacting us and making a report. You could simply copy/paste your blog into the email message... and we will get it (and respond even :-). The key thing here is to make a report! :-)

    We are very interested in useful PubChem suggestions as were highlighted in this blog... the only issue with having them just here is that someone like me has to chance upon it (e.g., via a Google alert) unless you send us an email... Imagine if this is the way you get CDK bug reports (by you reading someone else's blog! :-).


  8. Hi Evan, thanx for your reply!

    actually, blogs are indeed one way we learn about CDK bugs :) And I am fine with that. We recently had such a report about fingerprinting, which after some roundtripping turned out to be a problem in the blogger's code, not in the CDK :)

    The reason why I have not reported any problems yet is that I am not happy about the accuracy of the reporting yet... and the 'activation energy' of sending email is larger than that of a bug tracking system (which would also allow me to track the state of my bug report; highly recommended!)... but the accuracy was the primary reason.

    The Bioclipse 2.0.0 release is around the corner, allowing me to clean up the CDK atom type list, after which I will rerun the analysis, and I will report my findings then, and in more detail.