Friday, September 07, 2007

New InChI software beta: license issues resolved and InChIKey

The IUPAC/NIST team has put out a beta of the next InChI software release:
    The principal new features of this release are:

    1. A fixed-length (25-character) condensed digital representation of the Identifier to be known as InChIKey. In particular, this will:

      • facilitate web searching, previously complicated by unpredictable breaking of InChI character
        strings by search engines
      • allow development of a web-based InChI lookup service
      • permit an InChI representation to be stored in fixed length fields
      • make chemical structure database indexing easier
      • allow verification of InChI strings after network transmission.


    2. Restructured InChI-generating software that separates key steps in its creation from an input chemical structure file. Among other uses, this allows checking of intermediate results to enable easier testing and development of InChI-based applications.

    3. Bug fixes designed to withstand malicious attempts to attack a Web server by providing a specially designed InChI string input to InChI binaries.

    We would welcome reports of your experiences with this new release and, of course, any problems.


InChIKey
I had heard about the InChIKey extension earlier, and it solves the issue some people have with the InChI: it is too long. Well, molecules can have many atoms indeed. It is important to realize that the InChIKey is not a replacement: it simply is not unique. The collision probability is calculated to be rather small, though. But clashes may occur, and judging from the above statistics they seem quite likely for the number of molecules estimated to be drug-like, which is put at ~10^60. Moreover, these are theoretical probabilities, which may not apply to the subset of molecules we actually tend to look at.

Anyway, the InChIKey is not a unique identifier, so never use it as one; that's what you need to remember.
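To get a feel for why the key cannot be treated as unique, here is a minimal sketch of the standard birthday-bound approximation for an ideal hash. The bit count passed in is a free parameter picked for illustration, not the actual size of the InChIKey hash blocks, and real molecule sets are not uniformly random, so the numbers are rough at best.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Uses the birthday approximation p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    which assumes an ideal, uniformly distributed hash.
    """
    space = 2.0 ** hash_bits
    exponent = -n_items * (n_items - 1) / (2.0 * space)
    # -expm1(x) equals 1 - exp(x), but stays accurate when x is tiny
    return -math.expm1(exponent)


if __name__ == "__main__":
    # hash_bits=64 is an illustrative value only (assumption)
    for n in (10**6, 10**9, 10**12):
        print(f"{n:.0e} structures: p ~ {collision_probability(n, 64):.3g}")
```

Under this model the probability grows roughly with the square of the number of structures, which is why a key that looks safe for a single database can still be expected to clash across all of drug-like space.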

An interesting feature is the addition of a check character, which enables some verification of typos. Nothing is said about collisions there, which exist too. And the fixed length has its virtues too. That said, it certainly helps as a sort of prefiltering. Google does a quite decent lookup of InChIs nowadays, and there is a growing amount of semantic markup of InChIs, such as microformats, RDF/RDFa, HTML @alt attributes, and embedding in PNG images, to address the issues of the InChI length.
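To illustrate what a check character can and cannot do, here is a toy sketch. It is not the InChIKey algorithm: the position-weighted mod-26 sum and the names `toy_check_char` and `is_valid` are made up for this example. It only shows that a single check letter catches many typos while different strings can still share the same check letter.

```python
import string

ALPHABET = string.ascii_uppercase  # toy scheme: uppercase letters only


def toy_check_char(body: str) -> str:
    """Position-weighted sum of letter indices, reduced mod 26 (toy scheme)."""
    total = sum((i + 1) * ALPHABET.index(c) for i, c in enumerate(body))
    return ALPHABET[total % 26]


def is_valid(key_with_check: str) -> bool:
    """True when the last character matches the checksum of the rest."""
    body, check = key_with_check[:-1], key_with_check[-1]
    return toy_check_char(body) == check


if __name__ == "__main__":
    good = "ABC" + toy_check_char("ABC")
    print(is_valid(good))              # the untouched string passes
    print(is_valid("ABD" + good[-1]))  # a single-letter typo is caught
    # Two different bodies can still share a check character:
    print(toy_check_char("AA") == toy_check_char("AN"))
```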

Two final comments, and I hope Alan, Steve, Igor, Steve and Dmitrii will pick this up:

  1. the InChIKey lost the version layer, which will cause trouble when the InChI moves to a next version (as in InChI=2/...). I would really like to see InChIKey=1/RYYVLZVUVIJVGH-UHFFFAOYAW as the key instead.
  2. an online service to validate the key using the check character would be most welcome
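The first suggestion above can be sketched as a small parser that accepts a key with or without an explicit version prefix. The `InChIKey=1/` prefix is the form proposed here, not something the beta emits, and the layout used in the pattern (14 letters, a hyphen, 8 letters, one flag letter, one check letter) is an assumption read off the example key and the 25-character length. The names `ParsedKey` and `parse_key` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedKey(NamedTuple):
    version: Optional[str]  # None when no explicit version prefix is present
    first_block: str
    second_block: str
    flag: str
    check: str


# Assumed layout: 14 letters, hyphen, 8 letters, flag letter, check letter
_KEY_RE = re.compile(
    r"^(?:InChIKey=(?:(?P<version>\d+)/)?)?"
    r"(?P<first>[A-Z]{14})-(?P<second>[A-Z]{8})(?P<flag>[A-Z])(?P<check>[A-Z])$"
)


def parse_key(text: str) -> ParsedKey:
    """Split a 25-character key into its assumed parts; no checksum is verified."""
    match = _KEY_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognisable 25-character key: {text!r}")
    return ParsedKey(match["version"], match["first"], match["second"],
                     match["flag"], match["check"])


if __name__ == "__main__":
    print(parse_key("InChIKey=1/RYYVLZVUVIJVGH-UHFFFAOYAW"))
```

With an explicit version in the prefix, a consumer would not have to infer the InChI version from anything inside the key itself.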

LGPL license
Not reported in the above announcement is the fact that this release also addresses an issue brought forward by the open source community. The license ambiguity has been resolved, and it is reported that the release now clearly states the LGPL license in the distribution as well as in the source code headers. This will make packaging for, for example, Linux distributions possible.

Modularization
One of the reasons why no Java port has been developed was the lack of modularization in the InChI software. This apparently has now been added, and I am very interested in reading about the actual modules available now. In particular, the canonicalization is interesting. The resulting atom ordering finds its use in chemoinformatics algorithms, and a standard for that is most welcome.

Maybe now is the time to develop a Java version of the software.

14 comments:

  1. I completely agree! Especially on the versioning and the uniqueness.

    Joerg

  2. There are now over 17.6 million InChI keys posted to ChemSpider. We generated these yesterday and posted them this morning.

    As I commented on the blog posting
    http://www.chemspider.com/blog/?p=125
    this does resolve the previous issues in regards to different erectile dysfunction drugs giving bigger SMILES as expected with larger InChIs (http://www.chemspider.com/blog/?p=19). Now the InChI key will be the same length but the size of the SMILE can still vary based on the nature of the chemical structure :-)

  3. Hi Database Guy,

    (weird name :)

    Did you find any clashes? From the statistics I would expect a few clashes (1.3 in 10 million, right...)?

  4. Just set up my Blogger profile and didn't think about the name showing up :-) So, dah-dah...I'm now ChemSpiderman (I'm on the web a lot)

    We will be looking for clashes this week. Higher priorities right now I'm afraid. We owe Joerg an update to the single structure deposition system to try out. It's coming...

  5. Could somebody explain to me how the InChIKey is a better idea than just agreeing a length for an MD5 sum of the whole InChI string? There are standard implementations of MD5 and using the whole string would mean that the version layer was included.

  6. Jim, including the version layer prior to the MD5 calculation has this disadvantage:

    Say we have InChI=1/foo and InChI=2/bar. Say they both create InChIKey=BLA. The key would be identical, and effectively it would be impossible to decide whether the key refers to foo or to bar.

  7. There is still version information included through the flag character. This indicates which combination of isotopic, fixedH and stereo layers were included in the InChI. For version 1, the flag takes values A-H, for version 2, I-P, and for version 2+ Q-X. The full table of values is in the release notes. However, I'd agree that InChIKey=1/... seems like a better way to go about it.

  8. Egon, I understand the problems that arise from hash collisions, but from what you're saying InChIKey is basically a digest anyway. Does including the version layer really increase the chance of a collision considerably?

  9. Sam, thanx for those details. That is useful. It does not allow for so many InChI versions, but that is, I guess, not intended anyway. Yes, I would prefer a much more obvious layer indication.

    Jim, it is not so much a problem of decreasing clash probability, as it is a problem of converting the key back to an InChI. I could look up the key in translation tables and find the InChI to which the InChIKey corresponds.

    Now, my worry was that I could not do this, as I overlooked the version info available from one of the chars. So, if this character is in the range A-H, then I should look at an InChI=1/... table, if I-P, then InChI=2/... etc. That should do for a while.

    Using InChIKey=1/... would make the correspondence clearer.

  10. Ah, I see. So instead of InChIKey, how about something like TinyURL for InChIs, where there's a convenient URI for each InChI (short, digest based, non-semantic, conveniently embeddable in text, useful for semantic web etc).

    There would need to be a lookup service to find the URL for each InChI, and doing GET on the InChIURL would return a very small chunk of CML containing the InChI, the InChI in text or a chunk of neatly marked up XHTML that made it clear where the InChI is.

    It has the disadvantage that you couldn't algorithmically calculate the InChIURL from the InChI, but it is convenient in text and has the added benefit of being more useful than a string literal for semantic web applications.

  11. PS... I forgot to add, since there would be a centralized point for assigning InChIURL (or an agreed protocol for dealing with collisions), they would be unique. The problem with InChIKey isn't so much that collisions can happen, it's that you don't know when they'll collide.

  12. Jim, I agree with the URL4InChI, and have proposed in a different blog item to use rdf.openmolecules.net for resolving InChIs.

    This service can easily be extended for InChIKey support, and I will do this shortly.

    BTW, a good estimate of collision properties could be to randomly generate a lot of molecular structures and generate a huge database of InChIKeys. I will try to set something up for that using the CDK next week, when I'll be in Ulm.

  13. Sorry I didn't catch the URL4InChI stuff at the time, Egon, I was away around that time. I would have voiced a preference for making them 'proper' URLs, i.e. including the InChI (or info URI) before the ?

    It will be interesting to see practical collision rates for InChIKey, but I wonder whether the benefits of InChIKey are really needed. I've blogged about this at http://wwmm.ch.cam.ac.uk/blogs/downing/?p=126 (it's a bit long to include here).

  14. A number of web services exposing InChI-related capabilities have been provided this evening at

    http://www.chemspider.com/inchi.asmx

    The services include the ability to search for the appropriate ChemSpider ID based on the InChI string and InChIKey.

    Further comments are available at:

    http://www.chemspider.com/blog/?p=135

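The flag-character ranges quoted in comment 7, and the per-version translation tables discussed in comment 9, can be sketched as follows. The ranges are taken from that comment as written. The names `version_from_flag`, `TABLES` and `lookup` are hypothetical, and the tables are plain in-memory dictionaries standing in for whatever lookup service would really hold the data.

```python
from typing import Dict, Optional


def version_from_flag(flag: str) -> str:
    """Map the flag letter to the InChI version range quoted in comment 7."""
    if len(flag) != 1:
        raise ValueError("flag must be a single character")
    if "A" <= flag <= "H":
        return "1"
    if "I" <= flag <= "P":
        return "2"
    if "Q" <= flag <= "X":
        return "2+"
    raise ValueError(f"flag {flag!r} is outside the A-X range")


# Hypothetical translation tables, one per version, mapping key -> InChI
TABLES: Dict[str, Dict[str, str]] = {"1": {}, "2": {}, "2+": {}}


def lookup(key: str, flag: str) -> Optional[str]:
    """Pick the table from the flag letter, then look the key up in it."""
    return TABLES[version_from_flag(flag)].get(key)


if __name__ == "__main__":
    # Placeholder entry reusing the made-up values from comment 6
    TABLES["1"]["BLA"] = "InChI=1/foo"
    print(lookup("BLA", "A"))
```

A key whose flag falls in A-H is only ever searched in the version 1 table, which is the disambiguation the thread settles on in the absence of an explicit InChIKey=1/... prefix.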