
Sunday, March 09, 2008

The Chemical Object Identifier; or, the freedom to identify chemicals

IUPAC chemical names, SMILES and InChIs are too long. InChIKeys are hashes and therefore not strictly unique, which matters for safety reasons (you have a 1 in 10 billion chance of blowing up your building; well, the odds are actually much, much lower than those of getting hit by Osama or friends, let alone a car). Wikipedia URIs do not cover enough chemical space.

However, we need a short identifier. Why, actually? Computers don't care about long identifiers. Systems can be integrated. A web link is easy to make. But we humans do care. A bottle on the shelf does not have an HTML interface. And you do not have a scanner to read the chemical structure from a 2D barcode (see DOI:10.1021/ci049758i).

The CAS registry number has served this purpose for a long time. For example, as used on bottles visible in this picture (copyright: CC BY-SA, Science in the Open):

Now, Anthony reported that CAS, the organization that builds the proprietary lookup service and has done an amazing job in the past, does not wish to see CAS numbers in Wikipedia curated by means of the official database, because that violates the end user agreement one has to sign before one can use the database. The blogging community reacted (here, here, here, here and here).

Personally, I agree with the CAS standpoint. It's been a proprietary database which people have been supporting financially for years, having thoughtfully signed the license agreement. So, don't complain afterwards. If you really want to, end the agreement and object to the license. I commented in the original blog:
    In 1995 I started a Dutch website on organic chemistry [1] and the CAS number was as useful as it is now, and already then we knew we were not allowed to compose a database of CAS numbers. Not sure about the legal state of that, but our university had a license; not sure if students had access, but I do not believe so. Anyway, building a substantial list of CAS numbers was not allowed. So, we looked for other means of identifying molecular structures, which led us to CML… this was around ‘96-’97 or so, at least before XML was released, and we actually started using CML when it was still in a more obscure SGML format :) Yeah, the XML recommendation was much appreciated!

    OK, so back to your blog item. You can imagine that the comment in WP by CAS does not surprise me at all; nothing really new. If they would allow this, it would set a precedent…

    The solution is, however, fairly easy. Use InChI(Key), PubChem CID, or ChemSpider CID; the latter two are on the same level as CAS numbers. CAS registry numbers are overrated. Not sure if they still hand out CAS numbers to mixtures too… (I guess not).

    Oh, and I agree with Cpt. Renault… people should really abide by legal requirements. Period. If you don’t like them, quit the legal agreement. As simple as that.

    1. http://www.woc.science.ru.nl/

Here, I tend to disagree with Will, who wrote that "They are just numbers. i.e. descriptors." The CAS number only makes sense with a (curated) lookup table, which ties it tightly to the CAS database. While theoretically you may be allowed to copy numbers from that database, the license agreement strictly disagrees with that. A court would have to decide which right takes higher importance, but my vote is on the agreement, which you thoughtfully signed. So, I tend to agree with Joerg, who wrote that "CAS number are not public domain, are they?"

An interesting bit in that blog item is the comment he left himself:
    I just realized that Peter has also commented on it. And storing 10000 CAS numbers and structures is allowed? What happens, if a journal reaches this limit? Just imagine they publish 1000 papers with 100 CAS numbers for each article? I do not get this!

Interesting indeed. This gets me back to a question I was recently confronted with: How would I use chemical literature in the current age? Well, what about this hypothetical Taverna workflow:
  • Node 1: get me a list of journals expected to contain CAS registry numbers (such as the JCIM)
  • Node 2: for each, get me all publications of the last 25 years
  • Node 3: process all articles and count cited CAS registry numbers per journal
  • Node 4: complain if count_per_journal > 10000
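
Nodes 3 and 4 of this workflow could be sketched in Python roughly as follows. This is only an illustration, not an actual Taverna workflow: the function names are made up, the article texts are assumed to have been fetched already (Nodes 1 and 2), and counting distinct numbers rather than every mention is an assumption. Candidate numbers are filtered with the CAS check digit rule (weighted digit sum modulo 10).

```python
import re

# Candidate CAS registry numbers: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")
LIMIT = 10000  # threshold from Node 4


def is_valid_cas(match):
    """Return True if the check digit matches the weighted digit sum mod 10."""
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def count_cas_per_journal(articles_by_journal):
    """Node 3: count distinct valid CAS numbers per journal.

    articles_by_journal maps a journal name to a list of article texts
    (hypothetical input; fetching them is left to Nodes 1 and 2).
    """
    counts = {}
    for journal, texts in articles_by_journal.items():
        seen = set()
        for text in texts:
            for m in CAS_PATTERN.finditer(text):
                if is_valid_cas(m):
                    seen.add(m.group(0))
        counts[journal] = len(seen)
    return counts


def journals_over_limit(counts, limit=LIMIT):
    """Node 4: list the journals whose count exceeds the limit."""
    return sorted(j for j, n in counts.items() if n > limit)
```

Calling journals_over_limit(count_cas_per_journal(...)) on the fetched articles would then give the list of journals to complain about.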

Anyway. Common agreement seems to be that we can opt to do without the CAS registry number. The PubChem ID seems a reasonable candidate, and has been suggested here and here. The ChemSpider ID could be an option too, though ChemSpider content is periodically added to PubChem.

I'd also like to bring in the suggestion of having a Chemical Object Identifier: like the DOI, the COI is a simple alpha-numerical identifier, with a one-to-one connection to the InChI, and unlike the InChIKey as unique as the InChI itself, but requiring a lookup service. And the latter I can offer: http://rdf.openmolecules.net/. It's a free (as in Open) resource, where we can provide this lookup service. It would be really easy to create a new COI whenever an InChI is passed for which no COI has been assigned yet. A PHP page to do the reverse lookup is easy too. Interested? I can have it going by the end of the month. It comes with full RDF support, so ready for the Web-NG.
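
A minimal sketch of what such a lookup service would have to do, assuming a plain sequential counter and in-memory dictionaries instead of a real database. The class name, the method names and the starting number 10000 are placeholders for illustration, not the actual rdf.openmolecules.net implementation.

```python
class CoiRegistry:
    """Toy registry mapping InChI strings to sequential identifiers."""

    def __init__(self, first_number=10000):
        self._next = first_number  # next free number (assumed start value)
        self._by_inchi = {}        # InChI -> COI number
        self._by_coi = {}          # COI number -> InChI (reverse lookup)

    def lookup_or_assign(self, inchi):
        """Return the COI for an InChI, assigning a new one if unseen."""
        coi = self._by_inchi.get(inchi)
        if coi is None:
            coi = self._next
            self._next += 1
            self._by_inchi[inchi] = coi
            self._by_coi[coi] = inchi
        return coi

    def resolve(self, coi):
        """Reverse lookup: return the InChI for a COI, or None if unknown."""
        return self._by_coi.get(coi)
```

Because every InChI gets exactly one number and numbers are never reused, the mapping stays one-to-one in both directions; the price is that the identifier means nothing without the registry.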

8 comments:

  1. How exactly is the COI an advantage over InChI? Also, since you mention that COI will be as unique as InChI strings, yet you say that it will be a simple alphanumeric identifier, does this mean that COI will be shorter than an InChI? If so, how can you ensure that a COI will be as unique as the InChI? It looks like COI would face the same problem as InChIKey (basically, you're compressing the information content of the original string in both cases).

    Or am I missing something?

  2. The COI could actually look like 1111111111-A, where the last character is a check digit (a simple checksum), starting with 10000-K. They are autogenerated.

    Getting a COI assigned would be easy: just request the InChI at rdf.openmolecules.net, which will pick the next free one for each new InChI.

  3. But what type of hashing algorithm are you using to generate the COI? How is it more unique than InChI keys?

  4. So...a tinyurl for chemicals, right?

  5. Rajarhis, there is no hashing involved. Just counting. Something like: "Hey, I haven't seen that InChI before! OK, what was the last number handed out? Add one to that, make a mental note, and hand out this number."

    Clearly, this mental note will be a database.

    The dx.doi.org equivalent URI would be something like rdf.openmolecules.org/?10000-K, but the system and the data would be open, so anyone could set up a registry mirror. Obviously, registration would be done at one server, likely maintained by the Blue Obelisk.

  6. Hmm, interesting. But won't this lead to unintelligible long strings just like InChI?

    But in the end, isn't this fundamentally the same as InChIKey? I assume that you're considering SERIAL types wrt the database, which have a finite (if very large!) range. So at some point wrap-around would occur, which is effectively the same as a hash collision.

  7. Having a service is one thing, having people use it is another.

    Three more aspects on this:

    1. I was astonished by the critical and very honest comment of Antony that he thinks the ChemspiderID (CID) is not ready as a global identifier yet! I think they already have a great service running, but I must admit that consistency is key, and ChemSpider is still not in full production mode, it is still in beta status. And I think I must agree with him on one point. Whatever service is provided, the question is how stable is it and what its long-term perspective? Having said that, I think this should be done on a larger scale with some backup from some larger societies, as suggested by Antony, like IUPAC, ACS, or CAS. Do not get me wrong here, I completely agree with the concept, but a large buy-in and long-term stability are important. So, this would first require some mails, phone calls, and discussions. Besides, I would still rather like that ChemSpider would provide such a service, of course with all input people could give.

    2. The good thing about DOIs is that you have something that can identify entries by "journal/article", like InChIKeys are something like "structure/stereoinfo". I agree that a low collision rate of any hash-key can not be beaten by a unique identifier. As said, I would go first for getting more opinions from some larger database vendors or publishers on this topic.

    3. I think the TinyURL concept might come close, but what then? Do we just get Yet Another Identifier? Here, I must again agree with Antony, if you want something reliable you have to speak to the ACS/CAS guys. You may like it or not, but this is the *only* reliable identifier and structure source at the moment. Whatever we as a community want to do, we have to compete with the actual market leader on this. If there is no buy-in from some larger trustworthy sources, there is no point in starting this exercise.

  8. Relative to Joerg's comments:

    1) Regarding the ChemspiderID (CID) being ready as a global identifier. I think the biggest issue is community acceptance. Early negative comments about ChemSpider put us in a very bad situation and we are working to earn our stripes. I think we deserve them but that's irrelevant if others don't. The system is stabilizing but we are still on home-based servers and not on a 24/7 system yet, so we are at the mercy of ISP and power outages.

    2) Regarding "ChemSpider is still not in full production mode, it is still in beta status.": it comes out of beta on the anniversary of release, or within a few days of that. So... by April 1st we will remove the beta label.

    3) Regarding "what its long-term perspective?". Only time (and resources) will tell, I am afraid.

    4) Regarding "I still rather like that ChemSpider would provide such a service, of course with all input people could give.". Me too. I'd like to get support to do so.

    5) Regarding "I think the TinyURL concept might come close, but what then?" I like TinyURL too. We've considered it. It's not difficult. And, with the concept of an InChIKey lookup/resolver (http://www.chemspider.com/blog/we-need-an-inchikey-resolver-and-we-need-it-now.html), it can be done there. The resulting dataset would need to be Open and available. We are already working on the InChIKey resolver project. This is an additional layer, but "should it be done"? Up for discussion.

    6) Regarding "if you want something reliable you have to speak to the ACS/CAS guys. You may like it or not, but this is the *only* reliable identifier and structure source at the moment." I want the walls to come down and work on building relationships and halting what has become, in many ways, a stand-off....
