## Thursday, February 18, 2010

### Citing the Chemistry Development Kit

Two weeks ago, a paper by Peter Ertl was published about molecular structure input on the web (doi:10.1186/1758-2946-2-1). In this paper, he discusses the state of the field and describes his own contribution to it, the JME Molecule Editor. The article also cites the CDK, but only the website and not one of the two papers (doi:10.1021/ci025584y or doi:10.2174/138161206777585274). This is not an isolated case, but a common pattern. In principle, the proper work is cited, and nothing is wrong. In practice, it means that a citation to the CDK website does not show up in the citation network. This is not a problem caused by these papers, but by the way current citation databases work: they only count citations between journal articles, and only sometimes extend to books or conference abstracts.

Now, addressing the limitations of the current citation databases is technically simple, and blocked purely by social and commercial aspects. The Citation Typing Ontology (CiTO) by David Shotton provides a framework for defining citation types, independent of any existing database. Semantic web technologies can take it from there and allow aggregation and the like.

There are some things to think about regarding how to use such citation networks, though. If we calculate the impact of the CDK project, we should combine the citation counts for the website(s), papers, etc., after removing duplicates. The cito:cites property links to resources, and the CDK paper resource is not the same as the CDK website resource. But we could define a Project class, with both resources linked to the project via foo:partOf. Then we could define that the triple chain the:citingWork cito:cites the:CDKArticle foo:partOf the:CDKProject implies the triple the:citingWork cito:cites the:CDKProject.
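As a minimal sketch of this chaining rule, assuming triples are held as plain Python tuples: the citing works the:paperA and the:paperB and the resource the:CDKWebsite are made-up placeholders, and foo:partOf is the hypothetical property named above, not a term from an existing ontology.

```python
# Triples are plain (subject, predicate, object) tuples; all names are illustrative.
CITES = "cito:cites"
PART_OF = "foo:partOf"  # hypothetical property, as in the text above


def infer_project_citations(triples):
    """Apply the rule: X cito:cites Y and Y foo:partOf P  =>  X cito:cites P."""
    part_of = {}
    for s, p, o in triples:
        if p == PART_OF:
            part_of.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        if p == CITES:
            for project in part_of.get(o, ()):
                inferred.add((s, CITES, project))
    return inferred


def citation_count(triples, project):
    """Count distinct citing works for a project, so duplicates collapse."""
    all_triples = set(triples) | infer_project_citations(triples)
    return len({s for s, p, o in all_triples if p == CITES and o == project})


triples = {
    ("the:CDKArticle", PART_OF, "the:CDKProject"),
    ("the:CDKWebsite", PART_OF, "the:CDKProject"),
    ("the:paperA", CITES, "the:CDKArticle"),
    ("the:paperA", CITES, "the:CDKWebsite"),  # same work cites both resources
    ("the:paperB", CITES, "the:CDKWebsite"),
}

print(citation_count(triples, "the:CDKProject"))  # 2
```

Because the count is taken over distinct citing works, a paper that cites both the article and the website contributes only once to the project total.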

#### Typed Citations

Now, while writing up this blog post, I realize that my fork of this morning, A BIBO Citation Typing Ontology, might actually be counter-productive in the long run, as I was only working out a solution to a simpler, but different problem, which the CiTO also addresses: a citation is not typed. When a paper does cite the CDK paper, we still do not know whether it uses the CDK, merely mentions it as related but unused, or even treats it as refuted work.

Now, as I am leaning towards the Bibliographic Ontology (BIBO) as the RDF-based system for my references, and have already been using it in the RDF store hosting the ChEMBL data, I forked the CiTO to define rdfs:domain and rdfs:range on bibo:Document. The CiTO 1.5 actually defines a large set of document types too, and I would rather see BIBO reused.

This indeed has the downside that bibocto:cites cannot be used for the chaining described above, since its range is now bibo:Document and a project is not a document, and this might bite me seriously later. Well, nothing wrong with a failing experiment, right? For now, it will serve my purpose: setting up a citation database for the CDK project papers.

#### The CDK citation database

So, here goes (the page is RDFa-enabled; this is the RDF extracted from it):

    @prefix bibo: <http://purl.org/ontology/bibo/>.
    @prefix bibocto: <http://github.com/egonw/bibo-cto/>.

    <urn:doi:10.1186/1758-2946-2-1> a bibo:Article ;
        bibocto:cites <urn:doi:10.1021/ci025584y> .

I am not entirely happy about the error-prone XHTML+RDFa behind the above example, and posted a question on SemanticOverflow asking for a better solution.
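As an aside, one way to avoid hand-editing such statements is to generate them from a list of DOI pairs. The following is only a sketch under assumptions: the function name citations_to_turtle is invented for illustration, it emits Turtle rather than XHTML+RDFa, and it assumes the DOIs contain no characters that would need escaping inside an IRI.

```python
BIBO = "http://purl.org/ontology/bibo/"
BIBOCTO = "http://github.com/egonw/bibo-cto/"


def citations_to_turtle(pairs):
    """Serialize (citing DOI, cited DOI) pairs as Turtle statements.

    Hypothetical helper; assumes DOIs are safe to embed in an IRI as-is.
    """
    lines = [
        f"@prefix bibo: <{BIBO}> .",
        f"@prefix bibocto: <{BIBOCTO}> .",
        "",
    ]
    for citing, cited in pairs:
        lines.append(f"<urn:doi:{citing}> a bibo:Article ;")
        lines.append(f"    bibocto:cites <urn:doi:{cited}> .")
        lines.append("")
    return "\n".join(lines)


turtle = citations_to_turtle([("10.1186/1758-2946-2-1", "10.1021/ci025584y")])
print(turtle)
```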

While the above example merely records that Peter Ertl's article cites the CDK (whether that is valid or not... would he perhaps have cited the other paper?), citation typing allows me to state how the CDK paper is cited. Now, Peter states:

> It is also gratifying to see the advent of open source movement in cheminformatics on the Internet, as advocated for example by the Blue Obelisk Group (40) and witnessed by collaborative projects like Chemistry Development Kit CDK (41), Jmol (42), Bioclipse (43) and several others.

So, I think it is fair to state that:

    <urn:doi:10.1186/1758-2946-2-1> bibocto:credits <urn:doi:10.1021/ci025584y> .

which is very much appreciated!
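To illustrate what typed citations add over plain counts, here is a small sketch that groups citing works by citation type. The first two rows repeat the statements from this post; the third row and the function name citing_works_by_type are made-up placeholders.

```python
from collections import defaultdict

CDK_PAPER = "urn:doi:10.1021/ci025584y"

# (citing work, citation type, cited work)
citations = [
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:cites", CDK_PAPER),
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:credits", CDK_PAPER),
    ("urn:example:some-other-paper", "bibocto:cites", CDK_PAPER),  # placeholder
]


def citing_works_by_type(citations, cited):
    """Group the works citing `cited` by the type of citation they make."""
    by_type = defaultdict(set)
    for citing, citation_type, target in citations:
        if target == cited:
            by_type[citation_type].add(citing)
    return dict(by_type)


for citation_type, works in sorted(citing_works_by_type(citations, CDK_PAPER).items()):
    print(citation_type, len(works))
```

With such a grouping, a citation database can report not only how often the CDK paper is cited, but also how many of those citations credit it as opposed to merely mentioning it.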

1. I will stop trying to think up a perfect ontology to capture citations, and try this one. Let's see how it works out...

2. The lack of a proper citation to the CDK here is a social, not a technical problem. It's not hard to find DOI targets for software if they exist and you wish to credit them in this way, e.g. a quick Google Scholar search for the CDK gives the article, not the website.

That notwithstanding, the idea of CiTO does raise the interesting question of provenance on the semantic web.

urn:doi:10.1186/1758-2946-2-1 bibocto:credits urn:doi:10.1021/ci025584y

Says who? And how does an aggregator of citations know (a) whether the asserter is qualified to make the statement, and (b) that the asserter is who they say they are?

Not intractable problems, but they'll need to be addressed as part of a wider solution.

3. Jim, the provenance in my case comes from the fact that I make those claims. In part they rest on the paper explicitly citing the paper; in other cases, such as the paper by Peter, the claim is made by me. The provenance does indeed come into play when aggregating data, and will surely be solved (there are various initiatives, e.g. that of the Concept Web Alliance...). I am not going to worry beyond my own citation database for now.