Now, addressing the limitations of the current citation databases is technically simple; it is blocked purely by social and commercial factors. The Citation Typing Ontology (CiTO) by David Shotton provides a framework for defining citation types, independent of any existing database. Semantic web technologies can take it from there and allow aggregation, among other things.
There are some things to think about regarding how to use such citation networks, though. If we calculate the impact of the CDK project, we should combine the citation counts for the website(s), papers, and so on, after removing duplicates. The cito:cites property links to resources, and the CDK paper resource is not the same as the CDK website resource. But we could define a Project class, of which both are foo:partOf. Then we could define that the triple chain the:citingWork cito:cites the:CDKArticle foo:partOf the:CDKProject implies the triple the:citingWork cito:cites the:CDKProject.
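The chaining rule can be sketched in a few lines of Python, with triples as plain tuples. This is only an illustration under assumptions: foo:partOf and the the: resources are the placeholder names from the paragraph above, the:CDKWebsite and the:otherWork are made up for the example, and a real setup would more likely leave this inference to a reasoner (for instance via an OWL 2 property chain) than to hand-written code.

```python
CITES = "cito:cites"
PART_OF = "foo:partOf"  # hypothetical property, as in the text above


def infer_project_citations(triples):
    """Return the triples plus every implied (work, cito:cites, project)."""
    # Map each resource to the project(s) it is part of.
    part_of = {}
    for s, p, o in triples:
        if p == PART_OF:
            part_of.setdefault(s, set()).add(o)

    # Apply the chain: X cites A, A partOf P  =>  X cites P.
    inferred = set(triples)
    for s, p, o in triples:
        if p == CITES:
            for project in part_of.get(o, ()):
                inferred.add((s, CITES, project))
    return inferred


def citing_works(triples, target):
    """Distinct works that cite the target resource."""
    return {s for s, p, o in triples if p == CITES and o == target}


# Made-up example data: one work cites both the article and the website.
triples = {
    ("the:CDKArticle", PART_OF, "the:CDKProject"),
    ("the:CDKWebsite", PART_OF, "the:CDKProject"),
    ("the:citingWork", CITES, "the:CDKArticle"),
    ("the:citingWork", CITES, "the:CDKWebsite"),
    ("the:otherWork", CITES, "the:CDKWebsite"),
}

closed = infer_project_citations(triples)
project_citers = citing_works(closed, "the:CDKProject")
```

Because the inferred triples live in a set, a work that cites both the article and the website ends up citing the project only once, which takes care of the duplicate removal mentioned above.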
Typed Citations
Now, while writing up this blog post, I realize that my fork of this morning, A BIBO Citation Typing Ontology, might actually be counter-productive in the long run, as I was only working out a solution to a simpler, but different problem, which the CiTO also addresses: a citation is not typed. When a paper cites the CDK paper, we still do not know whether it uses the CDK, merely mentions it as related-but-unused, or even mentions it as refuted work.
Now, as I am leaning towards the Bibliographic Ontology (BIBO) as the RDF-based system for my references, and have been using it already in the RDF store hosting the ChEMBL data, I forked the CiTO to define rdfs:domain and rdfs:range on bibo:Document. The CiTO 1.5 actually defines a large set of document types too, and I would rather see BIBO reused.
This indeed has the downside that the bibocto:cites cannot be used for the above chaining, and this might bite me seriously later. Well, nothing wrong with a failing experiment, right? For now, it will serve my purpose: setting up a citation database for the CDK project papers.
The CDK citation database
So, here goes (it's RDFa-enabled; check this RDF pulled out):
@prefix bibo: <http://purl.org/ontology/bibo/>.
@prefix bibocto: <http://github.com/egonw/bibo-cto/>.
<urn:doi:10.1186/1758-2946-2-1> a bibo:Article ;
bibocto:cites <urn:doi:10.1021/ci025584y> .
I am not entirely happy about the error-prone XHTML+RDFa of the above example, and asked on SemanticOverflow for a better solution. While the above example merely defines the citation of the CDK by Peter Ertl's article (whether that is valid or not... would he have cited the other paper perhaps?), the citation typing allows me to state how the CDK paper is cited. Now, Peter states:
- It is also gratifying to see the advent of open source movement in cheminformatics on the Internet, as advocated for example by the Blue Obelisk Group (40) and witnessed by collaborative projects like Chemistry Development Kit CDK (41), Jmol (42), Bioclipse (43) and several others.
<urn:doi:10.1186/1758-2946-2-1> bibocto:credits <urn:doi:10.1021/ci025584y> .
which is very much appreciated!
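Once citations are typed like this, the database can be asked how a paper is cited, not just how often. A minimal sketch in Python, again with triples as plain tuples rather than a real RDF store: it uses only the two statements given above, and the helper name citations_by_type is made up for the illustration.

```python
from collections import defaultdict

CDK_PAPER = "urn:doi:10.1021/ci025584y"

# The two statements from the examples above.
triples = [
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:cites", CDK_PAPER),
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:credits", CDK_PAPER),
]


def citations_by_type(triples, target):
    """Group the works citing `target` by the bibocto: property used."""
    by_type = defaultdict(set)
    for s, p, o in triples:
        if o == target and p.startswith("bibocto:"):
            by_type[p].add(s)
    return dict(by_type)


summary = citations_by_type(triples, CDK_PAPER)
for citation_type, works in sorted(summary.items()):
    print(citation_type, len(works))
```

With more papers in the database, the same grouping would separate plain citations from credits and any other citation types in use.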
I will stop trying to think up a perfect ontology to capture citations, and try this one. Let's see how it works out...
The lack of a proper citation to CDK here is a social, not technical problem. It's not hard to find DOI targets for software if they exist and you wish to credit them in this way, e.g. a quick Google Scholar search for the CDK gives the article, not the website.
That notwithstanding, the idea of CITO does raise the interesting question of provenance on the semantic web.
urn:doi:10.1186/1758-2946-2-1 bibocto:credits urn:doi:10.1021/ci025584y
Says who? And how does an aggregator of citations know a) whether the asserter is qualified to make the statement b) that the asserter is who they say they are.
Not intractable problems, but they'll need to be addressed as part of a wider solution.
Jim, the provenance in my case comes from the fact that I make those claims. Partly they rest on the citing paper explicitly citing the CDK paper, and, in the case of the paper by Peter, on my own judgement. Provenance does indeed come into play when aggregating data, and will surely be solved (there are various initiatives, e.g. that of the Concept Web Alliance...). I am not going to worry beyond my own citation database for now.