Wednesday, February 24, 2010

Dutch contribution to the Crystallography Open Database at risk?

The Dutch company PANalytical has made their HighScore software available (some details in this README) for use in the Crystallography Open Database.

ICDD is rumored not to be amused by the contribution of the HighScore-based search functionality, and rumored to be claiming breach of intellectual property. I have seen neither any ICDD patents nor the HighScore implementation, but clearly there is a conflict of interest.

BTW, Panton Principles endorsers may be interested in signing the petition for Open Data in crystallography too.

Monday, February 22, 2010

What IF or Article Level Metrics does not tell you...

This weekend there was the really nice Science Commons Symposium, which I virtually attended, and there is an interesting discussion at FriendFeed on article level metrics.

Now, I just reported on the CDK functionality used in published research. Linking this to impact, the CDK with 115 citations now (both papers, nice increase from 2006) is not doing badly. But the real impact goes further than the direct citations. The BRENDA enzyme database is one of the projects where CDK functionality is (was?) used, and the matching papers (doi:10.1093/nar/gkh081 and pmid:11752250) have been cited 241 times. Surely, BRENDA does very much more than just the CDK functionality it uses.

But, in my opinion it does say something about the impact of the CDK too. What do you think? Should these counts be included in the article level metrics too? I am almost tempted to even pose that those counts are more interesting than the number of blog replies...

Further statistics on the papers citing the CDK

I already gave a wordle of the titles of papers citing the first CDK paper. Below follow some additional statistics: the number of papers that use a particular CDK package (51). Now, these numbers are a bit rough, and surely any paper that uses the CDK is bound to use the IO or SMILES package too. Additionally, for 10 papers I was not sure what CDK functionality they used, so I assigned those to the root package.
org.openscience.cdk.qsar: 12 (~20%)
org.openscience.cdk: 10
org.openscience.cdk.fingerprint: 9
org.openscience.cdk.isomorphism: 6
org.openscience.cdk.similarity: 3
org.openscience.cdk.smiles: 2
org.openscience.cdk.model.builder3d: 2
org.openscience.cdk.ringsearch: 2
org.openscience.cdk.render: 1
org.openscience.cdk.structgen: 1
org.openscience.cdk.graph.matrix: 1
From this we learn what parts of the CDK are used. From the various CDK Literature blogs (#1, #2, #3, #4, and #5) I already knew that the descriptor calculation was much used, as well as the fingerprinter and the isomorphism checker, which also provides the maximum-common substructure functionality. What I was not aware of is that our 3D model builder had been used in published studies too, which was a pleasant surprise.
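As a rough check on these shares, here is a minimal Python sketch that restates the list above and expresses each count as a percentage of the 51 papers. The dictionary and function names are mine; note that the listed counts add up to 49, so the percentages against 51 are approximate at best.

```python
# Papers per CDK package, restating the list above.
package_counts = {
    "org.openscience.cdk.qsar": 12,
    "org.openscience.cdk": 10,
    "org.openscience.cdk.fingerprint": 9,
    "org.openscience.cdk.isomorphism": 6,
    "org.openscience.cdk.similarity": 3,
    "org.openscience.cdk.smiles": 2,
    "org.openscience.cdk.model.builder3d": 2,
    "org.openscience.cdk.ringsearch": 2,
    "org.openscience.cdk.render": 1,
    "org.openscience.cdk.structgen": 1,
    "org.openscience.cdk.graph.matrix": 1,
}

TOTAL_PAPERS = 51  # papers using CDK functionality, as stated in the text


def package_shares(counts, total):
    """Return {package: percentage of `total`}, rounded to one decimal."""
    return {pkg: round(100.0 * n / total, 1) for pkg, n in counts.items()}


shares = package_shares(package_counts, TOTAL_PAPERS)
for pkg, pct in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{pkg}: {package_counts[pkg]} ({pct}%)")
```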

These numbers are based on 51 papers where CDK functionality was used, but you may be aware that Web-of-Science has 84 papers citing the first CDK paper. Of these, only 78 are actually in their database (which I don't quite understand). Also, at least some 10 papers cite the CDK, but do not use it, and a few papers cite the CDK where they actually use Jmol. I also have to say that, for a curated citation database, I too often have to send in bug reports, but I cannot estimate to what extent that affects these numbers.

What does affect these numbers is that some papers do not explicitly cite the CDK through one of the two papers, but only the website, or not at all (yes, that happens, but it nicely balances out with papers citing the CDK but using Jmol :).

Well, I'm curious what the statistics will say about the second CDK paper, and the JChemPaint paper which is based on the CDK too.

Sunday, February 21, 2010

Wordle of the titles of the 20 most recent papers citing the first CDK article (as identified by WoS)

This is the Wordle after I analyzed all papers citing the first CDK 1 paper:

Clearly, NMR is now less important, though it is indeed overall one of the more important use cases of the CDK. Chemical and molecular are important terms, which makes sense considering that the molecule is the primary use case of the CDK right now.

Friday, February 19, 2010

Open Data: the Panton Principles

The announcement of the Panton Principles is the big news today, though Peter already spoke about them in May last year (see coverage on FriendFeed and Twitter). The four principles, in their short versions:
  1. When publishing data make an explicit and robust statement of your wishes.
  2. Use a recognized waiver or license that is appropriate for data.
  3. If you want your data to be effectively used and added to by others it should be open as defined by the Open Knowledge/Data Definition – in particular non-commercial and other restrictive clauses should not be used.
  4. Explicit dedication of data underlying published science into the public domain via PDDL or CCZero is strongly recommended and ensures compliance with both the Science Commons Protocol for Implementing Open Access Data and the Open Knowledge/Data Definition.
I think these are very workable next steps in Open Data, perhaps even worthy end goals. I endorse them.

Principle 1: an explicit and robust statement
This is in my opinion the most important principle. Too often you find a database with really useful data, but without any clue about what you are allowed to do with this data. Of course, I can contact the authors, get their permission, etc. They probably like it that way, and I can even understand that. However, it does not scale, and it is slow. Even worse is the situation when the original composer goes missing in action. Both are equally valid, but explicit statements just make things easier.

Principle 2: use a waiver or license appropriate for data
This principle is debatable. Very much like the BSD-vs-GPL flamewars, some like copylefting, others do not. There is an important difference though. Software has the concept of interfaces, which makes it easier to combine code under incompatible licenses, cleanly separated by these interfaces. This, for example, allows you to run proprietary software on a Linux kernel. However, data sets do not have such a concept. There is no such thing as an interface between two numbers.

This makes the concept of mixing data sets different: because there is no such interface, any mixing can only happen between compatible licenses. This is one reason behind the choice of very liberal licenses like CC0. This license, or waiver really, allows you to do anything, and most certainly, mix data sets.

And that makes things a lot easier. But then again, while these are noble goals, I would rather see people use a copyleft license than no license at all.

Principle 3: non-commercial and other restrictive clauses should not be used
I think again making things easier is the goal. The non-commercial clause is interesting, and actually likely an important one. Consider course material, a course book. Those are commercial. Some even argued that many universities themselves are actually commercial entities.

Principle 4: the public domain via PDDL or CCZero is strongly recommended
I second these choices over a mere claim that the data is public domain. The PD concept has many meanings and is not the same in every jurisdiction; in particular, there are differences between USA and EU law. Waiving these rights, which is just the same as claiming public domain, works in any jurisdiction, again making things a lot easier.

Open Data, Open Source, Open Standards are not goals
The underlying pattern of my comments must be clear: the principles make life easier. This is what Open Source and Open Standards (whatever those are) are all about too.

    The three pillars of the ODOSOS mantra are not goals, but merely the means of making life easier.

The Panton Principles certainly make life easier in Open Data, and initiatives like the Linking Open Drug Data, in which I participate, will greatly benefit from people adopting them.

The Principles do not solve all problems. There is still a lot of 'Open Data' licensed with unrecommended licenses. For example, the NMRShiftDB uses a GNU FDL license, and data from supplementary material of Open Access journal articles is likely Creative Commons licensed.

Another related initiative should certainly not go unnoticed either: Is it Open Data? is a service where you can try to resolve what the license is for one of those databases which is not quite Panton Principles compatible yet.

OK, one last thing. The Dutch government is bursting, and I want to listen to the music. With permission, I have been hacking the Panton Principles endorsement page, and injected some extra span elements, to make it easier to machine process (again, to make things easier), so you can use the following one-liner to calculate the number of people endorsing the principles:
    $ wget -O endorsed.html; xpath -q -e "//span[@class='signature']/span[@class='Country']/text()" endorsed.html | sort | uniq -c
The current count is hitting 44 now, and has not quite reached the 500 I had hoped for yet:
      1 Australia
      1 Canada
      1 Catalonia
      2 Espana
      2 France
      6 Germany
      1 Greece
      1 Italy
      1 Netherlands
      1 New Zealand
      1 Norway
      1 Poland
      1 Slovenia
      1 Sweden
      1 Switzerland
      1 The Netherlands
      9 UK
      1 U.K.
      1 United Kingdom
      1 United States of America
      9 USA
Does anyone know how we can convert this into a nice world map graphic with a few lines of code?
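Before any map can be drawn, the free-text country names need merging, as several countries show up under more than one spelling. A minimal Python sketch of that clean-up step follows; the alias table is my own assumption about which spellings belong together, not something the endorsement page states, and Catalonia is deliberately left as entered.

```python
from collections import Counter

# Output of the `sort | uniq -c` pipeline above.
RAW = """\
      1 Australia
      1 Canada
      1 Catalonia
      2 Espana
      2 France
      6 Germany
      1 Greece
      1 Italy
      1 Netherlands
      1 New Zealand
      1 Norway
      1 Poland
      1 Slovenia
      1 Sweden
      1 Switzerland
      1 The Netherlands
      9 UK
      1 U.K.
      1 United Kingdom
      1 United States of America
      9 USA
"""

# Assumed alias table (my guess at which spellings mean the same country).
ALIASES = {
    "Espana": "Spain",
    "The Netherlands": "Netherlands",
    "UK": "United Kingdom",
    "U.K.": "United Kingdom",
    "USA": "United States of America",
}


def merge_counts(raw, aliases):
    """Parse `uniq -c` style lines and sum counts per normalized country name."""
    totals = Counter()
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        count, name = line.split(None, 1)
        totals[aliases.get(name, name)] += int(count)
    return totals


totals = merge_counts(RAW, ALIASES)
for country, n in totals.most_common():
    print(f"{n:3d} {country}")
```

The merged counts could then be fed to whatever mapping tool one prefers; the total stays at 44 endorsers.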

Now, I am looking for a bar in Uppsala to write up some ideas about what specifications are :)

Thursday, February 18, 2010

Citing the Chemistry Development Kit

Two weeks ago, a paper by Peter Ertl was published about Molecular structure input on the web (doi:10.1186/1758-2946-2-1). In this paper, he discusses the state of things and describes his contribution to this field, the JME Molecule Editor. The article also cites the CDK, but only the website and not one of the two papers (doi:10.1021/ci025584y, or doi:10.2174/138161206777585274). This is not an isolated case, but a common pattern. In principle, the proper work is cited, and nothing is wrong. In practice, it means that a citation to the CDK website does not show up in the citation network. This is not a problem caused by these papers, but merely by the way current citation databases work: they only count citations between journal articles, and only sometimes extend to books or conference abstracts.

Now, addressing the limitations of the current citation databases is technically simple, and purely blocked by social and commercial aspects. The Citation Typing Ontology by David Shotton defines the framework to define citation types, independent from any existing database. The semantic web technologies will take it from there, and allow aggregation etc.

There are some things to think about on how to use such citation networks, though. If we calculate the impact of the CDK project, we should combine citation counts to the website(s), papers, etc, after removal of duplicates, etc. The cito:cites does link to resources, and the CDK paper resource is not the same as the CDK website resource. But, we could define a Project Class, where both are foo:partOf. Then, we could define that the triple chain the:citingWork cito:cites the:CDKArticle foo:partOf the:CDKProject would imply the triple the:citingWork cito:cites the:CDKProject.
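The chaining rule can be sketched in a few lines of plain Python, with triples as tuples rather than a real RDF store. This is only an illustration under assumptions: foo:partOf is the placeholder property from the text, and the:CDKWebsite and the:otherWork are hypothetical resources I added to show the removal of duplicates.

```python
CITES = "cito:cites"
PART_OF = "foo:partOf"  # placeholder property from the text, not a real vocabulary term


def infer_project_citations(triples):
    """Apply the rule: X cito:cites Y, Y foo:partOf P  =>  X cito:cites P.

    Returns only the newly inferred triples.
    """
    triples = set(triples)
    # Map each resource to the project(s) it is part of.
    projects_of = {}
    for s, p, o in triples:
        if p == PART_OF:
            projects_of.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        if p == CITES:
            for project in projects_of.get(o, ()):
                inferred.add((s, CITES, project))
    return inferred - triples


triples = {
    ("the:CDKArticle", PART_OF, "the:CDKProject"),
    ("the:CDKWebsite", PART_OF, "the:CDKProject"),   # hypothetical resource
    ("the:citingWork", CITES, "the:CDKArticle"),
    ("the:citingWork", CITES, "the:CDKWebsite"),     # cites both: counted once
    ("the:otherWork", CITES, "the:CDKWebsite"),      # hypothetical website-only citation
}

inferred = infer_project_citations(triples)
project_citers = {s for s, p, o in inferred if o == "the:CDKProject"}
print(len(project_citers))
```

Because the inferred triples live in a set, a work citing both the paper and the website counts only once towards the project.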

Typed Citations
Now, while writing up this blog, I realize that my fork of this morning, A BIBO Citation Typing Ontology, might actually be counter-productive in the long run, as I was only working out a solution to a simpler, but different problem, which the CiTO also addresses: a citation is not typed. When a paper does cite the CDK paper, we still do not know if it uses the CDK, or merely mentions it as related-but-unused, or even refuted, work.

Now, as I am leaning towards the Bibliographic Ontology as RDF-based system for my references, and have been using this already in the RDF store hosting the ChEMBL data, I forked the CiTO to define rdfs:domain and rdfs:range on bibo:Document. The CiTO 1.5 actually defines a large set of document types too, and I would rather see BIBO reused.

This indeed has the downside that the bibocto:cites cannot be used for the above chaining, and this might bite me seriously later. Well, nothing wrong with a failing experiment, right? For now, it will serve my purpose: setting up a citation database for the CDK project papers.

The CDK citation database
So, here goes (it's RDFa-enabled; check this RDF pulled out):
@prefix bibo: <>.
@prefix bibocto: <>.

<urn:doi:10.1186/1758-2946-2-1> a bibo:Article ;
  bibocto:cites <urn:doi:10.1021/ci025584y> .
I am not entirely happy about the error-prone XHTML+RDFa of the above example, and asked about a better solution on SemanticOverflow.

While the above example merely defines the citation of Peter Ertl's article to the CDK (whether that is valid or not... would he have cited the other paper perhaps?), the citation typing allows me to state how the CDK paper is cited. Now, Peter states:
    It is also gratifying to see the advent of open source movement in cheminformatics on the Internet, as advocated for example by the Blue Obelisk Group (40) and witnessed by collaborative projects like Chemistry Development Kit CDK (41), Jmol (42), Bioclipse (43) and several others.
So, I think it is fair to state that:
<urn:doi:10.1186/1758-2946-2-1> bibocto:credits <urn:doi:10.1021/ci025584y> .
which is very much appreciated!
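Once citations are typed like this, the citation database can be asked how a paper is cited rather than just how often. A minimal Python sketch, using only the two triples stated above (the function name is mine):

```python
from collections import defaultdict

CDK_PAPER = "urn:doi:10.1021/ci025584y"

# The two statements made above about Peter Ertl's article.
citations = [
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:cites", CDK_PAPER),
    ("urn:doi:10.1186/1758-2946-2-1", "bibocto:credits", CDK_PAPER),
]


def citing_works_by_type(triples, cited):
    """Group the works citing `cited` by the citation type (predicate) they use."""
    by_type = defaultdict(set)
    for subject, predicate, obj in triples:
        if obj == cited:
            by_type[predicate].add(subject)
    return dict(by_type)


by_type = citing_works_by_type(citations, CDK_PAPER)
for citation_type, works in sorted(by_type.items()):
    print(citation_type, len(works))
```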

Tuesday, February 09, 2010

ChEMBL RDF #1:SPARQL end point

In a series of SPARQL end points, I am happy to present a new Virtuoso 6.1-hosted SPARQL end point for the ChEMBL database (CC-BY-SA), at our group's new server. The server is hosting 23.8M triples, with the data based on ChEMBL 02. There is a SPARQL end point, as well as a SNORQL interface:

There are 2.4M activities, 34k papers, 517k compounds, 416k assays, associated with 7k targets (proteins etc):

I will not discuss any cool mashups with the rich data set and Bioclipse, as that is the main topic of my student Annzi's Master's project.

Sunday, February 07, 2010

RDF, Jena, Bioclipse, Eclipse, Zest: Mashups

Quite a while ago, I blogged about Zest in Bioclipse showing a bit of ONS Solubility data. I could not follow up on that until now, as I had yet to do a lot of RDF work in Bioclipse, so the screenshot back then was kind of a mockup.

Things are different now, and the Bioclipse-RDF functionality (using Jena) is released in Bioclipse 2.2 (see Semantic Web features in Bioclipse 2.2), and I got around to writing the graphical goodies for the following papers. Not submitted yet, but here's the screenshot showing an N3 file opened with a Zest-powered editor (read-only) and a plain text editor:

Average time on Site: chem-bla-ics

Cameron freeded(?) on the Spanish and Dutch sticking around longest on his site. I've never used Google Analytics for that, but it's good time spent on procrastination: it makes nice graphics:

For what it's worth, the two visitors from Cuba spent the most time on my site :) More seriously, the information on the site does not include any error bars, such as the standard deviation:

Most of us know that 4:24 for two visitors is not necessarily significantly different from 2:43 for one hundred visitors (actual numbers). Standard deviation information would have helped significantly here (pun intended :).
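To illustrate the point, a small Python sketch of the standard error of a mean from just two visits. The two durations are made up, chosen only so that their mean is 4:24 (264 seconds); Google Analytics does not report the individual values.

```python
import math
import statistics


def mean_and_standard_error(durations):
    """Return the mean and the standard error of the mean for visit durations."""
    mean = statistics.mean(durations)
    sem = statistics.stdev(durations) / math.sqrt(len(durations))
    return mean, sem


# Made-up visit durations in seconds; only their mean (264 s = 4:24) is real.
two_visitors = [30, 498]

mean, sem = mean_and_standard_error(two_visitors)
print(f"mean {mean:.0f} s, standard error {sem:.0f} s")
```

With these invented values the standard error is about as large as the mean itself, so the 2:43 (163 seconds) average of the hundred visitors falls well within the uncertainty.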

Friday, February 05, 2010

UU Cheminformatics Journal Club

Following the steps of the IU Cheminformatics Journal Club, I have started a UU Cheminformatics Journal Club:
Hi all,

after repeated questions from various people around asking about some
educational thingy on cheminformatics, and now that I have two
students I want to educate, I am starting a cheminformatics journal
club, as it is nicely called... we'll discuss cheminformatics
literature biweekly.

No worries if you skip a meeting or so, but when you join, you *must*
have read the paper. Additionally, it is expected that you attempted
to understand the paper, by looking up cited references and methods.
That said, you are not expected to spend a week of literature mining
on the paper (depending on the topic and your background, a day or two
at most will do); instead, you are required to form an opinion on the
paper. Preferably, those opinions are written down before the meeting,
to force you to actually formulate them.

During a meeting, we'll discuss the paper then, identifying strong and
weak points of the paper, and one of us (rotating) will make notes,
and write up a (public, OA) review on the paper based on our

Everyone is invited to join in. Please do let me know which dates you
will attend. Meetings are from 10-11 am.

* Thu, 11/2: Towards pharmacogenomics knowledge discovery with the
  semantic web, doi:10.1093/bib/bbn056
* Thu, 25/2: Small Molecule Subgraph Detector (SMSD) toolkit,
* Thu, 11/3: Virtual screening of bioassay data, doi:10.1186/1758-2946-1-21

As you see, the list is a mix of various subfields of molecular
chemometrics: the first is around knowledge management (RDF), the
second around chemical graph theory, and the last about statistics
(Bayes, SVM, RF). This covers about the three pillars of current
cheminformatics research.

Update: fixed link.

Wednesday, February 03, 2010

CDK 1.3.2: the changes

I promise I will write up more useful changelogs, and will actually try to do so in the excellent way Bob Hanson has been doing for Jmol: by example. For now, the following will have to do. These are the changes after release 1.3.1, which include all the changes in release 1.2.5:
  • Use the new error reporting IO API fd81efc
  • Added a new IO API for reporting file format errors. ea5f0b7
  • A new test for canonicalLabeler. I first tried in an older checkout, where it failed, but it works in master. I think we can still put the test in, more tests are better. b9db6f1
  • Unit test for bug #2944080 b0666e9
  • Added the atom-atom mapping for all atom containing the reactant molecules 68e696a
  • Removed the bond mapping from the reaction. It will only contain atom-atom mapping functionality aa3511f
  • Initiating only one time the function LonePairElectronChecker e71b52e
  • Added getExampleReactants and getExpectedProducts method for all reaction.type test. 4f5e8a1
  • The IMapping interface had a class comment which probably was a copy&paste artefact. Changed this. 05c857c
  • Fixed license info .meta file for JavaCC d9e15bb
  • Removed bit which explain how to apply the LGPL to source (fixes #2926775) 12e8e4f
  • CDKHydrogenAdder should not attempt addImplicitHydrogen for pseudo atoms in an atom container 7074cf5
  • Added unit test for adding hydrogens to IPseudoAtom, which current causes a NPE a30ca3e
  • MDLV2000Reader throws exception for query bond types 354e93f
  • MDL reading and writing and stereo bond types e335bbc
  • Added a helper method GeometryTools.getRectangle2D() to get the space occupied by an IAtomContainer da488c0
  • Reimplemented shiftContainer(IAtomContainer, Rectangle2D, Rectange2D, double) originally implemented as jchempaint-primary patch 9200bdc4d68dc8f70373a62eaec51357b680d5e6 by Stefan Kuhn: fixing the detection of overlap, and added missing unit tests 50ebfa1
  • Added IO option to allow saving aromatic SMILES 84a44e0
  • Added missing unit testing for the SMILESWriter 7d6a9b6
  • Moved Normalizer into a separate package, in reply to discussion around patch #2905749, making space for a uniform platform for structure normalization: cdk.normalize 4c49c24
  • Attached are some more license files. 47a226a
  • The log4j.jar is version 1.2.15. 834ade8
  • More completed files attached. 9e88243
  • They were incomplete, as many other files still are. 261795f
  • Fixed conflict in LICENSE file due to merge from cdk-1.2.x branch d40a679
  • Added a QA target ae661aa
  • Use local PMD and JUnit reports if available 35550bd
  • Added option to run it on just one module fcdad41
  • Added info for dependencies e4a90b1
  • Created a list, to be able to add license information 5b5e54d
  • Added missing copyright/license header 8dee40d
  • Catch a SocketException when there is no internet 5737371
  • Output where it is working on 7693308
  • Removed empty lines 53e60f9
  • Added initial license information, based on the information sent by Stefan 17b3c0c
  • [PATCH] SSSR Test f651b94
  • Bucky ball test molecule 0724464
  • Patch from Ulrich Bauer regarding ringsearch c3c9110
  • Update code example in JavaDoc reflecting the current API (fixes #2914791) 75b4457
  • Minor fixes for the RasmolColors class. d9b1312
  • New classes for Rasmol color scheme 66ca51f
  • Updated UIT matching for the single atom case so that it correctly handles queries that are plain atom containers bbc8f60
  • Updated fingerprinter to fix bug 2819557. Updated JUnit tests to take into account new fingerprints. Also cleaned up the template extractor code and regenerated fingerprints for builder3d. Also updated the build file to properly include dependency jars for the makefp3d target 6d453a1
  • Added a datafile entry for the standard module to store the VDW radii etc for the periodic table 813f45d
  • Fixed reading of SD file properties e4b7f06
  • Added unit test for a MDL SD file with mutliple data fields 97c2c19
  • Added unit test for data fields to allow to start with '>' (bug #2911300). 8e4161e
  • Added testing that properties are read from test6.sdf d3fe073
  • Updated license info of third party libraries a9c85f9
  • Fixed JavaDoc: added missing period at end of first sentence, removed useless @throws clause, added missing @cdk.bug tag b5b722b
  • Package fixing release: fixed building JavaDoc from source dist e151326
  • Added missing references file to the source dist (full and pure) 1cd2124
  • Removed source folders of Doclets, which are not part of the release, and should not be compiled for JavaDoc generation anyway dc8e5e7
  • Removed java pkg removed by the periodic table patch from the Eclipse project classpath 5c63c90
  • Made the unit test more informative 6ea3b2f
  • Added test case for bug 2819557 1ac7920
  • The AtomType(String) constructor is updated so that only formal charge is set to 0 as indicated in the Javadocs. All other fields are set ot UNSET. Javadocs were updated to make this explicit 8206e95
  • Updated canonical labeler to make use of the PeriodicTable class so that even if an input molecule was not configured we can still get a valid atomic number. This makes SMILES generation a little more robust (cf bug 2898032) 2cb55bd
  • Added OpenJavaDocCheck library (new BSD licensed) and written a custom JavaDoc checks. 02c335a
  • Additional constant 21aa28e
  • added a constant for untyped atoms a51c932
  • Updated to avoid use of deprecated StringBufferInputStream c8ec6e0
  • added a test for single-line inchi with several branches c7c92df
  • the inchi reader was written in such a way that it 1) needed a further line after the inchi=, which was not read, but needed to avoid npes 2) It could only process one branch on a level 3) it required the inchi line to start with INChI, newer versions require InChI= All this has been fixed cb5486c
  • Start angles should be different for different size rings cbdcda7
  • Sorting of containers in a AtomContainerSet 545eda2
  • Added new test class to the module suite a1f427b
  • New comparators for AtomContainer 01e8b62
  • Refactored periodic table element to be a standalone class, so independent of the data module. This is OK, since the class is really just a struct to hold PT data for a given element. As opposed to being a basis of an elemental representation. Also, this class is entirely private to this package, so it doesn't really matter what it is. Updated associated unit tests 27fc004
  • Some minor code clean up a0439ab
  • Updated to remove Symbols and all associated tests and usages. Replaced with PeriodicTable d20efc4
  • Moved PT related tests to their own package. Updated test suites d58c03e
  • Added method (and test) to get symbol from atomic number and also get element counts f93e06f
  • Updated module membership. Also made everything bu tmain PT class package private 044a4ad
  • Moved PT related classes into their own package c7f523f
  • Added a test to MoleculeSetTest, which tests that the clone() does not change the MoleculeSEt 42915c4
The matching authors (though one commit was really a patch by Ulrich Bauer) and the number of commits they made:
    47  Egon Willighagen
    15  Rajarshi Guha
    11  Stefan Kuhn
    10  Mark Rynbeek
     4  Miguel Rojas Cherto
With the obligatory note that the number of commits does not reflect the amount of work involved.

Google Translation in Gmail in action

Journal of Chemical Information and Modeling 50th Anniversary

Received a mail from Wendy Warr that the Journal of Chemical Information and Modeling is now celebrating its 50th anniversary, and that they put up a special webpage:

Congratulations to all current and past editors of the journal, and also all reviewers who helped the journal publish a lot of really good papers! I have been very happy to have been reviewing papers for the journal.

It's also very nice to see two friends from the next generation show up on the list of prolific authors: Andreas and Rajarshi. Of the older generation, I also notice Henry in that list. Congratulations to them too!