Monday, September 30, 2013

OpenTox Europe 2013 presentation: "The Open PHArmacological Concepts Triple Store"

On behalf of the Open PHACTS project, I presented the project today at the OpenTox Europe 2013 meeting in Mainz. The session was about data management and analysis, and was chaired by Dr. Nina Jeliazkova. Actually, I ended up in the same session as my colleague Martina Kutmon, who gave a really nice presentation on PathVisio 3. Anyway, my slides looked like this and were based on earlier presentations from Open PHACTS colleagues, Gerhard Ecker and Chris Evelo in particular:

In the afternoon there were workshops where I presented Bioclipse-OpenTox, and in particular the scripting side of it. You can read that tutorial here.

Friday, September 27, 2013

Urgent Open Science needs for Drug Discovery: pKa and logP

There is quite some discussion right now on Open Source Drug Discovery, and questions about what is Open Source and what is not. But as I made clear yesterday, I do not think that a project that requires the assignment of specific rights independent of Open licenses is the way forward, and in many cases it is not even possible. In my humble opinion, an #openscience approach is critical. Hiding data pending an (open) patent does not work for me; I'm sorry. Not that I am against patents in general (primarily, the patent system is broken and misused, but the idea has merits)...

Instead, I very much prefer to focus on solutions. Like the CDK, Bioclipse, BODR, CML, and many other Blue Obelisk tools. These tools are enabling drug discovery and research into computational tools to aid drug discovery. Without strings. No fuss about having to submit your precious data before you can use these tools. We contribute, we pay forward. And seriously, I love to see a Nature Chemical Biology paper and learn it uses the CDK, even if I am not a co-author on that paper, as much as I could use that in my academic career (or any of the other 75 contributors of the CDK!).

We do get back, beyond that aforementioned satisfaction. We do see other projects donate data, donate tools built on top of the CDK, to further aid the community.

But if you really like to know, here's my wishlist of things that we really urgently need: Open Data for training (statistical) models for chemical properties. In particular, I need CCZero experimental data (annotated with experimental method, error, etc) for:

  1. logP (and/or logD)
  2. pKa (please use this wiki)

We have recently seen such initiatives for melting points and solubility from Jean-Claude, Andrew Lang, Antony Williams, and others.

If you have data, please make it available as Open Data by putting it online in a machine-readable format, with proper copyright and CCZero waiver information.

Thursday, September 26, 2013

Why do databases make sharing Open Data difficult?

I tend to feel quite isolated in these matters, but they matter to me: licenses, agreements, etc. They matter because I try to be a friendly guy and respect the wishes expressed by others.

However, this puts me in a situation where I cannot join many otherwise interesting initiatives. There are many examples, but I will isolate one, for no particular reason other than that they just published an interesting paper about DMSO solubility modeling (doi:10.1021/ci400213d): the Online Chemical Database.

The training data from this solubility study is available from this website, and is listed in the abstract as freely downloadable. Well, free as in free beer. I cannot even look at the data set metadata without signing a license. So, I started reading the license, and clauses like this worry me:
    4.1 The User grants to Helmholtz Zentrum Muenchen by submitting information, data, models and structures to the Online Chemical Environment a world-wide, non-exclusive, transferable and sub licensable right to use all information data, structures and models submitted, for research, teaching and any other (including commercial) purposes.
Originating from an open, academic culture of collaboration, I am rarely the sole copyright owner of a data set. And with my busy agenda I am really not going to chase down all owners and ask them if they are willing to assign these rights to the Helmholtz Zentrum Muenchen. Do you seriously think I have nothing better to do? So, I cannot contribute data to this database. Worse, this clause is probably not compatible with Open Data licenses in general. I fully understand the intention, but you are probably paying your legal experts a lot of money, so let them do their work and explicitly allow Open Data licenses, indicating that any such clauses do not apply to such data.

BTW, comparing this clause to 4.2 is awkward too. Not giving downloaders of data sets uploaded to the database the same rights as the uploader has given you doesn't sound like being a good citizen.

Now, in no way is this database unique. Many databases I encounter, all with the best of intentions, come up with legal obstacles. Is that really what you wanted to do?

Changes in CDK 1.6 #1: some removed classes

Because John is doing great work in shaping up CDK master, we're heading towards a new stable release. Time to start writing up all the API changes.

Removal of the nonotify interface implementation

The NoNotificationChemObjectBuilder and the matching implementation classes have been removed. Please use the SilentChemObjectBuilder instead.

Removal of IMolecule and IMoleculeSet

The IMolecule interface and all implementing classes have been removed. They were practically identical in functionality to the IAtomContainer interface, except for the implication that the IMolecule was for fully connected structures only. This separation was found to be complicated, and was therefore removed. Please use the IAtomContainer interface instead.

Friday, September 13, 2013

First WD of "Encoding units and unit types in RDF using QUDT"

While the editor's draft has been online for a while, I have added some more Unit Ontology - QUDT mappings and updated the matching jQUDT library, and have now put the first Working Draft (WD) version online on the Open PHACTS specifications website, ready to accept comments (it's primarily informative, rather than normative):

A big thanks to Ralph Hodgson and the other QUDT authors for developing this ontology, and in particular for having it include a framework and the needed information to interconvert units, which we take advantage of! And I am very much looking forward to the webinar of the QUDT team for the Open PHACTS project in some 10 days.
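To make the interconversion idea a bit more concrete, here is a minimal Python sketch. It is not jQUDT and it does not read the ontology. It assumes that each unit is described by a multiplier and an offset relative to a base unit, in the linear form base = value * multiplier + offset. The table, the function names and the choice of temperature units are all hypothetical, and the constants are ordinary physical conversion factors typed in by hand rather than values copied from QUDT.

```python
# Hypothetical table: unit name -> (multiplier, offset) relative to kelvin.
# Assumed linear model: kelvin = value * multiplier + offset.
# The numbers are plain physical constants, not values read from QUDT.
TO_KELVIN = {
    "kelvin": (1.0, 0.0),
    "celsius": (1.0, 273.15),
    "fahrenheit": (5.0 / 9.0, 459.67 * 5.0 / 9.0),
}


def to_base(value, unit):
    """Convert a value in the given unit to the base unit (kelvin)."""
    multiplier, offset = TO_KELVIN[unit]
    return value * multiplier + offset


def from_base(value, unit):
    """Convert a value in the base unit (kelvin) to the given unit."""
    multiplier, offset = TO_KELVIN[unit]
    return (value - offset) / multiplier


def convert(value, source, target):
    """Convert between two units by going through the base unit."""
    return from_base(to_base(value, source), target)


if __name__ == "__main__":
    print(convert(100.0, "celsius", "fahrenheit"))
```

The point of going through a base unit is that every unit only needs its own pair of numbers, rather than one conversion rule for every pair of units.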

Thursday, September 12, 2013

Why I still prefer CML (was: #ACSIndy formats session)

At the #ACSIndy meeting there was a session on chemical formats (I hope the slides will all come online). Some key tweets (thanx to Tony for the coverage):

But I am still a fan of the Chemical Markup Language. In fact, I started using this when XML was not even standardized yet. Even CML has an SGML background. Well, to be fair, that was only months before XML made it into a recommendation, and CML followed. CML is flexible, which to some is a downside; to me it is a big advantage, as it allows me to easily extend it. It supports ontologies to do this, and is therefore one of the most machine-readable chemical formats.

Of course, a lot depends on the libraries that you are using. For reading, there are various approaches I have taken. Originally, I wrote a library (Willighagen2001) that supported the convention idea in CML, which is a pain to many. This feature is still actively used in Bioclipse and the CDK! Of course, many cheminformaticians do not care too much about explicit semantics, and the community standard is MDL molfile V2000 (does someone have exact numbers?), even though the improved V3000 update is already 30 years old (see the first tweet!).

Of course, browsing through all tweets, I think the session nicely showed some of the newer requirements, many of which required the extensions presented in this session. These extensions may have been part of the original specification (is there an overview of specification documents of all industry standards?), but in many cases these will also be conventions. E.g. a common convention used by cheminformaticians is to use the bond order type 4 in MDL V2000 molfiles to reflect aromaticity, even though the specification defines it differently.
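The bond type 4 convention can be sketched in a few lines of Python. This is a minimal illustration, not how the CDK or any other toolkit reads molfiles. It assumes the fixed-width V2000 bond line layout with three-character fields for the two atom numbers and the bond type. The function names and the hand-written ring are made up for this example. As far as the CTfile documentation goes, codes 4 to 8 are meant for substructure queries, which is why reading 4 as "aromatic" in a stored structure is a convention rather than the specified meaning.

```python
# Bond type codes from the V2000 bond block. Codes 4 to 8 are documented
# as query-only; treating 4 as "aromatic" in a stored structure is the
# convention discussed above, not the specified meaning.
BOND_TYPES = {
    1: "single",
    2: "double",
    3: "triple",
    4: "aromatic",
    5: "single or double",
    6: "single or aromatic",
    7: "double or aromatic",
    8: "any",
}


def parse_bond_line(line):
    """Read the first three 3-character fields of a V2000 bond line."""
    atom1 = int(line[0:3])
    atom2 = int(line[3:6])
    bond_type = int(line[6:9])
    return atom1, atom2, bond_type


def aromatic_by_convention(bond_lines):
    """Return the atom pairs whose bond type is 4."""
    pairs = []
    for line in bond_lines:
        atom1, atom2, bond_type = parse_bond_line(line)
        if bond_type == 4:
            pairs.append((atom1, atom2))
    return pairs


# Hand-written bond block for a six-membered ring written the "type 4" way
# (illustrative only, not taken from a real file).
RING_BONDS = [
    "  1  2  4  0  0  0  0",
    "  2  3  4  0  0  0  0",
    "  3  4  4  0  0  0  0",
    "  4  5  4  0  0  0  0",
    "  5  6  4  0  0  0  0",
    "  1  6  4  0  0  0  0",
]

if __name__ == "__main__":
    for pair in aromatic_by_convention(RING_BONDS):
        print(pair, BOND_TYPES[4])
```

A reader that follows the specification strictly could instead reject such a file or treat the bonds as query bonds, which is exactly why conventions like this need to be written down somewhere public.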

I hope all specifications of these updates and conventions will find their way to the web, with at least the rights to redistribute, allowing independent tools to properly implement these standards. (The right of modification is debatable for standards.)

Willighagen, E. L., 2001. Processing CML conventions in Java. Internet Journal of Chemistry 4, 4+.

Sunday, September 08, 2013

Next Bioclipse-OpenTox Workshop gig: Mainz, September 30

Ola and I will be giving a Bioclipse-OpenTox workshop in Mainz on Monday 30 September, during the OpenTox EU 2013 meeting. Places are filling up quickly (really :), so sign up now if you would like to learn your way around interactively assessing chemical liabilities.

The focus will likely be on the graphical user decision support interface, so here's what you would be able to do with scripting:

This workshop would also be great if you would like to learn how we use RDF for all of this!