
Sunday, March 23, 2008

CDK Module dependencies #2

A bit over 2 years ago I published a UML diagram showing the dependencies between CDK modules. Since then a lot of new modules have been defined, added, or factored out from the extra module (click to zoom):

These kinds of diagrams help us maintain the library and apply some design goals, as explained in the first post on this.

If one compares the two diagrams, one sees that less code depends on the data module, though clearly a lot of modules still do. Another issue that has not been properly addressed yet is that many modules still depend on the extra module, which aggregates everything that has not been assigned elsewhere.

Parallelism
This diagram also helped me use the Ant <parallel> task to compile CDK modules in parallel instead of sequentially. Multicore machines can take advantage of that and reduce the overall build time. Full parallelism is not possible, and the above diagram visualizes well that there are basically 12 sequential compilation stages, in each of which one or more modules can be compiled. Further clean-up of the module dependencies will reduce this number, and further reduce the build time on multicore machines.
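As a sketch of how such a staged build could look in the build file (the stage and module names below are illustrative, not the actual CDK build.xml):

<!-- Illustrative only: each stage compiles, in parallel, modules whose dependencies were built in earlier stages. -->
<target name="compile-stage-2" depends="compile-stage-1">
  <parallel>
    <antcall target="compile-module"><param name="module" value="data"/></antcall>
    <antcall target="compile-module"><param name="module" value="ioformats"/></antcall>
  </parallel>
</target>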

Now, graph analysis could pinpoint the most troublesome nodes, but it would not surprise me if extra were amongst them. The following items are worth looking at too:
  • why does qsar have to depend on charges?
  • why does sdg (the 2D layout code) depend on io code?
  • can isomorphism and formula be made independent of data?
  • why does reaction depend on sdg?
  • why does forcefield depend on qsaratomic?
Some of these issues are rather practical, but it is these kinds of analyses that help us clean up the CDK library.

Saturday, March 22, 2008

Be in my Advisory Board #3: JChemPaint widgets?

As promised, I am working on JChemPaint. I have progressed in cleaning up the CDK trunk/ repository by removing traces of the old JChemPaint applet and application. And, importantly, I removed the GeometryTools class that took rendering coordinates. The history here is that the original GeometryTools was renamed to GeometryToolsInternalCoordinates, but it is now available as GeometryTools again. I still have to merge Niels' additions with it, though. And I have set up a new JChemPaint trunk/ where I have moved Niels' demo editor.

The main goal for the next weeks is to further clean things up and get the new JChemPaint project further up and going. There are, however, some new choices for focus now. Bioclipse needs an SWT widget, the applet would need a Swing widget (and an application could be based on that too), while I could even create a Qt widget, so that in the foreseeable future we can have JChemPaint on our cell phones. So, might my advisory board (that can be you too) take the opportunity to advise me in these matters, and indicate what you would prefer?

The SWT Widget
For Bioclipse mostly. Bioclipse provides a perfect opportunity to replace the old JChemPaint application (not applet) with an attractive and powerful GUI.

The Swing Application
Maybe you'd rather see the old JChemPaint application reinstated, with the less attractive Swing-based GUI. I'd really suggest the Bioclipse approach, so if you pick this option please explain in the comments of this item why I should do this.

The Qt Widget
The Qt lib comes with Java support, and this might be an interesting alternative. Besides being able to make a Qt-based application, the widget would also make it easier to port JChemPaint to the cell phone and to the KDE desktop.

The Applet
The applet is important, and requires a Swing or AWT widget. Personally, I'd rather focus on the SWT widget first, as that is a place where no good alternative is available. On the applet side, we compete with the JME applet and Rich's nice applet.

I do intend to provide an applet version, but this request for advice is for setting priorities.

Wednesday, March 19, 2008

My FOAF network #5: SPARQL-ing my network

FOAF rulez: it's RDF. With RDF comes SPARQL. SPARQL needs a query engine, however. And there comes OpenRDF, which created Sesame. I have to catch the train in about 15 minutes, so will not elaborate too much, but here is some Sesame 2.0.1 work:
> create native.
Please specify values for the following variables:
Repository ID [native]: foafRepo
Repository title [Native store]: FOAF Repository
Triple indexes [spoc,posc]:
Repository created
> open foafRepo
This creates a new RDF store for me and opens it.
foafRepo> load http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf .
Loading data...
Data has been added to the repository (606 ms)
This loads my FOAF file. Now, a simple SPARQL query that finds me all friends known by someone with the nick egonw:
foafRepo> sparql

BASE <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?o
WHERE { ?s foaf:knows ?o ; foaf:nick "egonw" . }

.
Evaluating query...
+-------------------------------------+-------------------------------------+
| s | o |
+-------------------------------------+-------------------------------------+
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#HenryRzepa>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#CarstenNiehaus>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#RajarshiGuha>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#JeanClaudeBradley>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#GeoffHutchison>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#ChristophSteinbeck>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#PeterMurrayRust>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#TobiasHelmus>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#StefanKuhn>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#MartinEklund>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#JohannesWagener>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#JarlWikberg>|
| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#me>| <http://blueobelisk.sourceforge.net/people/egonw/foaf.xrdf#JeromePansanel>|
+-------------------------------------+-------------------------------------+
13 result(s) (15 ms)
Not very pretty, but rather accurate.

More SPARQL fun later. Do try this at home, but make sure to not put a period at the end of a line in your SPARQL query! :)

Tuesday, March 18, 2008

My FOAF network #4: Tabulating my publications

Richard informed me (via Planet RDF) about N3 support in Tabulator. N3 is a more compact serialization of RDF than the RDF/XML I have been using so far, but both are RDF. Now, I don't plan to use N3 for my FOAF experimenting, but two things caught my eye in the nice blog item.

First, it has a very useful tip on .htaccess which you can use to teach Apache about MIME types, even when you do not have root access. So, I added this .htaccess file to blueobelisk.sourceforge.net/people/egonw/:
AddType application/rdf+xml;charset=utf-8 .xrdf
Now, you can also access my FOAF file with the MIME type set to application/rdf+xml. And, my bibliography too. Now, the latter becomes interesting when you have Tabulator installed in your Firefox. Instead of applying the XSLT, Firefox will now show it like this:

And, in the under the hood mode it looks like:

Now, my FOAF file does not seem to work well. Not sure what goes wrong there, but given the fact that Tabulator seems to be able to recurse into referenced RDF files, I think it nicely complements what we already have.

Wow, it seems Web3.0/WebNG is really going to happen this year!

Tuesday, March 11, 2008

Sugammadex: the molecular condom

Two things I like about blogging: 1. the turnover of information; 2. the informal nature. There are more. The turnover is helped by the fact that blog items commonly: 1. are short; 2. allow easy scanning of tons of headlines; 3. are full of links if you want to know the details.

Today, my eye was caught by Sugammadex Buzz for Organon over at Lamentations on Chemistry. The reason was Organon, which is just around the corner here. They had news about a new drug.

Getting to the second reason, the informal nature: just to make sure, I checked the press release, but it was really Gaussling who called Sugammadex a molecular condom. This is funny for (at least) two reasons. First, it points (intentionally?) to the birth control drugs of Organon; second, it is right on with how the drug works.

Monday, March 10, 2008

My FOAF network #3: My publications

As promised, I'll write a bit about using the Bibliographic Ontology Specification (BIBO) over at bibliontology.com. I have written a basic XSLT to create an HTML GUI (open the RDF source in e.g. Firefox). Really basic: it only converts articles, and even assumes some conventions I found in examples in the BIBO wiki. I have not spotted a BIBO validator yet, so I am guessing a bit. The BibTeX mapping examples are under discussion, but provide some insight for those who are used to that format (JabRef users, for example).
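To give a flavour of what the conversion involves, here is a minimal sketch of an XSLT template for articles; this is not the actual bibo2xhtml.xsl, and it assumes the bibo and dc prefixes are declared on the stylesheet root:

<!-- Sketch only: render each article as a list item with its title and date. -->
<xsl:template match="bibo:Article">
  <li>
    <xsl:value-of select="dc:title"/>
    (<xsl:value-of select="dc:date"/>)
  </li>
</xsl:template>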

So, if I understood the specs enough, the following should be valid BIBO (at least it is valid RDF):
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl"
href="http://blueobelisk.sourceforge.net/people/egonw/bibo2xhtml.xsl"
?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
xmlns:bibo="http://purl.org/ontology/biblio/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
>

<bibo:Journal rdf:about="urn:issn:1471-2105">
<dc:title>BMC Bioinformatics</dc:title>
</bibo:Journal>

<bibo:Article rdf:about="http://dx.doi.org/10.1186/1471-2105-8-59">
<dc:title>Bioclipse: an open source workbench for chemo- and bioinformatics</dc:title>
<dc:date>2007-02-22</dc:date>
<dc:isPartOf rdf:resource="urn:issn:1471-2105"/>
<bibo:volume>8</bibo:volume>
<bibo:doi>10.1186/1471-2105-8-59</bibo:doi>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Ola Spjuth"/></bibo:contributor>
<bibo:position>1</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Tobias Helmus"/></bibo:contributor>
<bibo:position>2</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Egon Willighagen"/></bibo:contributor>
<bibo:position>3</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Stefan Kuhn"/></bibo:contributor>
<bibo:position>4</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Martin Eklund"/></bibo:contributor>
<bibo:position>5</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Johannes Wagener"/></bibo:contributor>
<bibo:position>6</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Peter Murray-Rust"/></bibo:contributor>
<bibo:position>7</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Christoph Steinbeck"/></bibo:contributor>
<bibo:position>8</bibo:position>
</bibo:Contribution>
</bibo:contribution>

<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor><foaf:Person foaf:name="Jarl Wikberg"/></bibo:contributor>
<bibo:position>9</bibo:position>
</bibo:Contribution>
</bibo:contribution>

</bibo:Article>

</rdf:RDF>
There are some things notable about this markup:
  1. It is very verbose, even by XML standards!
  2. It's RDF from the ground up
  3. It reuses many other ontologies
Particularly, the authors section is very verbose. However, it also nicely reuses FOAF here. This makes it really powerful. For example, I could have used this bit:
<bibo:contribution>
<bibo:Contribution>
<bibo:role rdf:resource="http://purl.org/ontology/bibo/roles/author" />
<bibo:contributor rdf:resource="http://blueobelisk.sourceforge.net/people/egonw/foaf.xml#me"/>
<bibo:position>3</bibo:position>
</bibo:Contribution>
</bibo:contribution>
This would semantically link this publication to whatever information I have on myself published in my FOAF file.

Now, the reason I have not done this yet is that the XSLT did not properly load the XML from my FOAF file:
<xsl:variable name="foafURI" select="substring-before(bibo:contributor/@rdf:resource, '#')"/>
<xsl:variable name="authorID" select="substring-after(bibo:contributor/@rdf:resource, '#')"/>
<xsl:variable name="foafDoc" select="document($foafURI)"/>
<xsl:value-of select="$foafDoc//foaf:Person[@rdf:ID=$authorID]"/>
The XSLT processor xsltproc (version 1.1.22 on Ubuntu 8.04) gives this error: warning: failed to load external entity "http://blueobelisk.sourceforge.net/people/egonw/foaf.xml". But if I make the URI relative, it does work, both with xsltproc and with Firefox online.
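For illustration, the relative variant that does work looks something like this; the file name foaf.xml is an assumption about the local layout:

<!-- Sketch: a relative URI (resolved against the stylesheet location) instead of the full http:// URI. -->
<xsl:variable name="foafDoc" select="document('foaf.xml')"/>
<xsl:value-of select="$foafDoc//foaf:Person[@rdf:ID=$authorID]"/>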

Another reason not to do it like that is that one loses control of the citation content. What I will do soon is use this set-up, making researcherid.com obsolete (see also these three blogs):
<bibo:contributor>
<foaf:Person foaf:name="Egon Willighagen">
<rdfs:seeAlso rdf:resource="http://blueobelisk.sourceforge.net/people/egonw/biblio.xml#me"/>
</foaf:Person>
</bibo:contributor>
Just in case you are wondering, "why the *** does he not simply use BibTeX?", the answer is RDF. No RDF, no SPARQL, no GLORY. Just think how easy it will become to run queries like:
  • which of those I have published with, run a blog
  • which of those I have published with are going to that conference in Boston in September?
  • which of those I have published with have friends who published about topics around these keywords
  • etc...
All that becomes very easy now.
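For a flavour of the first question, here is a minimal SPARQL sketch. It assumes my bibliography and my co-authors' FOAF data end up in one repository, with the contributor resources linked to persons that carry a foaf:weblog; the bibo prefix is the one declared in the example above:

# Which of my co-authors run a blog? (sketch, untested)
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX bibo: <http://purl.org/ontology/biblio/>

SELECT DISTINCT ?name ?blog
WHERE {
  ?article a bibo:Article ;
           bibo:contribution ?contrib .
  ?contrib bibo:contributor ?person .
  ?person foaf:name ?name ;
          foaf:weblog ?blog .
}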

BTW, this is how I link to my bibliography from my FOAF:

<foaf:publications rdf:resource="http://blueobelisk.sourceforge.net/people/egonw/biblio.xml"/>

Sunday, March 09, 2008

The Chemical Object Identifier; or, the freedom to identify chemicals

IUPAC chemical names, SMILES and InChIs are too long. InChIKeys are not unique enough to be collision-safe (you have a 1 in 10 billion chance of a collision blowing up your building; well, the odds are actually much, much lower than those of getting hit by Osama or friends, let alone a car). Wikipedia URIs do not cover enough chemical space.

However, we need short identifiers. Why, actually? Computers do not care about long identifiers; systems can be integrated, and a web link is easy to make. But we humans do care. A bottle on the shelf does not have an HTML interface, and you do not have a scanner to read the chemical structure from a 2D barcode (see DOI:10.1021/ci049758i).

The CAS registry number has served this purpose for a long time. For example, as used on bottles visible in this picture (copyright: CC BY-SA, Science in the Open):

Now, when Anthony reported that CAS, the organization that builds the proprietary lookup service and which has done an amazing job in the past, does not wish to see CAS numbers in Wikipedia curated by means of the official database - it violates the end user agreement one has to sign before one can use the database - the blogging community reacted (here, here, here, here and here).

Personally, I agree with the CAS standpoint. It has been a proprietary database which people have been supporting financially for years, and whose license agreement they thoughtfully signed. So don't complain afterwards. If you really want to, end the agreement and object to the license. I commented in the original blog:
    In 1995 I started a Dutch website on organic chemistry [1] and the CAS number was as useful as it is now, and already then we knew we were not allowed to compose a database of CAS numbers. Not sure about the legal state of that, but our university had a license; not sure if students had access, but do not believe so. Anyway, building a substantial list of CAS number was not allowed. So, we looked for other means of identifying molecular structures, which led us to CML… this was around ‘96-’97 or so, at least before XML was released, and we started using CML actually when it was still in a more obscure SGML format :) Yeah, the XML recommendation was much appreciated!

    OK, so back to your blog item. You can imagine that the comment in WP by CAS does not surprise me at all; nothing really new. If they would allow this, it would set a precedence…

    The solution is, however, fairly easy. Use InChI(Key), PubChem CID, or ChemSpider CID; the latter two are on the same level as CAS numbers. CAS registry numbers are overrated. Not sure if they still hand out CAS numbers to mixture too… (I guess not).

    Oh, and I agree with Cpt. Renault… people should really abide to legal requirements. Period. If you don’t like them, quit the legal agreement. As simple as that.

    1. http://www.woc.science.ru.nl/

Here, I tend to disagree with Will, who wrote that "They are just numbers. i.e. descriptors." The CAS number only makes sense with a (curated) look-up table, making it tightly linked to the CAS database. While theoretically you may be allowed to copy numbers from that database, the license agreement strictly disagrees with that. A court would have to decide which right takes higher importance, but my vote is on the agreement, which you thoughtfully signed. So, I tend to agree with Joerg, who wrote "CAS numbers are not public domain, are they?"

An interesting bit in that blog item is the comment he left himself:
    I just realized that Peter has also commented on it. And storing 10000 CAS numbers and structures is allowed? What happens, if a journal reaches this limit? Just imagine they publish 1000 papers with 100 CAS numbers for each article? I do not get this!

Interesting indeed. This gets me back to a recent question I was confronted with: how would I use the chemical literature in the current age? Well, what about this hypothetical Taverna workflow:
  • Node 1: get me a list of journals expected to contain CAS registry numbers (such as the JCIM)
  • Node 2: for each, get me all publications of the last 25 years
  • Node 3: process all articles and count cited CAS registry numbers per journal
  • Node 4: complain if count_per_journal > 10000

Anyway. Common agreement seems to be that we can opt to do without the CAS registry number. The PubChem ID seems a reasonable candidate, and has been suggested here and here. The ChemSpider ID could be an option too, though ChemSpider content is periodically added to PubChem.

I'd also like to bring in the suggestion of having a Chemical Object Identifier: like the DOI, the COI is a simple alphanumerical identifier, with a one-to-one connection to the InChI and, unlike the InChIKey, as unique as the InChI itself, but requiring a lookup service. And the latter I can offer: http://rdf.openmolecules.net/. It's a free (as in Open) resource, where we can provide this lookup service. It would be really easy to create a new COI when an InChI is passed that has not been assigned a COI yet. A PHP page to do the reverse lookup is easy too. Interested? I can have it going by the end of the month. It comes with full RDF support, so ready for the Web-NG.
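Purely to illustrate the idea, a COI record in RDF could look like the following; the COI value, the URI layout, and the property name are all invented for this sketch, not an existing service:

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns:ex="http://rdf.openmolecules.net/ns#">
  <!-- Hypothetical record: the COI maps one-to-one to the (here: methane) InChI. -->
  <rdf:Description rdf:about="http://rdf.openmolecules.net/coi/COI:1234567">
    <ex:hasInChI>InChI=1/CH4/h1H4</ex:hasInChI>
  </rdf:Description>
</rdf:RDF>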

Saturday, March 08, 2008

My FOAF network #2: XSLT for a HTML GUI

Because the ACS meeting where Henry will present something about FOAF in chemistry is nearing very fast now (here's the first blog in this series), it becomes urgent to beef up the Blue Obelisk FOAF network, now consisting of 7 members. All now show up in the FOAFExplorer:


Now, to make sure that my FOAF is in order, I set up the regular XML/RDF toolchain, using xmllint to validate the XML and RDF syntax, and XSLT to convert the FOAF to human-readable HTML. Using the <?xml-stylesheet?> syntax, this also provides the basic HTML GUI when accessing the FOAF file using Firefox. BTW, I had to rename the file to make the SourceForge web server aware that the file is an XML file, so that it nicely sets the MIME type.
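For reference, the kind of processing instruction involved looks like this; the stylesheet name foaf2html.xsl is an assumption, not necessarily the file I actually use:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="foaf2html.xsl"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <!-- FOAF content goes here -->
</rdf:RDF>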

BTW, I suggest you all validate your FOAF with this RDF validator, because some of us have some work to do to make it valid:
  • Mine is having some encoding issue
  • Henry's has some 8 errors
The others are actually fine.

While the XSLT is getting along quite nicely, I have serious other work to do. The Strigi-based FOAF indexer is sort of working; it gets FOAF documents recursively, but I want it to index our publications and presentation slides too. Now, FOAF has a foaf:publications tag, which I thought might be suited for this. But after chatting with (new) friends on the #foaf IRC channel (the log), it became clear that the scope of that element is to point to some other file (foaf:Document) which lists the publications, such as HTML output created from BibTeX.

That is, the following syntax is not quite what appears to be intended:
<foaf:publications>
<foaf:Document rdf:about="http://dx.doi.org/10.1186/1471-2105-8-59">
<dc:title>Bioclipse: an open source workbench for chemo-
and bioinformatics</dc:title>
<dc:author rdf:resource="#me"/>
</foaf:Document>
</foaf:publications>


The Bibliontology was suggested and seems a rather good candidate for drafting a separate, RDF/OWL-based publication list. The server was down at the time of writing, but the Google cache showed the scope nicely. The Google group is active and the server should go back online shortly.

OK, enough for now. More will follow in this series shortly, such as an HTML GUI for my publication list in Bibliontology format.

Monday, March 03, 2008

Metabolomics Ontologies: SKOS-ified the ArMet specification

The MetWare project is going to make use of ontology technologies to control the content of the database, and a first step is to convert our MetWare database design into something using a formal ontology language. I have played with OWL in the past (see for example its use in Bioclipse), but was not overly happy with it in all situations.

Then I read about SKOS, the Simple Knowledge Organization System. Unlike OWL, SKOS is less strict about the relations between the concepts being marked up. Often these concepts are loosely bound, instead of following a strict is_a hierarchy. ArMet is a metabolomics knowledge system which does not have a strong hierarchy, and SKOS seemed to me the most suitable markup candidate. So, I SKOS-ified the ArMet specification, resulting in this rather simple document. The document is SKOS, but has an associated skos2html.xsl XSLT stylesheet, so that Firefox converts it to XHTML on the fly.

An entry looks like:
<skos:Concept rdf:about="GenotypeID">
<skos:prefLabel>genotypeID</skos:prefLabel>

<skos:definition>A unique identifier for the genotype.</skos:definition>
<skos:broader rdf:resource="GenotypeProperty"/>
</skos:Concept>

The full SKOS specification allows capturing much of what we want to do, including i18n via the label system, loose hierarchical relations via skos:broader, and the concept of skos:Collection to aggregate concepts. Where needed, it allows borrowing from other languages. For example, to link concepts from MetWare to the original ArMet specification, owl:sameAs can be used.
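A minimal sketch of such a link, assuming the skos, rdf, and owl prefixes are declared; both concept URIs are made-up placeholders:

<skos:Concept rdf:about="http://metware.example.org/concepts#GenotypeID">
  <!-- Placeholder URIs, for illustration only. -->
  <owl:sameAs rdf:resource="http://armet.example.org/armet#GenotypeID"/>
</skos:Concept>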

Saturday, March 01, 2008

Jane, find me interesting journals, please.

Bioinformatics just published a paper from Schuemie and Kors (Erasmus University/NL, BioSemantics group): Jane: suggesting journals, finding experts (doi:10.1093/bioinformatics/btn006):
    Jane (Journal/Author Name Estimator) is a freely available web-based application that, on the basis of a sample text (e.g. the title and abstract of a manuscript), can suggest journals and experts who have published similar articles.
Having just gone into a different research field, I appreciate Jane as a useful tool to learn to find my way around the relevant literature. Based on, for example, the abstract of an article I find interesting, it finds me appropriate journals and authors. The next screenshot shows the results for the abstract of the Blue Obelisk paper (doi:10.1021/ci050400b):



The Show articles feature as well as the journal annotation are rather useful to get a quick overview of what is being suggested. The list of authors seems, at first sight, populated by co-authors, and lacks any form of annotation. Room for FOAF here? They used PubMed as content provider, and text mining to align articles, but nothing really semantic, despite the group's name. The output does not seem to provide semantics either.

Schuemie, M.J., Kors, J.A. (2008). Jane: suggesting journals, finding experts. Bioinformatics, 24(5), 727-728. DOI: 10.1093/bioinformatics/btn006

TODO: April 2nd, defend my PhD work

In 4.5 weeks, on Wednesday April 2 (13:30 precisely, Aula, Comeniuslaan 2, Nijmegen) I will publicly defend my PhD work performed in the Analytical Chemistry group of Prof. Lutgarde Buydens at the Radboud University Nijmegen:



Table of Contents
  1. Introduction
  2. Molecular Chemometrics (doi:10.1080/10408340600969601)
  3. 1D NMR in QSPR (doi:10.1021/ci050282s)
  4. Comparing Crystals (doi:10.1107/S0108768104028344)
  5. Supervised SOMs (doi:10.1021/cg060872y)
  6. Chemical Metadata in RSS (doi:10.1021/ci034244p)
  7. Interoperability (doi:10.1021/ci050400b, the Blue Obelisk paper)
  8. Discussion and Outlook
Chapters 2, 3, 4, and 5 are first-author papers, while for chapters 6 and 7 I am just a co-author.

Summary
Chemometrics and chemoinformatics play important roles in the analysis and modeling of molecular data, in particular in understanding and predicting properties of molecules and molecular systems. Both chemometrics and chemoinformatics apply statistics, machine learning, and informatics methodologies to chemical questions, though they originate from different backgrounds. Where chemometrics had its origins in the extraction of information from chemical experiments, chemoinformatics had its roots in the representation of chemical data for storage in databases. The technological advances in chemistry and biochemistry in the past decades have, however, led to a flood of data and new questions, and data analysis and modeling have become more complex. The standing challenge in data analysis and data exchange is how to represent the molecular features relevant to the problem at hand. This representation of molecular information is the topic of this thesis.

Chapter 1 introduces the field of data analysis and modeling of molecular data and describes the aforementioned importance of representing relevant features. It discusses different approaches to molecular representation, such as line notations, chemical graphs, and quantum chemical models. Each of these has limitations when used in data analysis and modeling. Numerical representations are then introduced, which allow the application of statistical and mathematical modeling approaches. These numerical representations are commonly derived from chemical graph and quantum chemical representations. CoMFA and the classification of enzyme reactions are examples where both the choice of molecular representation and the analysis method are important.

The term molecular chemometrics is coined in Chapter 2 for the field that applies statistical modeling methods to molecular structure. It reviews the advances made in this field in recent years. New numerical descriptors for molecules are discussed, as well as approaches to represent molecules in more complex systems like crystal structures and reactions. Molecular descriptors are used in similarity and diversity analysis. The applications of new methods for structure-activity and structure-property modeling, and for dimension reduction, are described. An overview of recent approaches in model validation shows new insights and approaches to estimate the performance of classification and regression models. The last section of this chapter lists new databases and introduces new methods that improve the extraction of chemical data from databases and repositories. Semantic markup languages improve the exchange of data, and new methods have been introduced to extract molecular properties from text documents.

Chapter 3 studies the use, proposed in the literature, of 1D 13C and 1H NMR spectra as molecular descriptors. These spectra are known to describe features relevant to physical properties like solubility and boiling point. The NMR representation is studied for the predictive power of its PLS models on three structure-property data sets. The results indicate that proton NMR is not suitable for building QSPR models in combination with PLS. Carbon NMR-based models, however, do give reasonable QSPR models, and the regression vectors for the carbon NMR data correlate with spectral regions relevant to molecular fragments. Nevertheless, the predictive power of the carbon NMR-based models is still less than that of models based on common molecular descriptors. It is concluded that NMR spectra should not be considered first choice when making predictive models in general, and that proton NMR should probably not be used at all.

A computational method to calculate similarities between crystal structures, based on a new representation, is introduced in Chapter 4. While a reference method is perfectly able to identify structures with high similarity, it fails to distinguish the similarity between two similar structures from that between two completely different structures. This makes it very difficult for clustering algorithms to organize small clusters of identical and highly similar structures into larger clusters. The new representation of crystal structures introduced in this chapter shows a much smoother transition in similarity values when crystal structures go from identical, via similar, to dissimilar structures. Clustering a set of simulated polymorphic structures of estrone, and classification of a set of experimental cephalosporin structures, reproduce the expected clustering and classification.

Chapter 5 uses supervised self-organizing maps to cluster crystal structures represented by their powder diffraction pattern and one or more properties. The topological structure of the resulting maps not only depends on the similarity of the diffraction data, but also on the properties of interest, such as cell volume, space group, and lattice energy. This approach is used to analyze and visualize large sets of crystal structures, and the results show that these supervised maps not only give a better mapping, but can also be used to predict crystal properties based on the diffraction patterns, and for subset selection in polymorph prediction. The two applications in crystallography show that suitable representations and similarity measures that allow data analysis and modeling of molecular crystal data are now available. Both approaches are flexible enough to open up a new field of research; especially combinations with other classification schemes for crystal structures, such as those based on hydrogen bonding patterns, come to mind.

Chapter 6 introduces and discusses a method that allows information-rich distribution of molecular data between machines, such as measuring devices and computers. Existing approaches often come with undocumented or badly documented semantics, which may lead to information loss. CMLRSS is proposed, which combines two existing web standards: Rich Site Summaries (RSS), also known as RDF Site Summaries, and the Chemical Markup Language (CML). Here, RSS is used as the transport layer, while CML is used to contain the chemical information. CML supports a wide range of chemical data, including molecular (crystal) structures, reaction schemes, and experimental data such as NMR spectra. It is shown that this semantic representation allows automated dissemination of chemical data, and it is increasingly used to exchange data between web resources.

Chapter 7 describes a communal effort to realize interoperability in chemical informatics, called the Blue Obelisk movement. This movement currently consists of more than ten smaller and larger open source and open data projects, all related to chemoinformatics and chemistry in general. To increase the reproducibility of molecular representations, this chapter introduces a collaborative dictionary of chemoinformatics algorithms, and a public repository of chemical data of general interest, including data for chemical elements and isotopes (boiling points, colors, electron affinities, masses, covalent radii, etc.), definitions of atom types, and more. With the availability of a standard set of atomic properties, open source algorithms, and open data (for example via CMLRSS feeds), it is much easier to reproduce and validate published results in molecular chemometrics. The results from Chapter 3 show that such ability is no luxury.

The last chapter summarizes the efforts in this thesis and how they address the challenges in molecular chemometrics. This thesis shows the strong interaction between representation and the methods used for data analysis: molecular representations need to capture the relevant information and be compatible with the statistical methods used to analyze the data. The chapters review molecular representations and put focus on model validation using statistics, visualization methods, and standardization approaches.