Wednesday, April 30, 2008

MetWare, SKOS and Java Server Faces

The MetWare components are slowly coming together. The RAW data upload facility prototype went into beta stage, while the SKOS has proven really useful for various things.

To stay compatible with various Java libraries and tools, we decided some time ago to use Java. We also wanted to start off with an HTML GUI to MetWare, which led us to Java Server Faces. Not being so fond of Tomcat (e.g. used by the NMRShiftDB), I was not sure how that would turn out, but Steffen was rather positive about it. And I like it :)

The source code for this screenshot is rather simple:
<tr valign="top">
  <td><br/>Monoisotopic Mass:<br/>
    min=<h:inputText id="monomassmin" value="#{metidMetaboliteQuery.monoisotopicMassMin}"/><br/>
    max=<h:inputText id="monomassmax" value="#{metidMetaboliteQuery.monoisotopicMassMax}"/><br/>
    <h:commandButton value="Search" id="submit" action="#{}"/>

    <!-- search results -->
    <p><h:dataTable value="#{metidMetaboliteQuery.results}" var="mbolite">
      <f:facet name="caption">Search Results...</f:facet>
      <h:column>
        <f:facet name="header"><h:outputText value="Monoisotopic mass"/></f:facet>
        <h:outputText value="#{mbolite.monoisotopicMass}"/>
      </h:column>
      <h:column>
        <f:facet name="header"><h:outputText value="InChIKey"/></f:facet>
        <h:outputText value="#{mbolite.inchikey}"/>
      </h:column>
    </h:dataTable></p>
  </td>
  <td width="25%">
    <b><h:outputText id="tabelName" value="#{metidMetabolite.prefLabel}"/>:</b>
    <h:outputText id="tabelDef" value="#{metidMetabolite.definition}"/>
  </td>
</tr>
The key concept here is that JSF uses Java Beans, which the above example refers to with expressions like #{bean.field} for bean fields and #{bean.method} for actions, assuming a bean exists with getField(), setField() and method(). Tags like <h:outputText> are JSF components that resolve these bean details and create HTML in the output. So much for a really brief intro.
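To make the #{bean.field} convention a bit more concrete, here is a minimal sketch of what a backing bean for the search form could look like. It is not the generated MetWare code: the class name MetaboliteQuery, the search() action, the in-memory list that stands in for the SQL database, and the "success" outcome string are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical backing bean for the search form. The real MetWare beans are
 * generated from the SKOS and talk to SQL; this sketch only shows the
 * getter/setter/action naming that JSF expressions rely on.
 */
public class MetaboliteQuery {

    /** Result bean: #{mbolite.monoisotopicMass} resolves to getMonoisotopicMass(). */
    public static class Metabolite {
        private final double monoisotopicMass;
        private final String inchikey;

        public Metabolite(double monoisotopicMass, String inchikey) {
            this.monoisotopicMass = monoisotopicMass;
            this.inchikey = inchikey;
        }

        public double getMonoisotopicMass() { return monoisotopicMass; }

        /** #{mbolite.inchikey} resolves to this getter. */
        public String getInchikey() { return inchikey; }
    }

    private double monoisotopicMassMin;
    private double monoisotopicMassMax;
    // Assumption: an in-memory list stands in for the SQL database.
    private final List<Metabolite> store = new ArrayList<>();
    private List<Metabolite> results = new ArrayList<>();

    // Read/write properties: JSF calls the setters when the form is submitted
    // and the getters when the page is rendered.
    public double getMonoisotopicMassMin() { return monoisotopicMassMin; }
    public void setMonoisotopicMassMin(double min) { this.monoisotopicMassMin = min; }
    public double getMonoisotopicMassMax() { return monoisotopicMassMax; }
    public void setMonoisotopicMassMax(double max) { this.monoisotopicMassMax = max; }

    /** Read-only property backing the results table. */
    public List<Metabolite> getResults() { return results; }

    /** Helper for this sketch only: fills the stand-in database. */
    public void addToStore(Metabolite metabolite) { store.add(metabolite); }

    /** Action method, as a button could reference it; returns a navigation outcome. */
    public String search() {
        List<Metabolite> hits = new ArrayList<>();
        for (Metabolite m : store) {
            double mass = m.getMonoisotopicMass();
            if (mass >= monoisotopicMassMin && mass <= monoisotopicMassMax) {
                hits.add(m);
            }
        }
        results = hits;
        return "success";
    }
}
```

The point of the sketch is only the naming: an input field bound to a property needs both a getter and a setter, an output only needs the getter, and a button action is a plain public method without arguments.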

The MetWare Beans
It is clear that Java beans for MetWare would be useful, and this is what I have been working on for the last few weeks. The relevant beans for the above example are automagically created from the SKOS, complemented with extra bits of RDF for the additional details, like field data type, mapping to SQL tables, and an example value. This all works very smoothly (the code to load() and save() into the SQL database is automatically generated too!) as you can see in the above example. The screenshot shows matches from a (local) live SQL metabolomics database. The text on the right side is directly taken from the SKOS.

Now, the bean library allows integration with other tools too, though this is not on our current roadmap. But, for example, I have been thinking about a simple Bioclipse wrapper around these beans. What is on our roadmap involves workflows for metabolomics.

Monday, April 28, 2008

JChemPaint for Bioclipse2

Today Ola, Jonathan and I had a mini-hack session on getting JChemPaint support ported from Bioclipse1 to Bioclipse2. And we made some progress:

I'm sure there is still a lot to do, but this is promising... :)

Oh, and BTW, this is based on the JChemPaint 2.3 / CDK 1.0.2 branch, in case you are interested in those details.

Blog Comments? No, Peer Reviews!

Via the blogroll of Carbon-Based Curiosities I found a number of new blogs (on top of the list I posted yesterday), and just added them to Chemical blogspace. This is something I found in Infiniflux!:

Blog comments? No, Peer Reviews! Nice thought, Joel! I'll copy that, if you don't mind.

Sunday, April 27, 2008

Comments on 'Rethinking software access'

bbgm was rethinking software access. The blog observes:
  1. current commercial licensing is unfriendly towards home science
  2. bench tools do not easily allow mash ups
About 1
Actually, much of the work I have been doing in opensource chemoinformatics was done as 'home' science; I started as an organic chemistry student, and later became a data analyst, while the CDK/Jmol/JChemPaint work was something I did at home because I liked and needed it. I started in 1995 working on a website to aid my organic chemistry studies, the Woordenboek Organische Chemie (open data). And I needed semantic tools for 2D and 3D display of molecular structure. Commercial offerings were not an option for me as a student, so I got involved with the Chemical Markup Language, Jmol and JChemPaint in 1997-98.

Note that at that time free academic licenses were rarer than now. I always had, and still have, the feeling that those clauses are just there to give academics a reason to support non-opensource tools. Also note that a lot of commercial offerings started as an incorporation of the code base of some PhD work. Not uncommonly, the PhD would simply be hired by the company.

Fact is, commercial chemoinformatics licenses are indeed unfriendly for scientists who maintain related hobbies at home. And, given my experience, I appreciate your worries: the high costs for those tools, which I certainly could not afford with my student funding, drove me to the opensource ideas many, many years ago.

About 2
The second issue brought up regards the ability to make mash ups. Open source and open standards are indeed important to make mash ups, though the former only helps you work around a lack of open standards. Using web services contributes to the solution, as they have a well-defined, open standard interface. Open source is particularly important for reproducibility of scientific results (see my thesis), and is the opposite of proprietary software, not commercial software. So, it seems bbgm is just looking for Blue Obelisk projects.

On a practical note, I think that Bioclipse might just be what you are looking for; it integrates local services and services on the internet in just the same way. Particularly, the upcoming Bioclipse2 is strong at this, and supports SOAP, BioMart, BioMoby for online services (also see this), as well as R, BioJava, CDK, Jmol as local services. You can even run Taverna workflows from within Bioclipse, if you like. Mash ups can be done in various ways. Hard-core Java coders would go the RCP plugin way, for example this nanotube example. Others will prefer scripting languages, such as JavaScript and Ruby (in addition to R and Jmol scripting). Or, you might record as a script the things you did graphically, using the recording feature.

Of course, there are other solutions... Bioclipse is just one, one to which I contributed.

About running webservices...
Running webservices basically means being a hosting provider, and requires some commercial model. One conflicting problem is that, at least so it is said, large groups within the potential user base, aka the pharma industry, do not even like sending their highly secret data over an httpS:// line to the outside world.

Rajarshi and the rest of the Indiana group have been running chemoinformatics webservices. They might be the provider you are looking for.

All I can say to bbgm: "Yes, your two thoughts are indeed issues, and many from within the Blue Obelisk community have been addressing them." Oh, and we will not stop either. Peter recently gave a nice overview in Nature of what we, Blue Obelisk members, have been cooking up: Chemistry for Everyone. And that includes the hobby scientist.

Thursday, April 24, 2008

More CDK-Ruby users...

Via Rich's blog, I was informed about the work by goesLightly on CampDepict, a Ruby-based application which uses the CDK for SMILES parsing and 2D diagram generation. With cdk-20060714.jar it's using pretty ancient code, and I have not seen a screenshot.

Anyway, it's nice to see another blogger into the CDK :)

Tuesday, April 22, 2008

Quality Publishing: EndNote versus InChIs

Some publishers hesitate a bit, but others go full speed ahead into the electronic publishing era. Noel commented on my post about OA/OD inviting added value:
    I heard a talk by the RSC at the ACS, saying that their RSS feeds contain InChIs now! Just thought I'd throw that out there :-)
The RSC Project Prospect has been ahead of other publishers for over a year already. Adding InChIs to RSS feeds is a cheap way of adding machine-readable chemistry to one's publishing pipeline; adding CML would allow much more detail (see this overview of CMLRSS information in my blog).

But, importantly, it allows third parties to efficiently set up DOI-InChI tables. Cheap (Asian?) workers become rather expensive when compared to machine mining to create such databases. Sure, the authoring becomes somewhat more expensive, but who will argue against scientists being a bit more precise in what they publish? I, for sure, would love to see authors focus on adding InChIs to experimental sections, rather than on getting EndNote to put the comma, bold and upper casing in the right place to meet journal standards.
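As a rough illustration of what such a DOI-InChI table amounts to, here is a small sketch of an index from InChI to the DOIs of the articles that mention the compound. The class name DoiInchiTable and the DOIs (with the made-up 10.9999 prefix) are hypothetical; the two InChIs are those of water and methane. How the (DOI, InChI) pairs are pulled out of an actual RSS feed is left out, since that depends on the feed format.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a DOI-InChI table: which articles mention which compound. */
public class DoiInchiTable {

    // InChI -> DOIs of the articles mentioning it, in insertion order.
    private final Map<String, Set<String>> doisByInchi = new LinkedHashMap<>();

    /** Registers that the article with this DOI mentions the compound with this InChI. */
    public void add(String doi, String inchi) {
        doisByInchi.computeIfAbsent(inchi, key -> new LinkedHashSet<>()).add(doi);
    }

    /** Returns the DOIs recorded for an InChI, or an empty set if there are none. */
    public Set<String> lookup(String inchi) {
        return doisByInchi.getOrDefault(inchi, Collections.emptySet());
    }

    public static void main(String[] args) {
        DoiInchiTable table = new DoiInchiTable();
        // Made-up DOIs; in practice these pairs would come from the feed entries.
        table.add("10.9999/example.1", "InChI=1/H2O/h1H2");
        table.add("10.9999/example.2", "InChI=1/H2O/h1H2");
        table.add("10.9999/example.2", "InChI=1/CH4/h1H4");
        System.out.println(table.lookup("InChI=1/H2O/h1H2"));
    }
}
```

Because the InChI is just a string, an exact-match lookup like this needs no chemistry toolkit at all, which is part of why machine mining is so much cheaper than manual curation here.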

Another publisher who takes its job seriously is Beilstein. Stephan recently showed me some of the things they are up to, like information-rich figures (yes, you'll have access to the source, and can identify the molecular structures in reaction schemes). He also pointed me to the RDF that is now available by default for all their articles. For example, for DOI:10.1186/1860-5397-3-50, the RDF is available here. It's indicated in the HTML with:
<link rel='alternate' type='text/rdf' title='RDF' href=''/>
There is, actually, also a lot of citation information available in the <meta> tags in the HTML, but apparently not the right stuff yet to have Zotero pick it up nicely (not sure what this Firefox plugin is actually looking for). No chemistry in the RDF it seems, but there is BIBO, FOAF and Dublin Core.

Main suggestion to Stephan, right now, would be to include InChIs in the RDF and RSS feed.

Disclaimer: Colin, behind Project Prospect, visited our group when I was still in Cologne; Stephan contributed code bits to the CDK project, e.g. this Matrix class.

Oh, Nature is, of course, also a publisher that is actively entering the electronic publishing age.

Monday, April 21, 2008

Open Access / Open Data leads to added value

Two companies recently showed two things:
  • open access and open data allow adding value
  • adding value is easier by forking
Rich's MetaMolecular set up Chempedia, which offers a substructure-searchable chemical Wikipedia. There is also a page to make links to new Wikipedia monographs. I am not sure why Rich chose CAS numbers instead of the InChI, given the recent controversy on the validity of CAS numbers in Wikipedia. Keep in mind that this page is for new monographs, for which the CAS number is likely not verified yet. On the other hand, the InChI or InChIKey is not so abundant in Wikipedia yet (I really must make an updated list).

ChemSpider has been using a similar approach to add value to existing resources. The interesting thing in this case is that these substructure-searchable versions have a nice spin-off: they allow ChemSpider to build a valuable DOI-InChI table. So far, I spotted:

If you wonder how to integrate all data again when things are so distributed, just consider userscripts.

Tuesday, April 15, 2008

"Make all research results CC-BY"

While I do not agree with the details of the statement made by Klaus, I agree with his intentions, and am happy to propagate the mantra, like others did before me:

The details I disagree with:
  • no need for shouting; we can all perfectly well read it in lower case
  • CC-BY is not required; any open data license will do

Now, I know some of you disagree, and I understand the costs of maintaining and curating a database. But, if all research results were freely available, these costs could be shared by the community, and we could all stand on the shoulders of giants.

Wednesday, April 09, 2008

The MetWare developers meeting in Halle

Today the MetWare developers meeting starts, hosted by Steffen Neumann at the Leibniz-Institut für Pflanzenbiochemie. Steffen's group and the Applied Bioinformatics group where I now work are co-developing an opensource platform for metabolomics data management. Not really a full LIMS system, but a system to keep track of all the facts about the experiments and samples we would use when analyzing the data in order to find new chemistry, biomarkers, etc (see this earlier blog too). The good news is that BioAssist is developing a support platform for the NMC, and plans to use MetWare as a main component.

OK, off to catch my train now. See you online (#metware); the wiki has an agenda for the meeting.

Monday, April 07, 2008

The CDK/Metabolomics/Chemometrics Unconference results

As announced earlier, Miguel, Velitchka, Christoph and I held a small CDK/Metabolomics/Chemometrics unconference. We started late and did not have an evening program, so we did not get an overwhelming number of results. However, we did do molecular chemometrics.

We used the R statistics software together with Rajarshi's rcdk package (an R wrapper around the CDK library) and Ron's (my PhD supervisor) PLS package (see this paper), to predict retention indices for a number of metabolites.

We ended up with this R script:
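library(rcdk)   # R wrapper around the CDK
library(pls)    # provides plsr()
# read the structures and list all available molecular descriptors
mols = load.molecules("data_cdk.sdf")
selection = get.desc.names()
# leave out the descriptor that threw a NullPointerException
selection = selection[-which(selection=="org.openscience.cdk.qsar.descriptors.molecular.AminoAcidCountDescriptor")]
x = eval.desc(mols, selection, verbose=TRUE)
# keep only the descriptor columns that contain no NAs
x2 = x[,apply(x, 2, function(a) {all(!is.na(a))})]
# retention indices, read into column V1
y = read.table("data_cdk_RI")
input = data.frame(x2, y)
# PLS regression with up to 50 components, cross-validated
pls.model = plsr(V1 ~ ., 50, data=input, validation="CV")
plot(pls.model, ncomp=20)
abline(0,1, col="red")
plot(pls.model, "loadings", comps=1:2)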
The AminoAcidCountDescriptor threw us a NullPointerException and there were a few NAs in the resulting matrix. The CV results were not as good as Velitchka's best models, but still a good start:

No variable selection; 200 objects, 190 variables.

  • Can we do this in Bioclipse2 too?
  • Can we improve the default CDK descriptor parameters to maximize the column count?
  • Rajarshi, what would be involved to write some wrapper code for atomic descriptors for rcdk?

Legal Advice Needed: the NIH restricting access to our CC-licensed research results

In reply to Peter's news that the NIH's PubMed Central (PMC) does not allow machine retrieval of content, I was wondering about this section in the CC license of much of the PMC content, such as our paper on userscripts (section 4a of the CC-BY 2.0):
    You may not distribute, publicly display, publicly perform, or publicly digitally perform the Work with any technological measures that control access or use of the Work in a manner inconsistent with the terms of this License Agreement.
CC-BY 3.0 reads differently, but has similar aims.

Let me make clear that I value machine-readable publications much more than free (gratis, as-in-free-beer) publications. The NIH initiative is, for now, just 'Free Access'. An interesting step, but not one I care much about; not in relation to science anyway.

Now, Peter indicates that the NIH has put in place 'technological measures to control access' to the distribution of our work on userscripts (the PMC entry). That is in clear violation of the CC license.

I know that other NIH initiatives do allow this, such as PMC OAI, but that's just an 'auxiliary service'. It may come down to technical details, but some text on the PMC website is at least inaccurate:
    Crawlers and other automated processes may NOT be used to systematically retrieve batches of articles from the PMC web site. Bulk downloading of articles from the main PMC web site, in any way, is prohibited because of copyright restrictions.
The way it is described right now, it is like: You may not drive a car. Next paragraph. But, if you have a driver's license, we will approve. Or, translated to this example: You may only use this and that article, but only a few of them. Next paragraph. Unless you use the following technical hole in the measure we took to disallow you access.

What the PMC website should indicate, instead, is that text mining is allowed for the PMC OAI subset, but that they would highly prefer that people use the PMC OAI or PMC FTP routes. This is the least they have to do.

No matter what, I still have the feeling that any technical obstacles are disallowed by the CC license. Is there any legal expert here who can explain to me whether the CC license allows controlling how people access my material?

Friday, April 04, 2008

T plus 51 hours: a short photo impression

I normally do not do these kinds of blog items, but, in reply to Christoph's blog, here's an overview of the ceremony (see also T-26 and T+18):

This is the doctorate certificate Christoph mentioned, along with Karin and our kids:

And, here (map) was the dinner in the evening:

Thursday, April 03, 2008

T plus 18 hours: dr and preparing for the afterparty, umm ^w^w^w, CDK/Metabolomics/Chemometrics unconference

I am a doctor now; I shall now be addressed as weledelzeergeleerde Egon, translating to something like quite-noble-very-knowledgeable, hahahaha. Later, I'll put up a few photos of the ceremony, which is actually quite formal at the Radboud University.

With this blog item, I would like to thank everyone who left a message, sent email, etc with good luck messages. Very much appreciated! I'd also like to thank my supervisors, promotores Lutgarde Buydens and Peter Murray-Rust (he mentions the event here), and Ron Wehrens, for their confidence in me and their guidance on the path towards the post-doc life. I also thank all those who attended my defense; I had a brilliant day, and actually enjoyed talking to those who took a seat on my promotion committee and who asked me the not-really-nasty-questions about my work.

CDK-Chemometrics in Metabolomics Unconference
For today, I organized a small, informal unconference, oriented around the CDK, chemometrics and metabolomics. I'm certain we will be online much of the day, as we typically do. The meeting will start around 10:00 CEST, but we'll attend a seminar by Marjana Novič at 11:00 CEST. If you happen to be in Nijmegen, just drop in on the Analytical Chemistry department. Otherwise, join the #cdk chat channel in the network.

What we'll do?? Hey, it's an unconference; we have no idea yet :)

Tuesday, April 01, 2008

T minus 26 hours: defending open source chemoinformatics (and more)

In about 26 hours from now, I will be defending my PhD thesis. Follow that link to read the summary; I was thinking of publishing my introduction and discussion (the rest has been published in peer-reviewed journals) on Nature Precedings; would that be a good idea? Otherwise, I'll post it in my blog. If you just happen to want to attend the public defense, it's here: