## Sunday, August 31, 2008

### UgiChem2CML

The nice thing about a hack session is that you have something to write about. Below is a screenshot of a Ugi reaction in Bioclipse... note the source tab of the editor, which holds the CML. Now, JChemPaint can do reactions too (I did that in 2003 in Peter's group, but it seems to be offline at the moment), but this was the quick hack to do the CMLReact in Google Docs (or soon to be):

And this is us this afternoon:

### Science Blogging 2008 London was Cool!

Definitely not a first post, but here are my experiences of my first blogging conference (see also this and this, the latter using semantic markup for the event): it was fun! My suggested unconference session was not chosen, because I, as I usually do, focus too much on how instead of why one wants to do something. Nevertheless, I got to say my things, so I won't complain. While I have not noticed vivid live coverage of the conference in the blogosphere, several people were covering the meeting live on FriendFeed. Really nice, because you can comment on statements the speaker makes while he is talking. People have been using the sciblog tag, which should give you enough hits in the various aggregators and social sites.

The main thing I liked about this conference was the chance to meet fellow bloggers. I am not so much interested in why others blog, and generally do not read blogs about the scientific life. I have written up in the past why I blog, so read that. What does interest me is how we can enhance blogs to make them easier to aggregate, search through, retrieve data from, etc. What I'd like to be able to do is read a blog item, note that it is about a topic I like, go off into Taverna or Bioclipse (possibly via Ubiquity), and hit the get-me-that-data-blob button. Now, I don't mind it being hidden behind a paper, being on Google Data, or whatever; I just want to simply hit that button.

Returning readers of this blog know that semantic chemistry is something I have worked on in the past, but while Chemical blogspace has a nice people-blogged-about-this-molecule section, it has not really picked up. The main reason is that people cannot or do not want to add semantic markup. Now, the one thing I liked most of the conference discussions yesterday (the pub was too noisy for me to reasonably chat with anyone) was the proposal to use Ubiquity for adding these semantics. So, commands like addSechemticMarkup, convertSMILESIntoInChIKey, that sort of thing... The cool thing here is that it is blogging-service independent. It works for anything inside Firefox, including wikis, email, knols, whatever. Now, one obstacle is that Ubiquity involves a command line, and we know how much people dislike command lines, but I'm sure they will come up with Guiquity. Actually, maybe these are the activities that Mike has been talking about...

### Creating CMLReact from UsefulChem Ugi Reactions

Cameron, Jean-Claude and I were invited to Peter's place in Cambridge, where we are now hacking on CMLReact for the Ugi reactions Jean-Claude has been working on. I just finished a script that uses the CDK and Sam's interface to the InChI library to convert a list of four reactants and one Ugi product into CMLReact (doi:10.1021/ci0502698 S1549-9596(05)00269-X). The full BeanShell script looks like:

    #!/usr/bin/bsh
    import java.io.File;
    import java.io.FileReader;
    import java.io.BufferedReader;
    import org.openscience.cdk.*;
    import org.openscience.cdk.exception.*;
    import org.openscience.cdk.inchi.*;
    import org.openscience.cdk.interfaces.*;
    import org.openscience.cdk.io.CMLWriter;
    import org.openscience.cdk.libio.cml.Convertor;
    import org.xmlcml.cml.element.CMLReaction;
    import net.sf.jniinchi.INCHI_RET;

    InChIGeneratorFactory factory = new InChIGeneratorFactory();

    // Get InChIToStructure
    File file = new File("inchi.ugi.txt"); // five inchis expected, last being the product
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String first = reader.readLine();
    String second = reader.readLine();
    String third = reader.readLine();
    String fourth = reader.readLine();
    String product = reader.readLine();

    System.out.println("First: " + first);
    IMolecule firstAC;
    {
      InChIToStructure intostruct = factory.getInChIToStructure(first, DefaultChemObjectBuilder.getInstance());
      INCHI_RET ret = intostruct.getReturnStatus();
      if (ret == INCHI_RET.WARNING) {
        // Structure generated, but with warning message
        System.out.println("InChI warning: " + intostruct.getMessage());
      } else if (ret != INCHI_RET.OKAY) {
        // Structure generation failed
        throw new CDKException("Structure generation failed: " + ret.toString()
          + " [" + intostruct.getMessage() + "]");
      }
      firstAC = new Molecule(intostruct.getAtomContainer());
    }

    System.out.println("Second: " + second);
    IMolecule secondAC;
    {
      InChIToStructure intostruct = factory.getInChIToStructure(second, DefaultChemObjectBuilder.getInstance());
      INCHI_RET ret = intostruct.getReturnStatus();
      if (ret == INCHI_RET.WARNING) {
        // Structure generated, but with warning message
        System.out.println("InChI warning: " + intostruct.getMessage());
      } else if (ret != INCHI_RET.OKAY) {
        // Structure generation failed
        throw new CDKException("Structure generation failed: " + ret.toString()
          + " [" + intostruct.getMessage() + "]");
      }
      secondAC = new Molecule(intostruct.getAtomContainer());
    }

    System.out.println("Third: " + third);
    IMolecule thirdAC;
    {
      InChIToStructure intostruct = factory.getInChIToStructure(third, DefaultChemObjectBuilder.getInstance());
      INCHI_RET ret = intostruct.getReturnStatus();
      if (ret == INCHI_RET.WARNING) {
        // Structure generated, but with warning message
        System.out.println("InChI warning: " + intostruct.getMessage());
      } else if (ret != INCHI_RET.OKAY) {
        // Structure generation failed
        throw new CDKException("Structure generation failed: " + ret.toString()
          + " [" + intostruct.getMessage() + "]");
      }
      thirdAC = new Molecule(intostruct.getAtomContainer());
    }

    System.out.println("Fourth: " + fourth);
    IMolecule fourthAC;
    {
      InChIToStructure intostruct = factory.getInChIToStructure(fourth, DefaultChemObjectBuilder.getInstance());
      INCHI_RET ret = intostruct.getReturnStatus();
      if (ret == INCHI_RET.WARNING) {
        // Structure generated, but with warning message
        System.out.println("InChI warning: " + intostruct.getMessage());
      } else if (ret != INCHI_RET.OKAY) {
        // Structure generation failed
        throw new CDKException("Structure generation failed: " + ret.toString()
          + " [" + intostruct.getMessage() + "]");
      }
      fourthAC = new Molecule(intostruct.getAtomContainer());
    }

    System.out.println("Product: " + product);
    IMolecule productAC;
    {
      InChIToStructure intostruct = factory.getInChIToStructure(product, DefaultChemObjectBuilder.getInstance());
      INCHI_RET ret = intostruct.getReturnStatus();
      if (ret == INCHI_RET.WARNING) {
        // Structure generated, but with warning message
        System.out.println("InChI warning: " + intostruct.getMessage());
      } else if (ret != INCHI_RET.OKAY) {
        // Structure generation failed
        throw new CDKException("Structure generation failed: " + ret.toString()
          + " [" + intostruct.getMessage() + "]");
      }
      productAC = new Molecule(intostruct.getAtomContainer());
    }

    // Assemble the reaction: four reactants, one product
    IReaction ugiReaction = new Reaction();
    ugiReaction.addReactant(firstAC);
    ugiReaction.addReactant(secondAC);
    ugiReaction.addReactant(thirdAC);
    ugiReaction.addReactant(fourthAC);
    ugiReaction.addProduct(productAC);

    // Serialize the reaction as CML and print it
    StringWriter stringWriter = new StringWriter();
    CMLWriter cmlWriter = new CMLWriter(stringWriter);
    cmlWriter.write(ugiReaction);
    cmlWriter.close();
    System.out.println(stringWriter.toString());

My apologies for the code duplication, but I have never tried inline functions in BeanShell yet... You can monitor the efforts at Google Docs.
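For what it is worth, the five near-identical blocks could collapse into one helper that is applied to every input line. The following is only a sketch in plain Java, not part of the script: the class `UgiInput` and its `convert` method are hypothetical names, and the `converter` argument is a stand-in for the InChI-to-structure step, so that the sketch runs without the CDK on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Sketch: convert N reactant lines plus one product line with a single shared helper. */
class UgiInput<T> {
    final List<T> reactants = new ArrayList<>();
    T product;

    /**
     * Converts every line with the same converter; the last line is the product.
     * The converter is a stand-in (assumption) for the InChI-to-structure block of the script.
     */
    static <T> UgiInput<T> convert(List<String> lines, Function<String, T> converter) {
        if (lines.size() < 2) {
            throw new IllegalArgumentException("Need at least one reactant and one product");
        }
        UgiInput<T> input = new UgiInput<>();
        int last = lines.size() - 1;
        for (int i = 0; i < last; i++) {
            input.reactants.add(converter.apply(lines.get(i)));
        }
        input.product = converter.apply(lines.get(last));
        return input;
    }

    public static void main(String[] args) {
        // Placeholder lines, not real InChIs; the script reads them from inchi.ugi.txt.
        List<String> lines = Arrays.asList("first", "second", "third", "fourth", "product");
        // Upper-casing only demonstrates the plumbing; it is not a chemistry conversion.
        UgiInput<String> input = UgiInput.convert(lines, s -> s.toUpperCase());
        System.out.println(input.reactants + " -> " + input.product);
    }
}
```

In the real script the converter would wrap the `getInChIToStructure` call and its return-status check, and the loop would feed `addReactant` and `addProduct` directly.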

## Friday, August 29, 2008

### Leaving for Science Blogging 2008 London

I have to leave for the airport any second now for Science Blogging 2008 in London, so there is not much I shall say. Hope to see you tomorrow at the Royal Institution!

Update: live coverage at FriendFeed.

## Tuesday, August 26, 2008

### MetWare screenshot: propagating XML Schema data types

Just a quick screenshot. Remember our use of SKOS in MetWare? Steffen has been working on creating integrated JSF pages, while I am focusing on autogeneration of blobs. The screenshot below shows such a blob, called a UI component in JSF, which allows easy embedding in the aggregations Steffen is working on.

Autogeneration of web content benefits greatly from well-defined input, including data types. MetWare uses XML Schema Data Types for this, as mentioned earlier when I briefly discussed the generation of search pages. That example showed the creation of range input for xsd:integer types. The screenshot below shows the different output for xsd:string (input text box) and xsd:boolean:

Now, this example is not really shocking, but MetWare defines additional types, for example an InChI data type:

    <simpleType name="inchi">
      <restriction base="string">
        <pattern value="InChI=1/.*"/>
      </restriction>
    </simpleType>

This allows me to tweak the HTML output created by the JSF pages to include microformats to support the Sechemtic userscript (see also doi:10.1186/1471-2105-8-487).
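To make the effect of the pattern facet concrete, here is a small sketch in plain Java that checks a value against the same regular expression. The class name `InchiType` is made up for this example and is not MetWare code. XML Schema patterns always apply to the whole value, which is also what `matches()` does in `java.util.regex`.

```java
import java.util.regex.Pattern;

/** Sketch: check a string against the pattern of the 'inchi' simpleType. */
class InchiType {
    // Same expression as the <pattern> facet; matches() tests the whole value,
    // mirroring the implicit anchoring of XML Schema patterns.
    private static final Pattern INCHI = Pattern.compile("InChI=1/.*");

    static boolean isValid(String value) {
        return value != null && INCHI.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("InChI=1/CH4/h1H4")); // methane
        System.out.println(isValid("CH4"));
    }
}
```

A generator could use such a check to decide whether a value is safe to wrap in InChI-specific markup.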

Or, to provide a drop down box, listing the allowed values:

    <simpleType name="deviceVendor">
      <restriction base="string">
        <enumeration value="BioCrates"/>
        <enumeration value="Bruecker"/>
      </restriction>
    </simpleType>
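As an illustration of how an enumeration can drive a drop down box, the sketch below turns a list of allowed values into a plain HTML `<select>`. This is an assumption-laden example, not the MetWare generator: the class `EnumerationDropDown` and its methods are hypothetical, and the real pages are built from JSF components rather than hand-written HTML.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: render the values of an XSD enumeration as an HTML drop down box. */
class EnumerationDropDown {

    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Builds a select element named after the type, with one option per allowed value. */
    static String toSelect(String typeName, List<String> allowedValues) {
        StringBuilder html = new StringBuilder();
        html.append("<select name=\"").append(escape(typeName)).append("\">\n");
        for (String value : allowedValues) {
            html.append("  <option value=\"").append(escape(value)).append("\">")
                .append(escape(value)).append("</option>\n");
        }
        html.append("</select>");
        return html.toString();
    }

    public static void main(String[] args) {
        // Values copied from the deviceVendor simpleType
        System.out.println(toSelect("deviceVendor", Arrays.asList("BioCrates", "Bruecker")));
    }
}
```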

## Thursday, August 21, 2008

### MetWare screenshot: spectrum support #2

As promised yesterday, here's the pretty visualization of the mass spectrum, using JavaScript from the PRIDE project:

Note the manual adding of peaks at 10 and 100 m/z to get the real peaks somewhere in the middle instead of on the left and right border of the graph.

Meanwhile, the search page is now autogenerated too, and the types of searches allowed (min, max in the picture) again depend on the XML Schema data type defined in the MetWare SKOS:

## Wednesday, August 20, 2008

### MetWare screenshot: spectrum support

Not visually attractive, but that will be solved when Steffen gets his hands on it. For now, I'm happy with table formatting. Reason: it uses XML Schema to define a dataType, which is recognized by our code generators in MetWare (see also this presentation), and used to create an easy-to-use Java API, which, in turn, can be used in this JSF snippet:

    <h:dataTable value="#{metobservCharacterizationMassspectrum.spectralPoints.points}" var="specpoint">
      <h:column>
        <f:facet name="header"><h:outputText value="m/z's"/></f:facet>
        <h:outputText value="#{specpoint.mz}"/>
      </h:column>
      <h:column>
        <f:facet name="header"><h:outputText value="Intensities"/></f:facet>
        <h:outputText value="#{specpoint.intensity}"/>
      </h:column>
    </h:dataTable>

The <dataTable> @value points (via the faces-config.xml) to the MetobservCharacterizationMassspectrumBean, which has a getSpectralPoints() method (autocreated from the <skos:Concept> SpectralPoints); the object it returns has a convenience method List<SpectralPoint> getPoints().

SpectralPoint in turn has the methods getIntensity() and getMz() also used in the above JSF snippet. For convenience, SpectralPointArray also has two other methods: double[] getIntensities() and double[] getMzs() (which I'll have to rename to reuse the code for NMR support :).

So, here's the outcome:

Final note, given the dataType, the MetWare bean also has the logic to convert the data back and forth into a SQL serialization, which may eventually use base64 encoding, but currently looks like 61.0,100.0;62.0,1.1, as defined by the regular expression of the XSD dataType for spectralPointArray.
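The round trip between a string like 61.0,100.0;62.0,1.1 and a list of points can be sketched in a few lines of plain Java. This is not the generated MetWare bean: the class `SpectralPointCodec` and its `parse` and `serialize` methods are hypothetical, and only the getter names `getMz()` and `getIntensity()` are taken from the description above. The sketch assumes semicolons separate points and a comma separates m/z from intensity, as the example string suggests.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: convert between "mz,intensity;mz,intensity" strings and point objects. */
class SpectralPointCodec {

    /** One peak: an m/z value and its intensity. */
    static final class SpectralPoint {
        private final double mz;
        private final double intensity;

        SpectralPoint(double mz, double intensity) {
            this.mz = mz;
            this.intensity = intensity;
        }

        double getMz() { return mz; }
        double getIntensity() { return intensity; }
    }

    /** Parses the serialized form; an empty or null string gives an empty list. */
    static List<SpectralPoint> parse(String serialized) {
        List<SpectralPoint> points = new ArrayList<>();
        if (serialized == null || serialized.trim().isEmpty()) {
            return points;
        }
        for (String pair : serialized.trim().split(";")) {
            String[] parts = pair.split(",");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected 'mz,intensity' but got: " + pair);
            }
            points.add(new SpectralPoint(
                Double.parseDouble(parts[0].trim()),
                Double.parseDouble(parts[1].trim())));
        }
        return points;
    }

    /** Writes the points back in the same 'mz,intensity;...' form. */
    static String serialize(List<SpectralPoint> points) {
        StringBuilder out = new StringBuilder();
        for (SpectralPoint point : points) {
            if (out.length() > 0) {
                out.append(';');
            }
            out.append(point.getMz()).append(',').append(point.getIntensity());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<SpectralPoint> points = parse("61.0,100.0;62.0,1.1");
        System.out.println(points.size() + " points, back as: " + serialize(points));
    }
}
```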

## Thursday, August 14, 2008

### Profiling the CDK atom typer

I was doing some profiling (YourKit and Eclipse 3.4) of the CDK atom typer, and it turns out that most time is spent on the perception of nitrogen atom types, which seems to be caused by the loadClassInternal() method of the JVM (java-1.5.0-sun-1.5.0.16 on Ubuntu Hardy):

## Wednesday, August 06, 2008

### Scientific progress is a primary human need

Deepak asked me to comment on his blog post Is your web service open source?. With a slight delay, I did on FriendFeed. I'll copy it here.

The question is about getting return-on-investment: if I developed a new algorithm (or new efficient implementation), how can I make some money with that, to feed me, continue development, maybe just maintenance. And, how does that work for scientific software, which can best be opensource? So I replied:
Deepak, did not have time to read it earlier. I have not worked out monetizing open source chemoinformatics. As a scientist, I take the position that any implementation must be open source; that's mere consequence of the scientific requirements for peer review and reproducibility. I do understand that further research has to be funded; by making code proprietary, the guy doing the further research is the original author. That's not necessarily the right thing for scientific progress.

As a human being, I need feeding. So, I certainly understand making code proprietary, as I have not seen much success in funding ROI via support, though I do think this is the way to go, for scientific software that is. Web services are clear services, sort of consultancy with human involvement. And consultancy is proven technology. Sell access to your service. Anyone can theoretically set it up, but practically... so you basically sell your IT expertise.

A third aspect is user friendly GUIs. Say, ChemOffice, say Bioclipse. These are also scientifically not interesting to develop. Bioclipse, being open source, is an interesting example. The core is open, free, any one can contribute, *and* embed that cool new algorithm easily. This 'plugin' can be proprietary and sold commercially. No scientific shame, but with a chance for getting some ROI.

Science should be open, and never be a source of capitalism. I am not against capitalism, but I find it rather unethical to say, sure, your starving (dying from AIDS, whatever) guy, surely I can help; it will just cost ya. Making money because people like buying big cars, Pokemon cards, want their DNA sequenced, sure, no problem. But don't start making money from primary needs. Scientific progress is a primary need.
Some more details on my background on these issues can be found in I don't blame Individuals in Commercial Chemoinformatics and Why ODOSOS is important.

### Mapping Peoples Interest: Google Insight Search

Google has a new service: Google Insight Search, and I was wondering if it could tell me to use chemoinformatics or cheminformatics... No, it can't. In both there is a declining interest (only chemoinformatics shown):

More interesting is that the interest in chemoinformatics only comes from India:

This holds for both flavors, too.

## Sunday, August 03, 2008

### "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete"

The thought triggering editorial "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" by Chris Anderson can't have escaped your attention. I was shocked when I read the title and the comments made on the blogosphere and on FriendFeed.

How can he say that?! There is no analysis of data anymore?!? Don't we need to understand why X correlated with Y?!? Etc etc.

So, when I read yet another comment, by my respected opensource chemoinformatician Joerg, I just had to read the piece myself. Joerg disagrees with the statement from Chris' editorial that
[c]orrelation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.
At first, I would agree with Joerg. It's nonsense; any QSAR modeler can explain in detail the dangers of overfitting, extrapolation, etc, etc. Not to mention that basically zero mathematical modeling methods can create a statistically significant non-zero regression model with fewer than 50-100 chemical structures (chemical diversity dependent, etc).

Ok, back to the editorial. There are some arguments on Google, tons of data. Number of incoming links as measure of page importance (brilliant choice, but actually a model, IMHO, which Chris seems to step over). Tons of data. Oh, mentioned that already.

Mmmmm... but wait. Tons of data? The editorial actually refers to petabytes: Petabytes are stored in the cloud. (Whatever the cloud is... just another buzzword, trademarked too, it seems).

Eureka! Chris is right, Joerg is wrong!

Yes! Then it hit me, Chris is actually correct in his statement, and I was wrong (and Joerg too). If we move away from 50-100 molecules in our QSAR training, but use 10k chemically alike molecules, then our modeling approaches (if capable of handling the matrices) would have a much, much smaller chance of overfitting, extrapolation (there is much, much more interpolation now), etc. The chances of getting random correlation become insignificant! Actually, Chris is making the argument QSAR modelers have been making for decades: we do not need to know the mode of action in detail, as we can make, given enough training data, a reasonable regression model to predict the action! Joerg and I have been making the same argument as Chris in our PhD theses! We do not need theory; our QSAR regressions make theory obsolete! (Well, surely, we'd still prefer the theory behind the action, but we lack the measuring techniques to see what actually is happening. Joerg, still agreeing with you, so to say ;)

Except for one thing. Joerg and I suggested that 'enough' molecules are required for statistically sound regression. Chris, on the other hand, even makes the point that regression is no longer needed at all at the petabyte scale: we just look up what is happening. Does this hold for chemistry? For QSAR? A petabyte of data equals, at about, say, 10 kB of data per structure (maybe less if we use InChI and neglect conformer info), 100,000,000,000 structures. That is about 5000 times ChemSpider, if I am not miscounting the zeros (we don't care about a ten-fold at this scale anymore). Maybe, maybe not. Maybe chemical space is too diverse for that, considering that a petabyte of chemical structures is enormously insignificant compared to the full druggable space (which was about 10^60, no?)
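Written out, the back-of-the-envelope numbers look like this, taking a petabyte as 10^15 bytes and 10 kB as 10^4 bytes. The ChemSpider size in the second line is not a looked-up figure but simply what the 'about 5000 times' estimate implies, and 10^60 is the commonly quoted estimate for druggable space.

```latex
\begin{align*}
N_{\text{structures}} &\approx \frac{10^{15}\ \text{bytes}}{10^{4}\ \text{bytes per structure}} = 10^{11} \\
N_{\text{ChemSpider}} &\approx \frac{10^{11}}{5000} = 2 \times 10^{7} \\
\frac{N_{\text{structures}}}{N_{\text{druggable}}} &\approx \frac{10^{11}}{10^{60}} = 10^{-49}
\end{align*}
```

So even a full petabyte of structures would cover only a vanishingly small fraction of that space, which is why a factor of ten more or less hardly matters here.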

But not at all? This lookup approach is actually commonly used in chemoinformatics! Even at a way-below-petabyte scale: HOSE-code-based NMR prediction is a nice example of this! We do not theorize on the carbon NMR chemical shift, we just look it up!

Certainly worth reading, this Wired editorial!

PS. One last remark on the title... I'd say the scientific method is more than just making theories... I feel a bit left out as a data analyst... :( I guess the title should have said 'one of the Scientific Methods'...

## Friday, August 01, 2008

### Online, multiplayer metabolomics game!

I was just organizing my toreads, when I found this link: metabolaspel.nl, an online, multiplayer metabolomics game! It's in Dutch, but I guess anyone will get the idea :) Two teams, each may have two players, fight each other in sugar-fat conversion, by tuning the metabolism parameters:

The game board should look familiar:

I finally found a worthy follow up for Civilization :)