Monday, August 22, 2011

CDK 1.4.2: the changes, the authors, and the reviewers

It has not been long since the CDK 1.4.1 blog post, but about a month after the actual release there was already more than enough content for the next minor release. The 1.4.2 release adds another batch of atom types by Asad, Nimish, and Gilleain, which make up the majority of the patches. There are also some fixes to the creation of the JavaDoc, where the link to the Git repository got broken somewhere along the way. It turns out that the SourcePosition class returned by Tag.position() no longer returns the full file name path, but only the ClassName.java in some intermediate folder, causing all package information to be lost. And more apparently got lost, because we no longer seem to have module pages. Clearly, some help with project maintenance is appreciated!

Other cool changes include a patch by Thorsten to speed up Morgan number calculation, a reference to the paper by Miguel on metabolite identification (DOI:10.1093/bioinformatics/btr409) for which he wrote a lot of code for the CDK, a bug fix, code clean up patches by Dmitry allowing this branch to be compiled with Java5 again, and the first patches by my son :)

The Changes
  • Apparently something changed in the JavaDoc API, and I only get the file name now, so added this work around to be at least somewhat useful 819f2fc
  • Module pages are lost, remove the link for now f1ffaca
  • cleaner patch for Fe atomtype test cases added ed3a554
  • cleaner patch for Fe atomtype data added b92b8b1
  • cleaner patch for Fe atomtype added da8d3d0
  • Hg Atom type owl added 66d6321
  • Hg Atom type added d9fc795
  • Hg Test cases added 0e3d587
  • Added Nimish who did a lot of the hard work from the atom type patches from Asad's lab, which are going in under Gillieain's name 888ab40
  • Added new contributor's name to the copyright/license header f006ffb
  • made the creation of morgan numbers N times faster, where N is the number of atoms in the AtomContainer 1c5ba83
  • Added the new CDK contributor of the previous two patches 5d268f5
  • changed assertEquals(true to assertTrue( and the same for false 2 cd467ca
  • changed assertEquals(true to assertTrue( and the same for False d3e08a2
  • Fixed charge in unit test: +4 not 0 c8bf1f1
  • Te final 17d6044
  • Rb final af5f764
  • Ra final 5ebe425
  • Made the code Java 5 compilable. 7406c2b
  • Removed usefull JavaDoc f6076d5
  • closed @cdk.bug 3273205 and created a test. Control that IsotopeContainer doesn't get null. 9733d0e
  • MolecularFormulaManipulatorTest: Cleaned up javadoc - removed all empty/meaningless javadoc - wrote short oneliner javadoc for the more complicated methods 681cef8
  • MolecularFormulaManipulatorTest: Updated testGenerateOrderEle to match updated method 9028b3c
  • Cr final fce1c7a
  • Be final ac15cb5
  • Ba final d9632c1
  • Au final 936d2d3
  • Ag final 75036b8
  • Cl final 5920102
  • In final 78b3d27
  • Hooked in Cd atom type perception bd3ca4d
  • Cd final 4d68c27
  • Pu final 0eebd74
  • Th final c684298
  • Include CDK deps in the classpath, to not get false positive warnings about missing CDK classes 2701ea6
  • Added citation:RojasCherto2011 which uses some CDK classes to generate elemental formulas. 99ac224
The Authors

13 Gilleain Torrance
11 Egon Willighagen
 6  Syed Asad Rahman
 2  Jules Kerssemakers
 2  Lars Willighagen
 2  Miguel Rojas Cherto
 1  Thorsten Flügel
 1  Dmitry Katsubo

These are patches from four countries and six independent groups. I am still hoping to find some time (or someone else) to make a Google Map that plots the commits in time on a world map :)

The Reviewers

25  Egon Willighagen 
  2  Syed Asad Rahman 

Yeah, we're still short on people interested in doing review work for the CDK. Feel free to put that on your CV, alongside reviewing work for journals!

Saturday, August 20, 2011

Cleaner CDK Code #10: clean patches

The CDK code base is not just a regular dump of Java source code; it is an annotated dump of Java source code. You might have heard about git blame, and if you did not, this would be a good time to start reading up on git, e.g. using this great book: Git from the bottom up. However, that will not tell you about git blame, but The Git Community Book will, and the man page will give you all the details.

We take advantage of the history of a file, as it helps us understand the full picture, complementing JavaDoc, inline comments, proper variable names, etc. The annotation links each code line to a commit message. That also explains why CDK reviewers insist on good commit messages. No useless messages like 'fixed a bug', but a message that actually describes what has been fixed, and how. That's hard to do, but we are all increasingly trained twitterers, so we are trained to say much in 140 chars. Well, some are. So, I always hope to see something like "made the creation of morgan numbers N times faster, where N is the number of atoms in the AtomContainer" (like today).

What I do not like to see are line changes that do not actually change anything, for example, because they 'fix' whitespace. First, they ruin the git line annotation, by linking an unrelated commit to a particular code line. Second, the reviewer does not know whether the line has code changes or just whitespace changes, and has to check the line in detail anyway. That is a waste of precious time, where code review is already quite a bottleneck in the CDK development process.

So, no stuff like this in your next patch please (it's extracted from a larger patch):

-        mol.addBond(0,1, CDKConstants.BONDORDER_SINGLE);
-        mol.addBond(1,2, CDKConstants.BONDORDER_DOUBLE);
-        mol.addBond(2,3, CDKConstants.BONDORDER_SINGLE);
-        mol.addBond(3,4, CDKConstants.BONDORDER_DOUBLE);
-        mol.addBond(4,0, CDKConstants.BONDORDER_SINGLE);
+        mol.addBond(0, 1, CDKConstants.BONDORDER_SINGLE);
+        mol.addBond(1, 2, CDKConstants.BONDORDER_DOUBLE);
+        mol.addBond(2, 3, CDKConstants.BONDORDER_SINGLE);
+        mol.addBond(3, 4, CDKConstants.BONDORDER_DOUBLE);
+        mol.addBond(4, 0, CDKConstants.BONDORDER_SINGLE);

Update: It's fine to have whitespace changes as separate patches; if one knows that a patch contains only whitespace changes, it requires a different kind of reviewing. Just don't mix them with functional changes that require more in-depth reviewing.
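To make the reviewer's problem concrete, here is a minimal sketch, in plain Java, of a check that flags line pairs that differ in whitespace only. The class and method names are made up for this illustration and are not part of the CDK or of git; git itself offers `git diff -w` to ignore whitespace when comparing lines. Note the simplification: stripping all whitespace also hides changes inside string literals, so this is a rough aid and not a substitute for review.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: spots line changes that only touch whitespace. */
final class WhitespaceOnlyCheck {

    /** True when the lines differ, but are equal once all whitespace is removed. */
    static boolean differsOnlyInWhitespace(String before, String after) {
        return !before.equals(after) && stripAll(before).equals(stripAll(after));
    }

    /** Removes every whitespace character (a deliberate simplification). */
    private static String stripAll(String line) {
        return line.replaceAll("\\s+", "");
    }

    /** Returns the indexes of line pairs that changed in whitespace only. */
    static List<Integer> whitespaceOnlyChanges(List<String> before, List<String> after) {
        List<Integer> hits = new ArrayList<>();
        int n = Math.min(before.size(), after.size());
        for (int i = 0; i < n; i++) {
            if (differsOnlyInWhitespace(before.get(i), after.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }
}
```

Applied to the patch fragment above, every removed/added pair would be flagged, which is exactly the kind of change that belongs in a separate patch.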

Speeding up the CDK: Morgan numbers

Thorsten Flügel found a nice speed up for the CDK as part of the work in Dortmund on Scaffold Hunter: calculation of Morgan numbers. He has actually written a set of patches, and analyzed several bottlenecks. I expect more of that work to enter the CDK. Below is my observation of the speed up:


The patch for this has been pushed to cdk-1.4.x now.

Calculation of Morgan numbers is used in (canonical) SMILES generation, but also in the isomorphism checker, so the performance boost is probably going to show up in many places. Got numbers? Blog them!
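For readers who have not met Morgan numbers before, here is a textbook-style sketch of the extended connectivity idea in plain Java. It is neither the CDK code nor Thorsten's patch: the molecule is assumed to be given as a plain adjacency list (`neighbours[i]` holds the indexes of the atoms bonded to atom `i`), and the class name is made up. Each atom starts with its number of neighbours, and each round replaces a value with the sum of the neighbours' values, for as long as that increases the number of distinct values.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of Morgan's extended connectivity on an adjacency list. */
final class MorganSketch {

    static long[] morganNumbers(int[][] neighbours) {
        int n = neighbours.length;
        long[] current = new long[n];
        // Round zero: the connectivity (number of neighbours) of each atom.
        for (int i = 0; i < n; i++) {
            current[i] = neighbours[i].length;
        }
        int classes = countDistinct(current);
        // At most n rounds; stop as soon as a round no longer adds distinct values.
        for (int round = 0; round < n; round++) {
            long[] next = new long[n];
            for (int i = 0; i < n; i++) {
                for (int j : neighbours[i]) {
                    next[i] += current[j];
                }
            }
            int nextClasses = countDistinct(next);
            if (nextClasses <= classes) {
                break; // keep the values of the last round that helped
            }
            current = next;
            classes = nextClasses;
        }
        return current;
    }

    private static int countDistinct(long[] values) {
        Set<Long> seen = new HashSet<>();
        for (long v : values) {
            seen.add(v);
        }
        return seen.size();
    }
}
```

For a five-atom chain such as the carbon skeleton of pentane, `morganNumbers(new int[][]{{1},{0,2},{1,3},{2,4},{3}})` gives 2, 3, 4, 3, 2: the middle atom gets the highest number and the two ends the lowest.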

Saturday, August 13, 2011

Rough guide to writing CDK descriptors

Some time ago Andrew asked me how to write molecular descriptor implementations for the CDK. I have no such chapter in my book yet, and at that time wrote up a quick overview of the general steps. In the near future I will elaborate on those steps, but just to have them more easily recoverable, here are those steps as I replied on FriendFeed at the time:

1. get yourself a working CDK development environment in NetBeans or Eclipse (I only have experience with the latter)
2. create a new class implementing the interface IMolecularDescriptor (or IAtomDescriptor, IBondDescriptor)
3. register your descriptor in the descriptor ontology (a Blue Obelisk project)
3a. open the file descriptor-algorithms.owl and look for an existing <Descriptor>, e.g. atomCount
3b. note the resource ID, which must be unique, and represents a full URI like http://www.blueobelisk.org/ontolog...
3c. that URI you will use in the .getSpecification() method of your descriptor implementation class, see e.g. BCUTDescriptor.java
4. decide if your descriptor has parameters, which it does not have to
if it does not, .getParameters() should return a zero length Object[] array
5. .getDescriptorNames() should return labels for each value you return, e.g. some descriptor algorithms return multiple values, like the BCUTDescriptor, while others return a single value, like the XLogP descriptor
oh, and I guess step 0: decide which version you would like to develop against. Recommended is the cdk-1.4.x branch at this moment, but master is good too. If you must, cdk-1.2.x is possible too, which matches the current stable release.
6. decide what your descriptor will return... also discussed in step 5. A single value? A double, or an integer? A boolean perhaps? Or an array of integers? This is what the method .getDescriptorResultType() is about.
Check the implementations of IDescriptorResult.java: BooleanResultType, DoubleArrayResultType, etc
if you return an array of values, make sure the length of getDescriptorNames() is the same!
7. implement .calculate() -> that's where your code will do its thing
this method will return a DescriptorValue, which will wrap provenance and the actual value
the actual value is an implementation of IDescriptorResult, but make sure to take the subclass matching the FooType.
That is, if your descriptor returns something of BooleanResultType, the actual value is a BooleanResult
Check existing implementations, and don't be afraid of making mistakes in initial versions... we all do, like this bad code in the BCUT descriptor I just discovered :) -> https://gist.github.com/984635
8. in case your descriptor calculation cannot be completed (e.g. it needs 3D coordinates, but there are none in the input), it must return NaNs, ensuring that the resulting descriptor matrix stays rectangular
That's more or less it... questions?
Thanks Egon - I'm sure I will have questions. :) - Andrew Lang
Make sure to check the Padel project, which already has a number of descriptors implemented against the CDK API, but not yet ported to the CDK: http://padel.nus.edu.sg/software/padeldescriptor/... We started an attempt here → https://github.com/cdk...
Oh, and add a unit test like this: https://gist.github.com/985200 (that will automatically detect inconsistency problems, and helps you get your implementation right)
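Two of the rules above are easy to get wrong: the number of labels must match the number of returned values (steps 5 and 6), and a failed calculation must still return NaNs of the right length (step 8). The sketch below illustrates only those two rules in plain, self-contained Java. It deliberately does not use the CDK interfaces: the class, the "coordinate span" descriptor, and the method signatures are hypothetical stand-ins, and a real implementation would wrap the values in a DescriptorValue as described in step 7.

```java
import java.util.Arrays;

/** Hypothetical stand-in, NOT the CDK API: shows the names/values and NaN rules. */
final class DescriptorSketch {

    /** One label per returned value (mirrors the role of getDescriptorNames()). */
    static String[] getDescriptorNames() {
        return new String[] {"spanX", "spanY", "spanZ"};
    }

    /**
     * Made-up descriptor: the extent of the molecule along each axis.
     * The input is an assumed array of 3D coordinates, one row per atom.
     */
    static double[] calculate(double[][] coords3d) {
        double[] result = new double[getDescriptorNames().length];
        if (coords3d == null || coords3d.length == 0) {
            // No 3D coordinates: return NaNs so the descriptor matrix stays rectangular.
            Arrays.fill(result, Double.NaN);
            return result;
        }
        for (int axis = 0; axis < 3; axis++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] atom : coords3d) {
                min = Math.min(min, atom[axis]);
                max = Math.max(max, atom[axis]);
            }
            result[axis] = max - min;
        }
        return result;
    }
}
```

Whatever the input, the result has as many entries as there are names, which is the kind of consistency the unit test linked above is meant to check.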

Friday, August 12, 2011

Usability: what happens if you neglect less abundant personas

Despite an initially hesitant BioStar community, I got some good replies to my question about biology personas, including good material from Søren Mønsted of CLC bio. Coincidentally, a few humorous perspectives came online, which in fact nicely demonstrate what I'm getting at.

When building a new platform, you need to know who will be using it and how, and how those people will interact. So, for our ToxBank design we need personas to do the requirement analysis, and I have created initial draft personas now, which I hope I'll be able to share later.

So, how people interact is important, as communication is central to scholarly research. For example, this is why we blog: blogs are like conferences. And some insight into how the various personas look at each other can be helpful in describing personas and modeling a social science platform. Matus Sotak (aka @biomatushiq) created this funny but right-on overview:


So, there it is: five personas, each of whom characterizes the others. These views reflect how others think about that persona, which is what a persona is all about: a virtual character we recognize and can characterize in terms like those used in this plot. If we hook this to requirements, we could observe that the less-knowledgeable need better access to important literature. Just to name something off the top of my head.

The second is an XKCD comic. This one is more important to the message of this post: what happens if you neglect personas? The above comic shows that ignoring personas is daily business, but is that bad?


This shows two personas: an average user who appreciates cool GUIs and apps on cool topics, and a regular dude who lives in an area where tornadoes actually occur. The take-home message here is that the mere ratio of persona abundance is generally not a proper guide for design.

Now, try to map these two comics to anything you see around you. For example, do the five personas match your research group? How does the head of your group handle this? Is he or she accepting the status quo, or trying to overcome these stereotypes? How do these personas get reflected in author lists? How does that map onto how you think about your EU project partners? Is it useful?

Repeating this experiment for the second comic is more useful. For example, map this comic to your citation list, and then reevaluate the impact of your research. This is exactly why CiTO is crucial. For our ToxBank project this last observation has major implications too.

Thursday, August 11, 2011

CDK 1.4.1: the changes, the authors, and the reviewers

It seems I had forgotten to blog about the stable update for CDK 1.4, but here it is for 1.4.1 (download). I also finally got fed up with searching my blog each time for those git scripts for commit statistics to write up this post, so I have now posted them in this gist. I also took the opportunity to point to GitHub commit pages, rather than those on SourceForge.

As befits a decent stable update release, nothing much happened. A good part is atom types: I added a few myself, and the first bits of work of the patch by Nimish and Gilleain in Asad's team made it in. Nimish has done a great job over the summer to find the details (some details are still missing, such as hybridization info for many metallic elements) for atom types found in KEGG. Besides some minor code clean up, only one other thing happened: the addition of InChINumbersTool, which you can read about already in my book (see figure), but I'll be blogging about that pretty soon too.
  • Ge final f3a8828
  • K final 8bae588
  • Li final 2641484
  • Hooked in iodine atom type detection 3d4f7a5
  • I final 82a60f0
  • Na final c2554bb
  • Hooked in As atom type detection 2ed384b
  • As final f0a327d
  • Si final c3d28b6
  • Added attribution 50ef575
  • Moved countTestedAtomTypes() to the end, to fix the checking if all atom types have been tests; also added a comment that is must be the last method 229f036
  • Mn final 105850e
  • Added missing dependency a872522
  • Added a helper class to get atom number for the heavy atoms, using the InChI algorithm abcc00a
  • Added the P.ane atom type: 5-coordinate neutral P 90371fd
  • Added a sp1 phosphor type, as found in PubChem CID 138843 (reported by Julio) d0e0bf2
  • Added a charged phosphor type, as found in PubChem CID 20643237 (reported by Julio) 19e0875
  • Added missing atom type properties for Ne, Mn.2plus, Fe.2plus, Al, Ni.2plus, Mg.2plus, K.metallic, K.plus (addresses #3355759) dae49f5
  • More descriptive variable names (fixes #3310938) and full copyright statement e871b67
The Authors

12  Egon Willighagen
  8  Gilleain Torrance

Gilleain extracted the small, single atom type patches, but, as said, much of the work was done by Nimish.

The Reviewers

  6  Egon Willighagen 
  6  Rajarshi  Guha 

Wednesday, August 10, 2011

Plotting molecular properties for (sub)sets

For a toxicology paper we are writing up, I need to create a few plots showing how the toxic and non-toxic molecules differ (or not) with respect to a few molecular properties, such as logP or the molecular weight. The rcdk package provides all, of course, except for a nice convenience method (or does it?) to make a plot. That is, I just want to do something like:
plot.propdist(
  mols,
  selections=list(all, actives, inactives),
  descriptor=
    "org.openscience.cdk.qsar.descriptors.molecular.WeightDescriptor",
  main="", xlab="Weight"
)
And now I can. The result looks something like:


Not much difference in this plot. The colors can be changed, if you like, overwriting the rainbow defaults.

The source code of my method (licensed MIT):

Tuesday, August 09, 2011

Blog planets are like conferences... (aka R-bloggers.com)

Blog planets are websites that aggregate blog feeds around a particular topic or project. They are probably named after one of the first implementations, the Planet software. These planets are like conferences, rather than journals. Like conferences with a continuously ongoing year-round poster session. And like any good scientist you blog (read: present posters) and you join blog planets (read: present your poster at conferences). The reality is that many of our peers are afraid of presenting posters at conferences (read: they are afraid of blogging).

This week my blog got accepted (read: I submitted an abstract which was reviewed and accepted) to R-bloggers.com. I do not present all my posters at this venue, and use labels to identify which posters go to this meeting. For this planet, those are labeled R. And unlike other virtual worlds, these virtual conference venues (read: web sites) are easy to reschedule. With a simple click I switch from today's floor (read: a web page) to a room dedicated to me (read: another web page).



There are many other such conferences I attend, including Planet CDK, Planet Bioclipse, Chemical blogspace (quite a general topic: chemistry, but with sessions on many topics, like cheminformatics), Planet Eclipse, Planet RDF, Nature.com Blogs (very general too, but also with dedicated floors, like chemistry) and a few more I cannot think of right now. In science such planets do not really exist in this form. The closest things are blog service providers, like Science 3.0. I guess these are like conferences for general sciences, where you're kind of lost as to which corner you really belong in, and you cling to a few bloggers (read: colleagues) whose work (read: posters) you know you'll probably like.

Which conferences do you visit with your posters?

Sunday, August 07, 2011

Usability

Usability. I am not an expert in Human-Computer Interaction (HCI) at all. Worse, I make the crappiest looking interfaces, typically. So, that's said. Usability. Wikipedia writes that "[U]sability is the ease of use and learnability of a human-made object."

A cheminformatician is, despite doing cool science, per popular demand by peer scientists, also an HCI expert to at least some extent. Scientists want usability. It is merely an extension of any scientist being a Human-Paper Interaction (HPI) expert to some extent (you know, getting the bibliography properly typeset).

Now, what is usability? What is it that someone means if he says your system has a 'usability issue'? That question forces any cheminformatician to be some sort of HCI expert. I have had usability discussions many more times than I personally care for. Too often these discussions are held without defining who the users actually are. Are they chemists/biologists for whom Excel is the supreme data analysis tool, or statisticians who work with Matlab or R, or are they hackers (like Pierre or Neil perhaps) who just want to get their work done?

Taverna and KNIME primarily target a user who thinks visually and who likes to see what happens with their data. Jmol users do not even want to see what happens to their data (file reading, etc), and only care about seeing it in nice colors. The Chemistry Development Kit, on the other hand, is targeted at hackers who know and want to know in detail what they are doing and what is going on.

Importantly, the last paragraph talks about the most visible part of usability: ease of use. In particular, ease of use for humans. However, readers of my blog know there are more than humans: there is software too, and software is a user of a system as well. Here the ease of use is defined by the Application Programming Interface, or API.

So, any system is oriented at multiple user types. And each user type will have their own set of requirements. So, in a requirement analysis process, you identify the user types and associate requirements with those. Now, my software engineering book is hidden in some box, and I can therefore not cite some good practice standards right now, but the bottom line is that talking about usability without a set of project-defined user types is difficult, and may in fact result in heated discussion, where people probably want the same thing, but just are not aligned, resulting in confusion of priorities. (This sounds wise but I get fooled each meeting again myself.)

Targeting more than one user type doubles the effort. Yet, in science this is important. Particularly for large projects where a lot of user types are expected to interact anyway: project managers, bench chemists/biologists, statisticians, data warehouses, etc. An agreement on which users are being targeted is core to the analysis. Bioclipse is an example of software where multiple user types are targeted: the visually oriented human (who will use the graphical user interface (GUI), like the Bioclipse-OpenTox one), and people who want full control (and use a scripting language).

Once the user types are defined, we can start thinking about data flow and how to model that. It is important here to find common ground and to ensure that the underlying technologies are the same. That requires your design to be expressed in layers that build on top of each other (e.g. as done in the TCP/IP and OSI network stacks). Multiple applications oriented at multiple user types must use the same lower layers. Some initial agreement about what such a layered approach looks like for your project is important too.

Now, we're not done yet. There is the learnability aspect of usability. That is often neglected, and the discussion often only focuses on the ease of use. Bioclipse is based on Eclipse, which has several approaches for learnability, one of which we adopted in Bioclipse: cheat sheets (I think a great Open Standard!). They talk the user through a particular process, but at the same time link tightly to the software and they can even make things happen in the software, by running certain actions. This way, they guide the users around the design.

I personally like scripting very much, hacker that I am. Just because of the learnability aspect of HCI. Scripts are not for everyone, but for those who know a bit about programming, scripts are a perfect tool to teach others about how your product works. This is why projects like MyExperiment exist: to share scripts (and workflows of course, but those are just graphical scripts). They are explicit, show what is happening, etc, and thus are the most informative means to get your message across. This is why my Groovy Cheminformatics book is full of scripts too. For GUIs, screencasts serve pretty much the same role, but are much less interactive: you cannot pause a screencast just to see what happens if you hit that other button at that exact same time, limiting the learnability of the solution.

As a final note, I will briefly return to Bioclipse, Jmol and layers. What Bioclipse and Jmol have in common is that they have a two-layer design (well, maybe more, but for the current argument I want to focus on two layers). The lower layer defines an API on top of which two applications are developed, both using the exact same underlying API: a GUI and a scripting language. In both Bioclipse and Jmol all GUI functionality (or 90% at least) is expressed in terms of API calls. How that technically works is a whole other story, but early on the developers of Bioclipse and Jmol decided that was a smart thing to do. In fact, both projects did not start with this approach, and changed the design later, and the point here is that any new project should take advantage of that experience and express from the start:

  1. what are the targeted user types
  2. what is the layered model that is going to be used, to allow targeting all user types
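The two-layer idea can be sketched in a few lines of Java. This is an illustration only, with made-up class names, and not how Bioclipse or Jmol are actually implemented: one lower layer holds the functionality, and both a GUI front end and a scripting front end are thin wrappers that do nothing but call that same API.

```java
import java.util.ArrayList;
import java.util.List;

/** Lower layer (hypothetical): the only place where the actual work happens. */
final class MoleculeApi {
    private final List<String> loaded = new ArrayList<>();

    void load(String name) {
        loaded.add(name);
    }

    int count() {
        return loaded.size();
    }
}

/** Upper layer for the visually oriented user: GUI events become API calls. */
final class GuiFrontEnd {
    private final MoleculeApi api;

    GuiFrontEnd(MoleculeApi api) {
        this.api = api;
    }

    void onOpenButtonClicked(String chosenFile) {
        api.load(chosenFile);
    }
}

/** Upper layer for users who want full control: script commands become API calls. */
final class ScriptFrontEnd {
    private final MoleculeApi api;

    ScriptFrontEnd(MoleculeApi api) {
        this.api = api;
    }

    void run(String command) {
        String[] parts = command.trim().split("\\s+", 2);
        if (parts.length == 2 && parts[0].equals("load")) {
            api.load(parts[1]);
        } else {
            throw new IllegalArgumentException("Unknown command: " + command);
        }
    }
}
```

Because neither front end contains logic of its own, anything a user can do by clicking can also be scripted, and the other way around, which is also what makes scripts useful for teaching how the GUI works.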

Saturday, August 06, 2011

My Talks around Europe on Lanyrd linked to SlideShare

I recently discovered two talks I had completely forgotten about, and have now updated Lanyrd with all my past presentations (or at least those I can think of now). I then discovered that the website recognized coverage from SlideShare, such as my presentations, so I linked the slides for most of my presentations.

This is what my speaking history looks like:


And this is what a slide deck embedded on Lanyrd looks like:


Hacking Bioclipse scripts in Groovy

Users are demanding. Peter (of Specs) and I chatted briefly yesterday afternoon about the Bioclipse Scripting Language, and in particular about how to append content to an existing file from JavaScript. I do not know how to do that with either the Bioclipse managers or plain JavaScript.

So, I looked up an old patch that updated the JavaScript console to use Groovy instead (which I also use in my Groovy Cheminformatics book, which just had a third edition out). And it still worked! On the train back home I cleaned up the code to use a separate console window, separate threads, etc, so that you can have a Groovy and a JavaScript console running in parallel (they do not share variables):


The print command still routes the text to something undefined, instead of the console. But it mostly works the same as with the JavaScript console, and all managers are available in the same way.

This means, one can now do fancy stuff like:

new File("/tmp/test.smi").eachLine { smiles ->
  if (smiles.length() > 0)
    js.say("" + cdk.fromSMILES(smiles))
}

Now, I just realized that the 'print' issue is actually worked around in the JavaScript console with a dedicated js manager, which I used above. But of course this routes the output to the JavaScript console, not the Groovy console :)

Update: I discovered that JSR 223 provides a ScriptContext which allows one to override the ScriptEngine's standard output. In practice that means that I got the print to work properly now :)
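For those who want to try the same trick, here is a minimal sketch using the standard javax.script API (JSR 223). It is not the Bioclipse code: the class and method names are made up, and it assumes a Groovy JSR 223 engine is on the classpath under the name "groovy". Whether print output ends up in the given writer ultimately depends on the engine honoring the ScriptContext.

```java
import java.io.StringWriter;
import java.io.Writer;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleScriptContext;

/** Hypothetical sketch: route a script's standard output to our own writer. */
final class ConsoleOutputSketch {

    /** Creates a context whose standard and error output go to the given writer. */
    static ScriptContext contextWritingTo(Writer console) {
        ScriptContext context = new SimpleScriptContext();
        context.setWriter(console);
        context.setErrorWriter(console);
        return context;
    }

    /** Evaluates the script and returns whatever it wrote to its output. */
    static String evalCapturing(ScriptEngine engine, String script) throws ScriptException {
        StringWriter console = new StringWriter();
        engine.eval(script, contextWritingTo(console));
        return console.toString();
    }

    public static void main(String[] args) throws ScriptException {
        // Assumption: a Groovy engine is registered; getEngineByName returns null otherwise.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("groovy");
        if (engine == null) {
            System.out.println("No Groovy script engine found on the classpath");
            return;
        }
        System.out.print(evalCapturing(engine, "println 'hello from the script'"));
    }
}
```

In a console window, the StringWriter would be replaced by a writer that appends to the console's text widget.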

Wednesday, August 03, 2011

Embedding remote JSON data in MediaWiki pages

Machine readable content is good for something. The actual format is not so important, and we can route RDF over JSON, so I'm fine with JSON. MediaWiki has an External Data (ED) extension that allows getting remote data in various formats, among which JSON. It works OK, but I have not figured out how to take advantage of the hierarchy. If the same field shows up at various places in the hierarchy, ED still does not distinguish between them :(

Anyway, the obvious application is to show your latest tweets on your MediaWiki homepage. Right?


The source for this content looks like:
= Twitter =
{{#get_web_data:url=http://api.twitter.com/1/statuses/user_timeline/{{{Twitter|}}}.json?count=3|format=json|data=text=text,id_str=id_str}}

{| class="table" border=0
! {{#if:{{{Twitter|}}}|Three Latest Tweets}} {{#for_external_table:
{{!}}-
{{!}} {{{text}}} }}
|}
The {{{Twitter|}}} bit hooks in to the template field that created that People box on the right hand side of that page, which you can see in the above screenshot with my twitter account set.

Of course, I have a hidden agenda here. The true reason is not twitter messages. It is being able to move data around. For example, wouldn't it be great to embed SPARQL results in wiki pages this way? Well, maybe there are better solutions, but the point is that this is technologically possible, and that we should think creatively in making mashups that help our scientific research.

Tuesday, August 02, 2011

My Google Scholar Citations profile arrived

Web of Science is my de facto standard for citation statistics (I need these for VR grant applications), and defines the lower limit of citations (it is pretty clean, but I do have to ping them now and then to fix something). The public front-end of it is Researcher ID. There is a Microsoft initiative, which looks clean but doesn't work on Linux for the nicer things, and the coverage of journals is pretty bad in my field, giving a biased (downwards) H-index. And CiteULike and Mendeley focus more on your publications than on citations (though the former has great CiTO support!).

Then Google Scholar Citations (GSC) shows up. While it does not look as pretty as competing products, it compensates for that with a wide coverage of literature (for example, it supports JChemInf, which Web of Science currently does not; and I happen to have published a lot in that journal recently), books, and reports, while keeping false positives fairly low. Thus, it provides an upper limit of my citation statistics, but one I am pretty confident about. And my H-index is quite comparable anyway. This is what my profile looks like:


So, these statistics have two purposes to me: 1. grant applications, and 2. I like to know what people base on my research. (Well, OK, 3. it helps me understand why I work so hard on too many things.)

Now the question is, will GSC take off? Will it replace ORCID? Will they join ORCID? Will GSC get a good API? Who will write the first userscript to make the GUI fancier? Will GSC support CiTO? Will GSC start using microformats or RDFa? What mashups can we expect between bibliographic databases? Will new entries automatically be posted to Google+? Will it have a button to autocreate a blog post when a paper gets cited 100, 500, or 1000 times? Will GSC support #altmetrics?

ToxBank and SEURAT-1

Update: these are my personal experiences, and do not reflect that of other people and/or organizations.

The first half year of the ToxBank EU FP7 project (co-funded by Colipa; I think I am legally required to mention them, and I am happy to do so in either case) I am working on (50%) has given me mixed feelings. ToxBank has a great team of people (and great names in the other cluster projects too!), and I am quite happy about the results we made. What results, you may ask. Well, indeed: they exist, and nice results too! It's just that they are not so visible.

That is the part I am less happy about: legalities. It took months for the consortium agreement to get finalized, and then there is a cluster agreement for the whole of SEURAT-1. Information is monopolized, and there is a general fear of accidentally releasing information on which others may claim IP. It slows us down; it inhibits new collaborations and thus serendipity. Naive and idealistic as I am, I say this is bad for science.

But, the community is positive about Open nowadays, one achievement of the gold Open Access journals, I guess, and we hosted a workshop on Open Data recently too, which was well attended. People realize that openly sharing data has a role in science (see also this post).

Anyway, the goals of SEURAT-1 are great. There now is an official website, and I am pretty sure I am not disclosing any trade secrets. SEURAT (Safety Evaluation Ultimately Replacing Animal Testing) has the ambitious goal of making animal testing obsolete. The "ultimately" is there to reflect that this will not happen any time soon, but at least we die trying (this is also why there is a -1 in the name... there may be follow-up projects, as outlined in the Vision and Strategy). Of course, that following up makes Open Source and Open Data important, in my personal opinion. Fortunately, more and more people share that opinion. I hope my direct and/or indirect contributions to ToxBank can set an example.

The title of the -1 project is "Towards the replacement of in vivo repeated dose systemic toxicity testing". So, that will be more or less the focus of the ToxBank data warehouse. The types of data will be very diverse, and include many areas of the omics space, including my favorite: metabolomics. To allow a systems toxicology approach, a short list of test compounds will be established. As part of ToxBank we have set up a Semantic Web system, allowing these compounds to be part of the Linked Data network. However, here comes the legal stuff again, and the wiki is not generally accessible; only to SEURAT-1 members (more precisely: will be very soon).

And that makes it impossible for me to start calling on the community to run their favorite tools against these compounds. For example, to calculate solubilities in various solvents. That is information useful to our compound evaluation, and would contribute to the SEURAT-1 project! But I cannot do that right now :(

But, we're just 6 months into the five year project. I think we'll see a lot of fireworks later! The hurdles may slow us down, but they will not keep us from reaching what we want.

Monday, August 01, 2011

ChEMBL-RDF as part of the Linked Open Data cloud?

This page nicely writes up what you need to do to make your RDF resource part of the Linked Open Data network. CKAN is used to aggregate facts about the resources, and I am finally getting around to adding the metadata describing how the ChEMBL data (CC-SA-BY) is linked to other LOD resources. This process is conveniently supported by a validator (see the screenshot on the right side).

The links out are mostly to various data sets of Bio2RDF. SPARQL helps me count the number of links to other LOD nodes. A typical query looks like:

SELECT count(DISTINCT ?value)
WHERE {
  ?resource ?p ?value .
  FILTER (regex(str(?value),"bio2rdf.org/pubmed"))
}

The str() function is used to allow regex() on URIs.
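To spell out what that query counts, here is the same logic as a small sketch in plain Java over an in-memory list of triples. It is an illustration only: the class name and the triples used to try it are made up, and no RDF library is involved. Like the SPARQL regex(), Pattern.find() matches anywhere in the string, and the dots in the pattern match any character in both cases.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: count distinct object values matching a pattern. */
final class LinkCountSketch {

    /** Each triple is a String[3] of subject, predicate, object (URIs as plain strings). */
    static int countDistinctObjects(List<String[]> triples, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Set<String> distinct = new HashSet<>();
        for (String[] triple : triples) {
            String value = triple[2];
            // find() looks for a match anywhere, like an unanchored SPARQL regex()
            if (pattern.matcher(value).find()) {
                distinct.add(value);
            }
        }
        return distinct.size();
    }
}
```

Two resources pointing at the same PubMed URI count as one link target here, which is what the DISTINCT in the query does.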

Right now, the data links out to four data sets, all via Bio2RDF:
Now it is a matter of waiting to see whether this is enough to make it into the next LOD cloud.

Andreas' new CDK plugin for Taverna 2.3

Andreas released a new version of the CDK plugin for Taverna, which was forwarded on the Taverna news channel. Kalai uploaded several workflows to MyExperiment.org, and so did Andreas. Congrats to all developers involved!