Tuesday, January 14, 2014

Revisited: Handling SD files with JavaScript in Bioclipse

After asking on the Bioclipse users list, it turned out there was an unpublished manager method to trigger parsing of the SDF properties (Arvid++). This simplifies creating the index, and the chemical structures no longer need to be parsed into a CDK molecule model.

That simplifies my earlier code to:

  hmdbIndex =
  props = new java.util.HashSet();
  molTable.parseProperties(hmdbIndex, props);
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    hmdbID = hmdbIndex.getPropertyFor(i, "HMDB_ID");
    idIndex.put(hmdbID, i);
  }

The next step in my use case is to process some input (WikiPathways GPML files, to be precise), detect which HMDB identifier is used, extract the SD file entry for that identifier, and append it to a new SD file (using a new ui.append() method):

  hmdbCounter = idIndex.get(idStr)
  sdEntry = hmdbIndex.getRecord(hmdbCounter)
  sdEntry = sdEntry.substring(0, sdEntry.indexOf("M  END"))
  ui.append("/WikiPathways/db.sdf", sdEntry);
  ui.append("/WikiPathways/db.sdf", "M  END\n");
  ui.append("/WikiPathways/db.sdf", "> <WPM>\n");
  ui.append("/WikiPathways/db.sdf", "WPM" + (Integer.toString(wpmId)).substring(1) + "\n");
  ui.append("/WikiPathways/db.sdf", "\n");
  ui.append("/WikiPathways/db.sdf", "$$$$\n");

This code actually does a bit more than copying the SD file entry: it also removes all previous SD fields and replaces them with a new, internal identifier. Using that identifier, I track some metadata on this metabolite.
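
The string handling in that snippet can also be tried outside Bioclipse. The following is a minimal sketch in plain JavaScript, not the Bioclipse API: the function name toWikiPathwaysEntry is hypothetical, and the zero-padding assumes that the counter in the snippet above starts at a power of ten, so that dropping the first character leaves a fixed-width number.

```javascript
// Hypothetical helper (not part of Bioclipse): keep only the molfile part of
// an SD record and attach a single new WPM field to it.
function toWikiPathwaysEntry(record, wpmNumber) {
  var end = record.indexOf("M  END");
  if (end === -1) {
    throw new Error("no 'M  END' line found in record");
  }
  // Everything before "M  END" is the molfile body; the old SD fields
  // that follow it are dropped.
  var molfile = record.substring(0, end);
  // Assumption: five-digit identifiers. Adding 100000 and dropping the
  // leading "1" zero-pads the number.
  var label = "WPM" + String(100000 + wpmNumber).substring(1);
  return molfile +
    "M  END\n" +
    "> <WPM>\n" +
    label + "\n" +
    "\n" +
    "$$$$\n";
}
```

Returning the whole entry as one string means it can be written with a single append call instead of six.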

Now, there are a million ways of implementing this workflow. If you really want to know, I chose this one because HMDB identifiers are among the more prominent IDs used in WikiPathways, and for these, as well as for ChEBI, I can use an SD file. For ChemSpider and PubChem identifiers, however, I plan to use the matching Bioclipse client code to pull in MDL molfiles. Bioclipse has functionality for all these needs available as extensions.

Tuesday, January 07, 2014

Handling SD files with JavaScript in Bioclipse

I finally got around to continuing with a task to create an SD file for WikiPathways. The problem is more finding the time than doing it. The tasks are basically:
  1. iterate over all metabolites in the GPML files
  2. extract the Xref's database and database identifier (see previous link)
  3. extract the molfile from the database SD file
  4. give the WikiPathways metabolite a unique identifier
  5. record that the WikiPathways metabolite has a molfile
  6. append that molfile along with the new WikiPathways metabolite ID to a new SD file
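
Steps 1 and 2 can be sketched in plain JavaScript as follows. This is an illustration only, not the code used here: the function name extractMetaboliteXrefs is hypothetical, it assumes that metabolites are DataNode elements with Type="Metabolite" that contain an Xref element with Database and ID attributes, and it uses regular expressions where a real XML parser would be more robust.

```javascript
// Hypothetical helper: collect the Xref database and identifier of every
// metabolite DataNode in a GPML document that is passed in as a string.
function extractMetaboliteXrefs(gpml) {
  var xrefs = [];
  var nodePattern = /<DataNode\b[^>]*>[\s\S]*?<\/DataNode>/g;
  var node;
  while ((node = nodePattern.exec(gpml)) !== null) {
    var block = node[0];
    // Only look at DataNodes whose opening tag is typed as a metabolite.
    if (!/^<DataNode\b[^>]*\bType="Metabolite"/.test(block)) continue;
    var xref = /<Xref\b[^>]*>/.exec(block);
    if (xref === null) continue;
    // Read the attributes separately so that their order does not matter.
    var db = /\bDatabase="([^"]*)"/.exec(xref[0]);
    var id = /\bID="([^"]*)"/.exec(xref[0]);
    // Skip metabolites that have not been annotated with an identifier.
    if (db !== null && id !== null && id[1] !== "") {
      xrefs.push({ database: db[1], id: id[1] });
    }
  }
  return xrefs;
}
```

Each returned object holds the database name and the identifier, which is what the lookup in the database SD file needs.
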
It turns out that Uppsala's excellent SD functionality in Bioclipse (using indexing, it opens 2 GB SD files for me) is also available from the JavaScript command line:

  hmdbIndex = molTable.createSDFIndex(
  idIndex = new java.util.HashMap();
  molCount = hmdbIndex.getNumberOfMolecules();
  for (i=0; i<molCount; i++) {
    mol = hmdbIndex.getMoleculeAt(i);
    if (mol != null) {
      hmdbID = mol.getAtomContainer().getProperty(
      idIndex.put(hmdbID, i);
    }
  }

Using this approach, I can create an index of the molfiles in the HMDB SD file by HMDB identifier, extract just those molfiles that are found in WikiPathways, and create a new, WikiPathways-dedicated SD file. When I have the HMDB identifiers done, ChEBI, PubChem, and ChemSpider will follow.
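
To illustrate the idea of such an identifier index without Bioclipse, here is a minimal sketch in plain JavaScript that works on SD file content held in a string. The function name buildSdfIndex is hypothetical, and unlike the Bioclipse index it reads everything into memory, so it is meant for small files only.

```javascript
// Hypothetical helper: map the value of one SD data field (for example
// "HMDB_ID") to the zero-based position of its record in the SD file.
function buildSdfIndex(sdfText, fieldName) {
  var index = {};
  var lines = sdfText.split(/\r?\n/);
  var recordNumber = 0;
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i];
    if (line.indexOf("$$$$") === 0) {
      // Record delimiter: the next line starts a new record.
      recordNumber++;
    } else if (line.charAt(0) === ">" &&
               line.indexOf("<" + fieldName + ">") !== -1 &&
               i + 1 < lines.length) {
      // The field value is on the line after the "> <FIELD>" header.
      index[lines[i + 1].trim()] = recordNumber;
    }
  }
  return index;
}
```

The resulting object plays the same role as idIndex above: given an identifier, it returns the position of the matching record.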

Friday, January 03, 2014

rrdf 2.0: Updates, some fixes, and a preprint

It all started 3.5 years ago with a question on BioStar: how can one import RDF into R? For lack of an answer, I hacked up rrdf. Previously, I showed two examples and a vignette. Apparently, it filled a niche, and I received good feedback. It is also starting to get cited in the literature, e.g. by Vissoci et al. Furthermore, I used it in the ropenphacts package, so when I write that up, I would like to have something to refer people to for details about the rrdf package.

Thus, during the x-mas holidays I wrote up what I had in mind, resulting in this preprint on the PeerJ PrePrints server, for you to comment on.

Yes, please go ahead, read it, try the package ("install.packages(pkgs=c("rrdf"))"), ask questions, and comment on the preprint. I expect to submit it to a peer-reviewed journal by the end of this month.

Along with the preprint, I updated the rrdf package (now at 2.0.2). These are the changes:

  • added methods to read/write RDF from strings
  • updated to Apache Jena 2.11 (yeah, I am hardly doing things from scratch)
  • remote SPARQLing now skips Jena by default (add jena=TRUE to use it again)
  • better conversion of SPARQL results to a matrix
    • proper removal of language and types
    • better handling of anonymous nodes
    • also output variables (columns) if they have no data (thanks to Alan Ruttenberg)
  • fixed the output of supported formats in documentation and error messages
Have fun!

Vissoci, J. R. N. et al. A framework for reproducible, interactive research: Application to health and social sciences (2013).