Saturday, August 31, 2013

The Dutch Dataverse Network: a host for the ChEMBL-RDF v13.5 data, and some thoughts in workflow integration

Last Thursday, there was a UM library network drink. I see a library as a place where knowledge is found, yet libraries still rarely think of knowledge as something that can be stored outside books and papers, so I was happy to see the library promoting the Dutch Dataverse Network. So, I had to try it, and see if it fulfilled my basic needs:

  1. shows under what conditions people can download, modify, and redistribute data;
  2. has a high visibility on the web; and
  3. is open source and developed by @thedataorg.
And it does. Here is the v13.5 data behind the ChEMBL-RDF paper (which you can also query via this SPARQL end point at Uppsala University):


Now, Chris Evelo wondered about the purpose of this. Can we import the data in analysis platforms? How does it relate to efforts our group is involved in, like dbNP etc?

Well, my primary reason for doing the above was to test the system and to serve more than 700MB of RDF triples. Thus, with the testing of the system done, the question now is: can we use it in data analysis platforms? I found this R package by Thomas J. Leeper. It has several methods, some of which are shown in the example code below:

search = dvSearch(
  dv="https://www.dataverse.nl/dvn/",
  list(authorName="willighagen")
)

This shows us the handle of my one and currently only data set, hdl:10411/10279. We need some further information, such as the formats of the record's metadata:

formats = dvMetadataFormats(
  dv="https://www.dataverse.nl/dvn/",
  search$objectid[1]
)

When we combine this, we can retrieve the metadata:

metadata = dvMetadata(
  dv="https://www.dataverse.nl/dvn/",
  search$objectid[1],
  format.type=formats$formatName[1]
)

Now, the package has a dvExtractFileIds() method to extract the file names. But the metadata for my record is not compatible, and you need this code instead (this likely has to do with me not knowing the Dataverse system in enough detail to use it properly):


library(XML)  # provides xmlParse(), xmlChildren(), xmlAttrs(), xmlValue()

extractOtherMatFileIds = function (xml) 
{
    # collect the <otherMat> children of the <codeBook> root
    nodes <- xmlChildren(xmlChildren(xmlParse(xml))$codeBook)
    dscrs <- nodes[names(nodes) == "otherMat"]
    d <- data.frame(matrix(nrow = length(dscrs), ncol = 4)) 
    names(d) <- c("fileName", "fileId", "level", "URI")
    # seq_along() skips the loop when there are no otherMat nodes
    for (i in seq_along(dscrs)) {
        attrs <- xmlAttrs(dscrs[[i]])
        d$fileName[i] <- xmlValue(xmlChildren(dscrs[[i]])$labl)
        d$level[i] <- attrs[names(attrs) == "level"]
        d$URI[i] <- attrs[names(attrs) == "URI"]
        # the file identifier is the part of the URI after "fileId="
        d$fileId[i] <- strsplit(d$URI[i], "fileId=")[[1]][2]
    }
    return(d)
}
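
For readers who do not use R, the same extraction can be sketched in Python with only the standard library. This is a minimal sketch, not part of the dvn package: the sample XML, the example.org URIs, and the function name are made up for illustration, and the real Dataverse metadata may be structured differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata: a <codeBook> root with <otherMat> children,
# mirroring what the R function above expects. Not real Dataverse output.
SAMPLE = """<codeBook>
  <otherMat level="study" URI="https://example.org/FileDownload/?fileId=101">
    <labl>first.nt.gz</labl>
  </otherMat>
  <otherMat level="study" URI="https://example.org/FileDownload/?fileId=102">
    <labl>second.nt.gz</labl>
  </otherMat>
</codeBook>"""


def extract_other_mat_file_ids(xml_text):
    """Return one dict per <otherMat> child of the root element."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall("otherMat"):
        uri = node.get("URI", "")
        label = node.find("labl")
        # the file identifier is whatever follows "fileId=" in the URI
        file_id = uri.split("fileId=", 1)[1] if "fileId=" in uri else None
        rows.append({
            "fileName": label.text if label is not None else None,
            "fileId": file_id,
            "level": node.get("level"),
            "URI": uri,
        })
    return rows


files = extract_other_mat_file_ids(SAMPLE)
for row in files:
    print(row["fileId"], row["fileName"])
```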

So that, similar to the dvn package help PDF, we can continue with:

files = extractOtherMatFileIds(metadata)
info <- dvDownloadInfo(
  dv="https://www.dataverse.nl/dvn/",
  files$fileId[1]
)


We are now ready to download the data but, apart from a small bug in the package, we run into a wall:

data <- dvDownload(
  dv="https://www.dataverse.nl/dvn/",
  files$fileId[1]
)

We do not get access, and get this error instead:


Error in dvDownload(dv = "https://www.dataverse.nl/dvn/", files$fileId[1]) : 
  
    Terms of Use apply.
  
Data cannot be accessed directly...try using URI from dvExtractFileIds(dvMetadata())


This is despite me marking the data as Public. I do not know the reason yet, but it could have to do with setting the CC-BY-SA license. Terms of use do indeed apply, but that doesn't mean an anonymous user cannot download the data. The error seems to originate from the info returned here:

dvQuery(
  dv="https://www.dataverse.nl/dvn/",
  verb = "downloadInfo",
  query = files$fileId[1]
)

This code is part of the download function, and it returns an XML snippet with this part:

<accessRestrictions granted="false">
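
A workflow could test for this flag before attempting a download. The sketch below is in Python and uses only the standard library; the wrapping downloadInfo element and the function name are assumptions, since only the accessRestrictions tag itself is shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element; only <accessRestrictions> comes from the post.
SAMPLE = '<downloadInfo><accessRestrictions granted="false"/></downloadInfo>'


def access_granted(xml_text):
    """Return True/False from the granted attribute, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # iter() also visits the root element itself
    for node in root.iter("accessRestrictions"):
        return node.get("granted", "").lower() == "true"
    return None


print(access_granted(SAMPLE))
```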

The workaround suggested in the error message is to just use the URI, using default R functionality:

download.file(
  files$URI[1],
  files$fileName[1]
)

However, this does not return the data file, but an HTML page that allows one to accept the terms of use. Of course, we can use the browser option in several of the methods, but any user interaction makes downloading data in a workflow setting unrealistic.
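
At the very least, a workflow can detect that it received a web page instead of data and fail early. A minimal Python sketch of such a check follows; the function name and the 512-byte window are arbitrary choices of mine, and the check is a heuristic rather than anything Dataverse-specific.

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristic: does the start of the payload look like an HTML document?"""
    head = payload[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# A terms-of-use page would be flagged; a gzip stream (magic bytes 1f 8b) would not.
print(looks_like_html(b"  <!DOCTYPE html><html><body>Terms of Use</body></html>"))
print(looks_like_html(b"\x1f\x8b\x08\x00rest-of-gzip-stream"))
```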

Now, it should be possible to automate that, no? We should be able to instruct the dvn package and the Dataverse network to always accept Creative Commons licenses, right?

Saturday, August 24, 2013

The Blue Obelisk Data Repository's 10th release

The Blue Obelisk Data Repository (BODR) is not as high profile as other Blue Obelisk projects, but equally important. Well, maybe a tidbit more important: it's a collection of core chemical and physical data, supporting computational chemistry and cheminformatics resources. For example, it is used by at least the CDK, Kalzium, and Bioclipse, but possibly more. Also, it's packaged for major Linux distributions, such as Debian (btw, congrats on their 20th birthday!) and Ubuntu.

It doesn't change very often, but it has just seen its 10th release. Actually, it was the first release in more than three years. But, fortunately, core chemical facts do not change often, nor much. So, this release has a number of data fixes, a few recent experimental isotope measurements, and also includes the new official names of the livermorium and flerovium elements. There is a full overview of changes.

BODR 10 is brought to you by Jean Brefort and Daniel Leidert, and I did some small bits too. Also big thanks to all projects that keep using BODR and contribute by providing high quality feedback reports!

Oh, and if you use BODR isotope or element data, you are kindly invited to cite one of the Blue Obelisk papers.

O'Boyle, N. et al. Open data, open source and open standards in chemistry: The blue obelisk five years on. Journal of Cheminformatics 3, 37 (2011). URL http://www.jcheminf.com/content/3/1/37.
Guha, R. et al. The blue obelisk - interoperability in chemical informatics. Journal of Chemical Information and Modeling 46, 991-998 (2006). URL http://dx.doi.org/10.1021/ci050400b.

Friday, August 16, 2013

Analyzing WikiPathways metabolites in Bioclipse is easy with Groovy

Assume you downloaded a set of GPML pathway files from WikiPathways (doi:10.1371/journal.pbio.0060184) and placed those in a Bioclipse (doi:10.1186/1471-2105-10-397) workspace project, then you can easily analyse all metabolites:


Well, genes and proteins too, but I just happen to like metabolites more.
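
Outside Bioclipse, the same idea (list the database source and identifier of every metabolite DataNode) can be sketched in Python with the standard library. The GPML fragment, the "ExampleDB" source, and the identifiers below are placeholders I made up, and the namespace handling assumes nothing about the GPML version.

```python
import xml.etree.ElementTree as ET

# Made-up GPML-like fragment; database names and IDs are placeholders.
SAMPLE = """<Pathway xmlns="http://pathvisio.org/GPML/2013a" Name="Example">
  <DataNode TextLabel=" ethanol " Type="Metabolite">
    <Xref Database="ExampleDB" ID="EX0001"/>
  </DataNode>
  <DataNode TextLabel="someGene" Type="GeneProduct">
    <Xref Database="ExampleDB" ID="EX0002"/>
  </DataNode>
</Pathway>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def metabolite_xrefs(gpml_text):
    """Return (label, database, identifier) for every Metabolite DataNode."""
    root = ET.fromstring(gpml_text)
    hits = []
    for node in root.iter():
        if local_name(node.tag) != "DataNode":
            continue
        if "Metabolite" not in node.get("Type", ""):
            continue
        for child in node:
            if local_name(child.tag) == "Xref":
                label = node.get("TextLabel", "").strip()
                hits.append((label, child.get("Database"), child.get("ID")))
    return hits


for label, database, identifier in metabolite_xrefs(SAMPLE):
    print(label, database, identifier)
```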

In fact, more interesting than printing the database source and identifier is perhaps opening them in a molecule table. Because I have not updated the BridgeDb plugin to easily load identifier mapping databases, let's just use OPSIN (which recently saw its 1.5.0 release) and accept that we don't get to see all metabolites just yet:
    dataMap = bioclipse.fullPath("/WikiPathways/data/")
    gpmlFiles = new File(dataMap).listFiles()

    structureList = cdk.createMoleculeList()
    gpmlFiles.each { file ->
      def data = new XmlParser().parse(file)
      def metabolites = data.DataNode.findAll{
        it.'@Type'.contains('Metabolite')
      }
      metabolites.each() { node ->
        name = node.'@TextLabel'.trim()
        try {
          molecule = opsin.parseIUPACName(name)
          js.print("IUPAC name found: $name \n")
          structureList.add(molecule)
        } catch (Exception exception) {
          // OK, it was not an IUPAC name
        }
      }
    }
    ui.open(structureList)
Then we get to see this (for pathways with names starting with a "B"):


Anyway, this is just playing around. The point is, we can now hook up metabolite information in WikiPathways with any of the other functionality in Bioclipse, such as toxicity prediction, decision support, structural analysis (with the CDK), database look ups, etc, etc.

Or, and that was actually my primary goal this afternoon, to find all GPML Label elements with IUPAC names. But more on that next week.

Saturday, August 10, 2013

An #Altmetrics page in Wikipedia

Today I learned that #altmetrics did not have a Wikipedia page. It's notable, so I decided to get together some material, citations, etc, and create a page on Altmetrics.


In my opinion, the #altmetrics work is a lot more informative in judging a paper (or researcher) than the journal impact factor (JIF). Still, the JIF is used at many academic institutes to decide on the future of researchers, despite it being uncorrelated with the quality or impact of the paper. This is upsetting people, among them 150 scientists and 75 organizations.

I previously blogged about #altmetrics in:

Monday, August 05, 2013

CDK 1.4.19: the changes, the authors, and the reviewers

Well, with so few issues (really, I am seriously impressed!), I cannot withhold a new stable release from the CDK community. After all, you must be jumping to implement that new 1.4 version in your tools! (Seriously, I am wondering if we can compose a list of active CDK-based projects/code bases, and what CDK version they use....)

This release is another bug fix release, including a set of fixes for long standing regressions, though in code that is not used by a lot of people. It also contains two convenience methods in the IsotopeFactory to get all isotopes, and all isotopes given a certain exact mass and maximal difference with that search mass. It also has an update for element names, now knowing about the Lv and Fl elements.
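
To illustrate what the second convenience method does conceptually, here is a small Python sketch of a lookup by exact mass and maximal difference. It is not the CDK IsotopeFactory API: the function name and the tiny in-memory table are mine, with masses rounded to six decimals.

```python
# (element symbol, mass number, exact mass in u); a tiny illustrative table
ISOTOPES = [
    ("H", 1, 1.007825),
    ("C", 12, 12.000000),
    ("C", 13, 13.003355),
    ("N", 14, 14.003074),
    ("O", 16, 15.994915),
]


def isotopes_near(exact_mass, max_difference):
    """Return all isotopes whose exact mass lies within max_difference."""
    return [
        isotope for isotope in ISOTOPES
        if abs(isotope[2] - exact_mass) <= max_difference
    ]


print(isotopes_near(14.0, 0.01))   # only nitrogen-14 is within 0.01 u
print(isotopes_near(12.5, 0.6))    # both carbon isotopes qualify
```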

In addition to the regression fixes, this release also contains a number of fixes for newer bug reports. For example, it fixes the matching of any bonds in searching, if the target bond is null (yeah, one of those weird corner cases that should not really happen), a similar null problem when adding the hydrogens, the group of the Ds element, and how generated SMILES handles nitrogens in rings with respect to implicit hydrogens.

Of course, there is also some general code and JavaDoc clean up. Still a lot to do there.

The changes
  • Exact mass and natural abundance is not preserved on reading/writing CML. As these attributes are boxed primitives they can be null and throw an exception when unboxed by 'assertEquals'. Before checking the values the attributes are not checked for nullity. a92d226
  • Resolves unit test failure (previous commits). When a double bond is found and there are no 2D coordinates (e.g. unspecified configuration) then return then return '0' for the configuration value. c29a6b4
  • Backport for cdk-1.4.x: use IMolecule in IMoleculeSet 1e0cfdb
  • When a reaction references a molecule which is unknown - automatically create one with that Id. This happens when the set of molecules is defined after the reaction. This commit resolve the test in error 'CML2Test. testBug2697568' 5d8f2f8
  • Resolves two long standing errors in 'cdk-extra'. The method should throw an exception when one tries to attach to an invalid atom number (e.g. 7-chlorohexane). The existing error was using the ParseException constructor incorrectly and thus would throw an error. The constructor is for generating error messages to do with syntax. As this is a semantic error the constructor did not function as intended. Simply replacing the use of this constructor with a normal error message resolves the issue. 4f5b704
  • Two convenience methods to make dealing with BODR data a lot easier c4c63dd
  • Added Lv and Fl names and two missing elements (Uup and Uut) used in Elements.java d5b5947
  • Fixed the import, matching the cdkdeps c3dd062
  • Any bond order is fine, but the IBond must exist (fixes #1305) 1795481
  • Added a positive unit test too bda3e23
  • Basic unit test for bug 1305: null should not match AnyOrderQueryBond 0b6c93d
  • Fixed the group for Ds (fixes #1282) 258b3f0
  • Removing output from unit tests. ddd8fd8
  • Unit tests for optional property writing. fb3d603
  • white-listing for export properties 72eb921
  • Do not output isotope info in PDB-CML files (fixing the last roundtripping inconsistency 084fd88
  • Output the monomers in a sorted manner, to make the CML output more reproducible eb7a953
  • Do not output empty properties b3719ee
  • Factored out the test for PDB atom customization; also fixed the test to register the customizer (but there is more wrong) 1a99bb2
  • Use interfaces instead of implementations 588951f
  • Links from the JavaDoc c86fa16
  • Test if the monomer ID is read 3943f6b
  • Invert test assertion - the molecule is now automatically configured and the structures are the same. 007dcf4
  • Tidy up of HydrogenPlacer including test annotations and additional assertions about placed/unplaced atoms. 597bfe2
  • Resolution for bug1269 - stop the hydrogen placer attempting to place hydrogens on null atoms. 3207c58
  • When ring membership is specified without a number, match any ring atom. Resolves bug 1168. 7c47f79
  • Changed the layout of radicals to have a smaller gap between them. a80c30c
  • Scale AtomRadius in RadicalGenerator 9be7452
  • Yeah, Ant 1.9.x is fine too ea5ad5d
  • SpanningTree documentation 26edde8
  • Fixed generation of SMILES with non-charged nitrogen rings with an implicit hydrogen (bug #1300) 4744fa3
  • Unit test that checks that implicit hydrogens on ring nitrogens are only added when they are negatively charged (bug #1300) 197fa13
  • Determine earlier if the nitrogen needs an explicit implicit hydrogen (you know, like [nH], solving the double bracket problem b036b3e
  • Added a unit test for the double brackets SMILES bug 2503b7f
  • Removing old @cdk.builddepends tag. b02ebdd
  • Removes old @cdk.depends tag from javadoc. d9d83b1
  • Added Magda Oprian to AUTHORS 034c706
The authors
 23  Egon Willighagen
 13  John May
  2  Arvid Berg
  1  Joos Kiener

Arvid, from the Uppsala team, had some patches for the rendering stack that were not yet applied, and I welcome Joos, who submitted, I think, his first patch.

The reviewers
     14  Egon Willighagen 
     17  John May 

Wow, the CDK hits a record low number of unit test fails!

As many of my readers know, John May recently started working as release manager of the CDK development branch, e.g. resulting in the CDK 1.5.3 development release. He has done very important work for the CDK otherwise too. He is clearly beyond the point of an active contributor: he is putting his coding where his mouth is, and is improving the CDK all over the place (read his blog!).

And one of those itches he has (read The Cathedral and the Bazaar) is the unit tests. In fact, I like them too. Seeing in a table how many known issues there are really encourages you to tackle them. And seeing them go down with every commit is very rewarding, at least for me, and apparently for John too.

Anyway, in his drive to make the development branch of the CDK "stable", he is fixing quite a few long standing issues. That is hard work. You first need to get an idea of what goes wrong, when it started going wrong, what caused it, what the code was originally supposed to do, and only then you can start thinking of a fix. Well, he does. Repeatedly. And the result is shocking:

Never in the history of the CDK (well, at least not after we seriously started using unit tests) has the number of fails (and errors) been this low: 66 fails, 3 errors! Seriously, consider that it was even higher before I introduced those ever-failing coverage unit tests! We never really got below some 70 failing tests. Seriously. I mean, come on.

Bottom line is, the current CDK, not just the stable branch but even master, is more stable than CDK 1.0 ever was. Or any CDK version ever (except perhaps the code by Christoph before he shared it with the world ;)

John, thanx! Your efforts give me a lot of motivation to continue to work on the CDK myself!

Friday, August 02, 2013

Choosing #OpenSource licenses: GitHub provides an overview

Picking a software license can be tricky. You want to allow certain things, require a few other things, but certainly not allow that other thing. I can very much recommend Rosen's Open Source Licensing book (website down?), but GitHub is now also providing a quick overview worth checking out:


Via lwn.net.

Rosen, L. Open Source Licensing (2004).

Thursday, August 01, 2013

CTR #8: Unique SMARTS matches against a SMILES string

Of course, I had hardly numbered CTR #7 when I realized that I should solve the SMARTS matching CTR first. But because I had already numbered #7 I had to name this one #8. You know, for historic consistency and not meddling with your lab notebook.... life sucks.

Anyway, Rajarshi wrote a convenient SMARTSQueryTool for the CDK, which makes this CTR rather trivial. The hardest bit is the workaround for a limitation of the edge-based graph matching used by the CDK UniversalIsomorphismTester (cyclopropane and isobutane are indistinguishable at an edge level, but easily separated by matching atom count):

import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.smiles.*;
import org.openscience.cdk.smiles.smarts.*;
import org.openscience.cdk.silent.SilentChemObjectBuilder;
 
SmilesParser sp = new SmilesParser(SilentChemObjectBuilder.getInstance());
atomContainer = sp.parseSmiles("C1CC12C3(C24CC4)CC3");
querytool = new SMARTSQueryTool("*1**1");
 
found = querytool.matches(atomContainer);
if (found) {
  mappings = querytool.getMatchingAtoms()
  hits = 0
  for (int i = 0; i < mappings.size(); i++) {
    atomIndices = mappings.get(i);
    if (atomIndices.size() == 3) {
      // work around the cyclopropane / isobutane equivalence
      hits++
    }
  }
  println "hits: $hits"
 
  mappings = querytool.getUniqueMatchingAtoms()
  uniqueHits = 0
  for (int i = 0; i < mappings.size(); i++) {
    atomIndices = mappings.get(i);
    if (atomIndices.size() == 3) {
      // work around the cyclopropane / isobutane equivalence
      uniqueHits++
    }
  }
  println "unique hits: $uniqueHits"
}
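
Why the atom-count check is needed can be shown with a small Python sketch that is independent of the CDK. It builds the bond-adjacency graph (the line graph) of cyclopropane and of the carbon skeleton of isobutane: both come out identical, while the atom counts differ. The atom numbering and function names are my own, for illustration only.

```python
from itertools import combinations


def line_graph(bonds):
    """Pairs of bond indices that share an atom (the bond-adjacency graph)."""
    return {
        (i, j)
        for (i, first), (j, second) in combinations(enumerate(bonds), 2)
        if set(first) & set(second)
    }


def atom_count(bonds):
    """Number of distinct atoms touched by the given bonds."""
    return len({atom for bond in bonds for atom in bond})


# carbon skeletons as lists of bonds between atom indices
cyclopropane = [(0, 1), (1, 2), (0, 2)]   # three-membered ring
isobutane = [(0, 1), (0, 2), (0, 3)]      # central carbon with three methyls

# at the edge level the two are indistinguishable ...
print(line_graph(cyclopropane) == line_graph(isobutane))
# ... but the number of matched atoms separates them
print(atom_count(cyclopropane), atom_count(isobutane))
```

This is why the loops above only count mappings with exactly three atoms.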

To see all solutions, check the full list of problems in my blog.

CTR #7: Highlight a substructure in the depiction

I have previously blogged about how to use the CDK and CDK-JChemPaint to highlight a substructure in a 2D drawing. I only needed to extend it with SMARTS substructure search code, and I ended up with this (resulting in the drawing on the right):

import java.util.List;
import java.awt.*;
import java.awt.image.*;
import java.util.zip.GZIPInputStream;
import javax.imageio.*;
import org.openscience.cdk.*;
import org.openscience.cdk.interfaces.*;
import org.openscience.cdk.io.*;
import org.openscience.cdk.io.iterator.*;
import org.openscience.cdk.layout.*;
import org.openscience.cdk.renderer.*;
import org.openscience.cdk.renderer.font.*;
import org.openscience.cdk.renderer.generators.*;
import org.openscience.cdk.renderer.visitor.*;
import org.openscience.cdk.renderer.generators.BasicSceneGenerator.Margin;
import org.openscience.cdk.renderer.generators.BasicSceneGenerator.ZoomFactor;
import org.openscience.cdk.silent.*;
import org.openscience.cdk.smiles.smarts.*;
import org.openscience.cdk.templates.*;
import org.openscience.cdk.tools.manipulator.*;
 
int WIDTH = 250;
int HEIGHT = 200;
// the draw area and the image should be the same size
Rectangle drawArea = new Rectangle(WIDTH, HEIGHT);
Image image = new BufferedImage(
  WIDTH, HEIGHT, BufferedImage.TYPE_INT_RGB
);
iterator = new IteratingMDLReader(
  new GZIPInputStream(
    new File("ctr/benzodiazepine.sdf.gz")
      .newInputStream()
  ),
  SilentChemObjectBuilder.getInstance()
)
iterator.setReaderMode(
  IChemObjectReader.Mode.STRICT
)
compound3016 = null
while (iterator.hasNext() && compound3016 == null) {
  mol = iterator.next()
  if ("3016".equals(mol.getProperty(CDKConstants.TITLE)))
    compound3016 = mol
}
compound3016 =
  AtomContainerManipulator
    .removeHydrogens(compound3016)
StructureDiagramGenerator sdg =
  new StructureDiagramGenerator();
sdg.setMolecule(compound3016);
sdg.generateCoordinates();
compound3016 = sdg.getMolecule();
// generators make the image elements
List generators =
  new ArrayList();
generators.add(new BasicSceneGenerator());
generators.add(new ExternalHighlightGenerator());
generators.add(new BasicBondGenerator());
generators.add(new BasicAtomGenerator());
selection = new AtomContainer();
querytool = new SMARTSQueryTool(
  "c1ccc2c(c1)C(=NCCN2)c3ccccc3"
);
querytool.matches(compound3016);
if (querytool.countMatches() > 0) {
  mappings = querytool.getUniqueMatchingAtoms()
  mapping = mappings.get(0)
  for (int i=0; i