Sunday, March 30, 2014

Linked Open Drug Data: three years on

Almost three years ago I collaborated with others in the W3C Health Care and Life Sciences interest group. One of the results was a paper in the special issue around the semantic web conference at one of the biannual, national ACS meetings (look at this nice RDFa-rich meeting page!). My contribution was around ChEMBL-RDF, which I recently finally published, though it was already described earlier in an HCLS note.

Anyway, when this paper reached the most viewed paper position in the JChemInf journal, and I tweeted that event, I was asked for an update of the linked data graph (the darker nodes are the twelve the LODD task force worked on). A good question indeed, particularly if you consider the name, and that not all of the data sets were really Open (see some of the things on Is It Open Data?). UMLS is not open; parts of SIDER and STITCH are, but not all; CAS is not at all, and KEGG Cpd has since been locked down. Etc. A further issue is that the Berlin node in the LODD network, which hosted many data sets (Open or not), is down. Chem2Bio2RDF seems down too.

Bio2RDF is still around, however (doi:10.1007/978-3-642-38288-8_14). At this moment, it is a considerable part of the current Linked Drug Data network. It provides 28 data sets. It even provides data from KEGG, but I still have to ask them what they had to do to be allowed to redistribute the data, and whether that applies to others too. Open PHACTS is new and integrated a number of data sets, like ChEMBL, WikiPathways, ChEBI, a subset of ChemSpider, and DrugBank. However, it does not expose that data as Linked Data. There is also the new (well, compared to three years ago :) Linked Life Data which exposes quite a few data sets, some originating from the Berlin node.

Of course, DBpedia is still around too. Also important is that more and more databases themselves provide RDF, like UniProt which has a SPARQL endpoint in beta, WikiPathways, PubChem, and ChEMBL at the EBI. And more will come, /me thinks.

I am aggregating data in a Google Spreadsheet, but obviously this needs to go onto the DataHub. And a new diagram needs to be generated. And I need to figure out how things are linked. But the biggest question is: where are all the triples with the chemistry behind the drugs? Like organic syntheses, experimental physical and chemical data (spectra, pKa, logP/logD, etc), crystal structures (I think COD is working on an RDF version), etc, etc. And, what data sets am I missing in the spreadsheet (for example, data sets exposed via OpenTox)?
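
To give an idea of what "figuring out how things are linked" could look like in practice, here is a minimal Python sketch. It assumes the triples are already available as plain (subject, predicate, object) tuples and it uses the host name of each URI as a crude stand-in for "the data set"; both are my simplifications, not how any of the resources above work. The `example.org` and `example.net` hosts and the triples themselves are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def dataset_of(uri):
    """Use the host name of a URI as a crude stand-in for its data set."""
    return urlparse(uri).netloc


def count_links(triples):
    """Count triples whose subject and object live in different data sets."""
    links = Counter()
    for subject, _predicate, obj in triples:
        if not obj.startswith("http"):
            continue  # skip literals, they cannot link to another data set
        source, target = dataset_of(subject), dataset_of(obj)
        if source != target:
            links[(source, target)] += 1
    return links


# Made-up example triples with placeholder hosts; this is not real data.
triples = [
    ("http://drugs.example.org/d1",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://targets.example.net/x9"),
    ("http://drugs.example.org/d1",
     "http://www.w3.org/2000/01/rdf-schema#label",
     "some drug"),
    ("http://drugs.example.org/d2",
     "http://www.w3.org/2002/07/owl#sameAs",
     "http://targets.example.net/x7"),
    ("http://drugs.example.org/d2",
     "http://www.w3.org/2000/01/rdf-schema#seeAlso",
     "http://drugs.example.org/d1"),
]

for (source, target), count in count_links(triples).items():
    print(f"{source} -> {target}: {count} link(s)")
```

The resulting counts per pair of data sets are the kind of numbers that could become the edges of a new diagram. Real data sets often share a host or span several, so a proper version would need a curated mapping from URI patterns to data sets.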

Friday, March 28, 2014

"Bridging WikiPathways and metabolomics data using the ChEBI ontology"

This week the ChEBI 3rd User Workshop took place, and I presented how WikiPathways is using ChEBI, and how I have been using it in the BridgeDb identifier mapping database for metabolites, and in mapping metabolites to WikiPathways using the ChEBI ontology.

Breaking News: CC-NC only for personal use!

As some of us already interpreted it, the Non-Commercial (NC) clause of the Creative Commons (CC) licenses is a killer. A German court has ruled that the NC clause means that the material is only for personal use. And that is literally breaking news! It means that such material is not Open Access in the context of (European) universities. I learned from Lessig's Free Culture (a must read) that academic use falls under fair use under USA law, but as far as I know this is not the case in Europe. It effectively means that all journals using a CC license with the NC clause now officially do not fall under most Open Access directives (AFAICS but IANAL).

(Image from WikiMedia.)

Sunday, March 16, 2014

Publishers #fail to innovate knowledge dissemination

Source: Wikipedia, public domain.
I have ranted often enough about publishing. I have also often enough indicated how publishers (or journals) could improve their act. There is enough to find in the archives of this blog. Even the more innovative publishers have a long way to go. The reason I blog about this is to explain why I can be happy with something like the rrdf package (doi:10.7287/peerj.preprints.185v3). Seriously, it is far away from where my heart is: understanding the underlying chemistry of biology. Really, I would rather study how phosphorylation really causes signaling; at some level this is just a protein interacting with another protein, small molecule, or something. But what? Still, the package makes me happy. No one else is doing it; I need it. We all need it to make science more reproducible. We need good tools and we do not need excuses for not doing it right (tm).

And just to make the point, we do need tools like this. We did 20 years ago. And publishers have done way too little. I really understand innovation is slow and expensive. But, come on, use your imagination. I cannot solve everything in the world and rely on others to implement stuff too. And here is an idea.

What if publishers could actually solve this problem? I know plenty of people are talking about it, and give it funny names, like nanopublications. That idea too has existed for more than 20 years now. In fact, CMLRSS is not far from the nanopublication (doi:10.1021/ci034244p). And it was functional. Really, the implementation and standard are not even the issue. The key is adoption. Adoption may be slow, but it must exist. And for adoption to happen, you need commitment. For example, by promising that the time and resources invested in the adoption will have a return on investment. For example, have a guarantee that your solution won't go commercial at some point (causing a vendor lock-in!).

But that something must happen is clear if you return to the science. Have you ever tried to do some theoretical study of some phenomenon? Then you know that data availability is a problem. And this data scarcity is exactly the reason why data has become valuable, causing people to sit on top of it like a hen on her egg(s). If you have ever been involved in getting some good quality data together (ever noticed that much commercial data does not have the data you really need?), you know how expensive data is then. Recovering it costs more after the publishing process than before. Really, the original notebook has more information and is likely more informative than the formal publication.

Not only has the publishing model itself become more expensive than needed (just think about the APC of newer publishers, like PeerJ!), publishers also make access to the data more expensive than really needed.

This is a huge fail in the Western approach to science: we enormously disrespect data.

If you are not convinced, please give me answers to these questions (read active ingredient for "drug"):

  1. how were the CYP experiments performed for the top ten selling drugs and what are the main human transformations?
  2. what are the experimental errors on pKa measurements of the top ten selling drugs (uncharged and singly charged, positive and negative)?
  3. how were the logP values measured for the top ten selling drugs and at what pH?
  4. what are the size distributions of samples of nanomaterials reported in literature?
  5. what are the different forms of a protein (not shape, but in terms of structure; so, phosphorylation states, exact position, relevant SNPs, etc) of the top ten proteins relevant to pancreatic cancer?
If you can answer any of these questions in less than one hour with provenance (list of DOIs and/or PubMed IDs), then I would love to hear that. It would give an estimate of the problem. However, my estimate currently is that you cannot fully answer these questions, and most certainly not within one day. Had publishers taken their goal of knowledge dissemination seriously in the past 20 years, it would have been a lot simpler. But they failed. Why should I trust them to do better in the next 20 years? Meanwhile, with the limited funding I get, I will keep being happy with things I can contribute.

Now, if you do not understand why those details matter, start doing a multivariate statistics course. </rant>

Saturday, March 15, 2014

CiteULike to Twitter? IFTTT!

Twitter is great! Minutes after I asked this online:
Sarah Pohl replied:
Alex Henderson complemented the answer, informing me that RSS is supported:
So, I signed up to IFTTT and made a recipe, and life is good:
I wonder if I can get this to work with CMLRSS (see doi:10.1021/ci034244p) in some way... it would be brilliant to route molecular structures from the Crystallography Open Database into ChemSpider and PubChem automatically, wouldn't it?
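
For those wondering what such an RSS-to-Twitter recipe boils down to, here is a rough Python sketch of the idea. It is certainly not how IFTTT implements it. It assumes a plain RSS 2.0 feed without XML namespaces (a real feed may well use a namespaced RSS flavor), the `SAMPLE` feed is a made-up placeholder, and `tweets_from_rss` is just a name I picked. Actually posting to Twitter is left out.

```python
import xml.etree.ElementTree as ET

MAX_LENGTH = 140  # length limit of a tweet at the time of writing


def tweets_from_rss(rss_xml):
    """Turn each RSS <item> into a 'title link' string that fits in a tweet."""
    root = ET.fromstring(rss_xml)
    tweets = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        room = MAX_LENGTH - len(link) - 1  # one space between title and link
        if len(title) > room:
            # shorten the title and mark the cut with three dots
            title = title[: max(room - 3, 0)] + "..."
        tweets.append(f"{title} {link}")
    return tweets


# Made-up placeholder feed; not taken from any real service.
SAMPLE = """<rss version="2.0"><channel>
  <title>Hypothetical library feed</title>
  <item>
    <title>An example paper title</title>
    <link>http://example.org/article/1</link>
  </item>
</channel></rss>"""

for tweet in tweets_from_rss(SAMPLE):
    print(tweet)
```

The length check counts the link in full; a link shortener on the receiving side would leave a bit more room for the title.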

Saturday, March 08, 2014

Reviewing CDK patches in the Maven era

Three weeks ago the CDK project migrated from Ant to Maven as the primary build tool. That means that my workflow for making and, importantly, reviewing patches is completely turned upside down. Well, that happens.

My patch reviewing workflow looks like:
  1. run the test suite and capture the number of JUnit Errors and Fails
  2. apply the patch and check if things still compile
  3. run the test suite and capture the number of JUnit Errors and Fails
  4. compare the number of Errors and Fails before and after
  5. check if JavaDoc is in order
  6. check if there is new unit testing where appropriate
  7. check for new PMD issues
For these issues I always had CDK Nightly as a backup, and this is now replaced by Jenkins; e.g. check this instance at the EBI. This workflow now translates to something like this (the extraction of the results was suggested by John):
  1. mvn clean compile test -Dmaven.test.failure.ignore=true
  2. cat */*/target/surefire-reports/* | grep "Tests run" | sed -e "s/, Time elapsed.* /\|/" | sort -t'|' -k2 > prepatch.txt
  3. git am / git cherry-pick
  4. repeat steps 1 and 2, and save as postpatch.txt
  5. diff -u prepatch.txt postpatch.txt
  6. repeat steps 1-5, if needed.
    And if all is good, then the diff should show no new fails and possibly even fewer. During a set of patches, things may be temporarily failing, such as in this case:

    diff -u prepatch.txt postpatch.txt 
    --- prepatch.txt        2014-03-08 11:41:13.520240111 +0100
    +++ postpatch.txt       2014-03-08 12:59:21.022609259 +0100
    @@ -3,6 +3,14 @@
     Tests run: 22, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.atomtype.ReactionStructuresTest
     Tests run: 1, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.CDKTest
     Tests run: 10, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.formula.rules.IsotopePatternRuleTest
    +Tests run: 15, Failures: 0, Errors: 10, Skipped: 0|org.openscience.cdk.graph.CyclesTest
    +Tests run: 14, Failures: 0, Errors: 14, Skipped: 0|org.openscience.cdk.graph.EdgeShortCyclesTest
    +Tests run: 12, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.EssentialCyclesTest
    +Tests run: 31, Failures: 0, Errors: 18, Skipped: 0|org.openscience.cdk.graph.InitialCyclesTest
    +Tests run: 14, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.MinimumCycleBasisTest
    +Tests run: 14, Failures: 0, Errors: 12, Skipped: 0|org.openscience.cdk.graph.RelevantCyclesTest
    +Tests run: 13, Failures: 0, Errors: 11, Skipped: 0|org.openscience.cdk.graph.TripletShortCyclesTest
    +Tests run: 14, Failures: 0, Errors: 14, Skipped: 0|org.openscience.cdk.graph.VertexShortCyclesTest
     Tests run: 2, Failures: 2, Errors: 0, Skipped: 0|
     Tests run: 14, Failures: 5, Errors: 0, Skipped: 0|org.openscience.cdk.modeling.builder3d.ForceFieldConfiguratorTest
     Tests run: 15, Failures: 1, Errors: 0, Skipped: 0|org.openscience.cdk.qsar.descriptors.atomic.AtomDegreeDescriptorTest

    Oh, and intermediate compiles I can do without running the tests with:

        mvn compile -DskipTests
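
The comparison of prepatch.txt and postpatch.txt can also be scripted. Below is a minimal Python sketch, assuming the two files contain lines in exactly the "Tests run: ...|class name" shape shown in the diff above. The function names `parse_summary` and `regressions` are mine, and a class missing from the pre-patch summary is treated as having had zero fails and errors, which is an assumption too.

```python
import re

# Matches lines like the ones in prepatch.txt and postpatch.txt:
# "Tests run: 22, Failures: 1, Errors: 0, Skipped: 0|org.example.SomeTest"
LINE = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)\|(.*)"
)


def parse_summary(text):
    """Map test class name -> (failures, errors) for each summary line."""
    results = {}
    for line in text.splitlines():
        match = LINE.search(line)
        if match:
            _run, failures, errors, _skipped, name = match.groups()
            results[name.strip()] = (int(failures), int(errors))
    return results


def regressions(pre, post):
    """Return classes whose failure or error count went up after the patch."""
    worse = {}
    for name, (fails, errs) in post.items():
        # Assumption: a class absent before the patch had no fails or errors.
        old_fails, old_errs = pre.get(name, (0, 0))
        if fails > old_fails or errs > old_errs:
            worse[name] = (fails - old_fails, errs - old_errs)
    return worse


# Two lines copied from the diff output above.
pre_text = (
    "Tests run: 1, Failures: 1, Errors: 0, Skipped: 0|"
    "org.openscience.cdk.CDKTest"
)
post_text = pre_text + (
    "\nTests run: 15, Failures: 0, Errors: 10, Skipped: 0|"
    "org.openscience.cdk.graph.CyclesTest"
)

for name, (more_fails, more_errs) in regressions(
    parse_summary(pre_text), parse_summary(post_text)
).items():
    print(f"{name}: +{more_fails} fails, +{more_errs} errors")
```

In real use the two strings would be read from the files instead. Note that lines with an empty class name (there is one in the diff above) all end up under the same empty key, so those still need a manual look.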