Sunday, September 16, 2018

Data Curation: 5% inspiration, 95% frustration (cleaning up data inconsistencies)

Slice of the spreadsheet in the supplementary info.
Just a bit of cleaning I scripted today for a number of toxicology end points in a database published some time ago in the zero-APC Open Access (CC-BY) journal Beilstein Journal of Nanotechnology: NanoE-Tox (doi:10.3762/bjnano.6.183).

I am curating the data so that it can be redistributed in the eNanoMapper database (see doi:10.3762/bjnano.6.165), and thus with ontology annotation (see doi:10.1186/s13326-015-0005-5):

  recognizedToxicities = [
    "EC10": "http://www.bioassayontology.org/bao#BAO_0001263",
    "EC20": "http://www.bioassayontology.org/bao#BAO_0001235",
    "EC25": "http://www.bioassayontology.org/bao#BAO_0001264",
    "EC30": "http://www.bioassayontology.org/bao#BAO_0000599",
    "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
    "EC80": "http://purl.enanomapper.org/onto/ENM_0000053",
    "EC90": "http://www.bioassayontology.org/bao#BAO_0001237",
    "IC50": "http://www.bioassayontology.org/bao#BAO_0000190",
    "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
    "MIC":  "http://www.bioassayontology.org/bao#BAO_0002146",
    "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060",
    "NOEL": "http://purl.enanomapper.org/onto/ENM_0000056"
  ]  

That leaves 402(!) variants. Many do not have an ontology term yet, and I filed a feature request.
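
A minimal sketch of how such a map can be used to split the labels found in a spreadsheet column into those with and without an ontology term. The `labels` list is made-up example input and `splitByRecognition` is a hypothetical helper, not part of my actual script; only a few entries of the map are repeated so the snippet runs on its own.

```groovy
// Subset of the map above, repeated so this sketch is self-contained.
def recognizedToxicities = [
  "EC50": "http://www.bioassayontology.org/bao#BAO_0000188",
  "LC50": "http://www.bioassayontology.org/bao#BAO_0002145",
  "NOEC": "http://purl.enanomapper.org/onto/ENM_0000060"
]

// Hypothetical helper: count how often each label occurs, separately
// for labels with and without an ontology term.
def splitByRecognition = { List<String> labels, Map<String, String> known ->
  def result = [recognized: [:], unrecognized: [:]]
  labels.each { raw ->
    def label = raw.trim()
    def bucket = known.containsKey(label) ? "recognized" : "unrecognized"
    result[bucket][label] = (result[bucket][label] ?: 0) + 1
  }
  result
}

// Made-up example input, not taken from NanoE-Tox.
def labels = ["EC50", " EC50", "LC50", "48h EC50", "NOEC", "48h EC50"]
def split = splitByRecognition(labels, recognizedToxicities)

// List the variants that still need a term, with their frequency.
split.unrecognized.each { label, count -> println "${label}\t${count}" }
```

Sorting the unrecognized labels by frequency would be one way to decide which terms to request first.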

Units:

  recognizedUnits = [
    "g/L": "g/L",
    "g/l": "g/l",
    "mg/L": "mg/L",
    "mg/ml": "mg/ml",
    "mg/mL": "mg/mL",
    "µg/L of food": "µg/L",
    "µg/L": "µg/L",
    "µg/mL": "µg/mL",
    "mg Ag/L": "mg/L",
    "mg Cu/L": "mg/L",
    "mg Zn/L": "mg/L",
    "µg dissolved Cu/L": "µg/L",
    "µg dissolved Zn/L": "µg/L",
    "µg Ag/L": "µg/L",
    "fmol/L": "fmol/L",
    
    "mmol/g": "mmol/g",
    "nmol/g fresh weight": "nmol/g",
    "µg Cu/g": "µg/g",
    "mg Ag/kg": "mg/kg",
    "mg Zn/kg": "mg/kg",
    "mg Zn/kg  d.w.": "mg/kg",
    "mg/kg of dry feed": "mg/kg", 
    "mg/kg": "mg/kg",
    "g/kg": "g/kg",
    "µg/g dry weight sediment": "µg/g", 
    "µg/g": "µg/g"
  ]
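
A small sketch of a lookup against such a unit map. It assumes that trimming and collapsing runs of whitespace before the lookup is acceptable; `normalizeUnit` is a hypothetical helper, and the last key in the subset below is written with a single space for that reason.

```groovy
// Subset of the map above, repeated so this sketch is self-contained.
// Assumption: keys are stored with single spaces only.
def recognizedUnits = [
  "mg/L": "mg/L",
  "mg Ag/L": "mg/L",
  "µg dissolved Cu/L": "µg/L",
  "mg Zn/kg d.w.": "mg/kg"
]

// Hypothetical helper: trim, collapse whitespace runs, then look up.
// Returns null for units that are not in the map.
def normalizeUnit = { String raw, Map<String, String> known ->
  def key = raw.trim().replaceAll(/\s+/, " ")
  known[key]
}

println normalizeUnit("mg Zn/kg  d.w.", recognizedUnits)
println normalizeUnit("ppm", recognizedUnits)
```

Returning null rather than guessing keeps unknown units visible, so they can be reviewed and added to the map by hand.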

Oh, and don't get me started on the actual values: endpoint values come as single numbers, as ranges, with errors, etc. That variety is not the problem in itself, but the lack of FAIR-ness makes the whole thing really hard to process. I now have something like:

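  // Normalize decimal commas to decimal points
  prop = prop.replace(",", ".")
  if (prop.substring(1).contains("-")) {
    // A range like "1-5". Skipping the first character keeps a
    // leading minus sign from being read as a range separator.
    // Stored as a plain string for now.
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("±")) {
    // A value with an error, like "5 ± 1". Also stored as a string,
    // under the same property as the ranges.
    rdf.addTypedDataProperty(
      store, endpointIRI, "${oboNS}STATO_0000035",
      prop, "${xsdNS}string"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  } else if (prop.contains("<")) {
    // Values like "<10" are not stored yet
  } else {
    // Everything else is treated as a single number
    rdf.addTypedDataProperty(
      store, endpointIRI, "${ssoNS}has-value", prop,
      "${xsdNS}double"
    )
    rdf.addDataProperty(
      store, endpointIRI, "${ssoNS}has-unit", units
    )
  }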
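
The branches above keep ranges and values with errors as strings. A rough sketch of how the same cases could be split into numeric parts is given here. It is not what the script currently does: `parseValue` is a hypothetical helper, the result keys are invented for illustration, and it assumes plain decimal numbers only (no exponents, no units inside the value).

```groovy
import java.util.regex.Pattern

// A plain decimal number, optionally negative (assumption: no exponents).
def num = "-?\\d+(?:\\.\\d+)?"
def range    = Pattern.compile("(" + num + ")\\s*-\\s*(" + num + ")")
def meanErr  = Pattern.compile("(" + num + ")\\s*±\\s*(" + num + ")")
def lessThan = Pattern.compile("<\\s*(" + num + ")")

// Hypothetical helper: describe a raw value string as a small map.
def parseValue = { String raw ->
  // Same normalization as in the script: decimal comma to decimal point.
  def prop = raw.trim().replace(",", ".")
  def m = range.matcher(prop)
  if (m.matches()) {
    return [kind: "range",
            min: Double.parseDouble(m.group(1)),
            max: Double.parseDouble(m.group(2))]
  }
  m = meanErr.matcher(prop)
  if (m.matches()) {
    return [kind: "meanError",
            mean: Double.parseDouble(m.group(1)),
            error: Double.parseDouble(m.group(2))]
  }
  m = lessThan.matcher(prop)
  if (m.matches()) {
    return [kind: "lessThan", limit: Double.parseDouble(m.group(1))]
  }
  if (Pattern.matches(num, prop)) {
    return [kind: "single", value: Double.parseDouble(prop)]
  }
  // Anything else is kept as text for manual curation.
  [kind: "unparsed", text: prop]
}

// Made-up example values, not taken from NanoE-Tox.
["1,5-2,5", "5.0 ± 0.3", "<10", "-5", "n/a"].each {
  println parseValue(it)
}
```

Because the whole string has to match a pattern, a lone negative number such as "-5" ends up as a single value and not as a range, which is what the `substring(1)` check in the script also guards against.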

But let me make clear: I can actually do this, add more data to the eNanoMapper database (with Nina), only because the developers of NanoE-Tox made their data available under an Open license (CC-BY, to be precise), allowing me to reuse, modify (change format), and redistribute it. Thanks to the authors. Data curation is expensive, whether I do it or the authors of the database do. They already did a lot of data curation. But it is only because of Open licenses that we have to do this just once.