Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Transforming Coded Numbers with Math Functions

Screenshot showing coded numbers for age from a dataset

A common necessary correction in biomedical datasets is correction of coding-based data errors. As we saw in the Numeric facets exercise, NEISS coders were instructed to record the ages of patients with age expressed in months with a “2” at the beginning, making it appear at first glance that the patient is over 200 years. While intended to keep ages in years separate for processing purposes from ages in months, this will require correction either in the statistical analysis or other programs with which researchers will crunch this data, or here in OpenRefine before the dataset reaches that stage. The correction needed is a simple mathematical correction that can be done by subtracting 200, and dividing the resulting months by 12.0 to see the months expressed as a fraction of a year. The “’0” after the 12 is necessary when working with GREL expressions – it tells OpenRefine to express the result as a decimal.

Steps to Transform the Coded Ages

  • From the Age drop down, choose Facet - Numeric Facet. Use the sliders in the graph on the left to isolate only the ages from 200 and up.
  • From the Age drop down, choose Edit Cells - Transform

     

Screenshot showing the menu path to writing transformations in OpenRefine

  • Subtract 200 from each value to remove the unneeded “2” and isolate age in months. Divide by 12.0 (the .0 ensures a result in decimal format)
  • To do this, in the GREL expression box write the simple formula: (value - 200)/12.0
  • GREL supports the mathematical operators: +, -, /, and *

Screenshot showing a custom text tranformation to update a coded number