Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Timeline, Scatterplot, and Numeric Facets

OpenRefine displays data in a tabular way, much like a spreadsheet. When a project is first loaded, a row is equivalent to a record, and columns represent the types of data within the records, or the variables. Any individual data entry in the interface is called a value, and values are contained in individual cells.

Though the text facet is one of the most commonly used facets in OpenRefine (Text facet counts unique values and displays them, allowing quick and easy correction of data-entry errors), other commonly used ones are Timeline, Scatterplot, and Numeric facets.

We’ll use the Timeline facet on the Treatment Date Column in the sample dataset. First the dates, which have been imported into the project in text format, must be changed to date format. From the Treatment Date column’s drop down arrow, choose Edit Cells – Common transforms – To date. The dates will then convert to an ISO standard date time format, in which times have been appended with all zeroes, as times were not available in the original data. Although this looks messy, it will allow the Timeline facet to work.

From the column’s drop down, choose Facet – Timeline facet and review the resulting chart.

The dates are listed from the beginning to the end of 2019, with the bars representing the frequency of treatments on each date. Use the sliders on either end of the chart to hone in on a range of dates, or even a single date. Notice that the number of rows you’re viewing based on the range of dates you choose adjusts in real time.

The scatterplot facet is a quick and easy way to get a visual representation of how the values in two numeric columns relate to each other. We can try this facet on the Age and Weight columns of the sample dataset. First, from the “All” drop down arrow in the far left, choose “Edit Columns – Reorder/Remove Columns.” Weight appears at the bottom of this list; drag it close to the top, so it appears just after Age. Then from the drop down arrows on both the Age and Weight columns, choose Edit cells – Common transforms – To number.

From the Age column drop down arrow, choose Facet – Scatterplot facet. OpenRefine will automatically generate the scatterplot graph using the number-based column to the right of the column we selected, which is Weight.


Like the scatterplot facet, the numeric facet provides a quick visual representation of numeric values in a column, by counting the values' frequencies of occurrence and displaying them in a small histogram.

To begin, from the Age column drop down arrow in the sample dataset choose Facet – Numeric facet.

Much like the Timeline facet, this will result in a graph with sliders in the Facet/Filter area on the left of the screen. Use the sliders to narrow in on a subset of the dataset’s ages. However, there’s something strange in this example – the patients’ ages go up to 230! This happened because in the NEISS coding manual, coders were instructed to put a 2 in the first space of the age code whenever a patient’s age was expressed in months. (Ages from 0 – 4 weeks were grouped in the 1-month-old category). In the next exercise we'll transform these values to display properly as a fraction of a year.

A common necessary correction in biomedical datasets is correction of coding-based data errors. As we saw in the Numeric facets exercise, NEISS coders were instructed to record the ages of patients with age expressed in months with a “2” at the beginning, making it appear at first glance that the patient is over 200 years. While intended to keep ages in years separate for processing purposes from ages in months, this will require correction either in the statistical analysis or other programs with which researchers will crunch this data, or here in OpenRefine before the dataset reaches that stage. The correction needed is a simple mathematical correction that can be done by subtracting 200, and dividing the resulting months by 12.0 to see the months expressed as a fraction of a year. The “’0” after the 12 is necessary when working with GREL expressions – it tells OpenRefine to express the result as a decimal.

Steps to Transform the Coded Ages

  • From the Age drop down, choose Facet - Numeric Facet. Use the sliders in the graph on the left to isolate only the ages from 200 and up.
  • From the Age drop down, choose Edit Cells - Transform

     

  • Subtract 200 from each value to remove the unneeded “2” and isolate age in months. Divide by 12.0 (the .0 ensures a result in decimal format)
  • To do this, in the GREL expression box write the simple formula: (value - 200)/12.0
  • GREL supports the mathematical operators: +, -, /, and *