Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Sort and Facet on Multiple Columns

Sorts are helpful for exploring your data in OpenRefine, and for learning more when numeric or other facets don't provide much information. Sorts can be explored using the sample dataset's Alcohol Involved and Drug Involved columns. The values in these columns denote whether alcohol or drug use was involved in the patient’s injury.

Sorting on Multiple Columns Steps

  • From the All drop-down arrow, choose Edit Columns – Reorder/Remove columns to drag the Alcohol involved and Drug involved columns up, or, to the left.
  • Try a numeric facet on “Alcohol involved.” What does it show? It seems that from the dataset of over 1500 records, only 10 are coded 1 for alcohol involvement. Drag the number slider to the right, and any designation over zero will be counted as the “1” value.

  • Close the numeric facet and move to a Sort. From the Alcohol involved column’s drop down arrow, choose Sort.

  • In the dialog box which pops up, you can specify the type of data in the column, which in this case is numbers. You’ll then have the option to sort from smallest first or largest first. In this case, choose largest first. The ten records that have a “1” for alcohol involvement will then sort to the top. Expand the number of records visible on your screen to 25, and notice that it is still only the top ten that have a “1” value.
  • In the Drug Involved column do the same sort, by numbers, from largest to smallest.

  • Notice that now by your smallest first or largest first options, a new checkbox has appeared called “sort by this column alone.” If you choose this option, it will restart the sort using this column. You can employ this option, but if so you will lose the previous sort. Choose the "largest first" option instead.
  • With both columns sorted on largest first, it is clear that there is only one record in the dataset in which there was both alcohol and drug involvement in the patient's injury.

  • The “Sort by” dialog box also gives you many options for how to handle the various types of values that might be present in the columns. You can choose in which order you would like to see valid values, errors, and blank cells. Remember that having a handle on blank cells can be handy for finding all matching blanks and deleting them.

For most data cleaning tasks, faceting on one column at a time works well in order to isolate discrete fields of data. However faceting by multiple columns is possible in OpenRefine, and this can include facets on different types of values, such as Text and Numeric. The image below shows simultaneous facets on the number-based Sex column and the Text-based Treatment_Date and Other_Diagnosis columns from the sample dataset.

By “including” or narrowing in on certain entries within a facet, the other facets will update (narrow down) based on the first entry you selected. This essentially takes a slice of your data, one variable at a time.

Steps for employing facets to identify and clean data-entry errors are outlined in Galter Library's introductory OpenRefine class, Cleaning Spreadsheet Data with OpenRefine. While much data cleaning can be done by faceting on one column at a time, it’s also possible to facet on two columns or more at once. You can apply these multiple facets manually to hone in on highly specified subsets of your data. You can even combine text-based and numeric facets at the same time. Using a GREL formula, you can achieve this in one step.

Facet on Multiple columns with a GREL Expression

  • From the drop down arrow of a column in the sample dataset, choose Facet - Custom text facet. In this example we will use the Sex column and combine with a facet on Age, to facet the dataset on sex and age at once.

  • In the Expression dialog box that pops up, enter the formula: join([ cells[“Col 1"].value, cells[“Col 2"].value ], ",") in the Expression box. Substitute the columns you want to facet together in the Col 1 and Col 2 spaces - in this case, Sex and Age. Inside the last quote marks after value, use any separator you would like to keep the two columns’ values visually separated.

  • As the results show, using this custom facet formula allows us to hone in on male or female patients of particular ages.