Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Sort on Multiple Columns

Sorts are helpful for exploring your data in OpenRefine, and for learning more when numeric or other facets don't provide much information. Sorts can be explored using the sample dataset's Alcohol Involved and Drug Involved columns. The values in these columns denote whether alcohol or drug use was involved in the patient’s injury.

Sorting on Multiple Columns Steps

  • From the All drop-down arrow, choose Edit Columns – Reorder/Remove columns to drag the Alcohol involved and Drug involved columns up, or, to the left.
  • Try a numeric facet on “Alcohol involved.” What does it show? It seems that from the dataset of over 1500 records, only 10 are coded 1 for alcohol involvement. Drag the number slider to the right, and any designation over zero will be counted as the “1” value.

Screenshot showing the result of a Numeric facet on one column in OpenRefine

  • Close the numeric facet and move to a Sort. From the Alcohol involved column’s drop down arrow, choose Sort.

Screenshot showing Sort options in OpenRefine

  • In the dialog box which pops up, you can specify the type of data in the column, which in this case is numbers. You’ll then have the option to sort from smallest first or largest first. In this case, choose largest first. The ten records that have a “1” for alcohol involvement will then sort to the top. Expand the number of records visible on your screen to 25, and notice that it is still only the top ten that have a “1” value.
  • In the Drug Involved column do the same sort, by numbers, from largest to smallest.

Screenshot showing Sort options on a different column in OpenRefine

  • Notice that now by your smallest first or largest first options, a new checkbox has appeared called “sort by this column alone.” If you choose this option, it will restart the sort using this column. You can employ this option, but if so you will lose the previous sort. Choose the "largest first" option instead.
  • With both columns sorted on largest first, it is clear that there is only one record in the dataset in which there was both alcohol and drug involvement in the patient's injury.

Screenshot showing the result of a Sort by two columns in OpenRefine

  • The “Sort by” dialog box also gives you many options for how to handle the various types of values that might be present in the columns. You can choose in which order you would like to see valid values, errors, and blank cells. Remember that having a handle on blank cells can be handy for finding all matching blanks and deleting them.