Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Blanking Down

In the previous exercise we identified duplicate records in the sample dataset. With these records identified, we can employ a quick technique to delete the duplicate records. The technique, called "blanking down," will be used to create blanks in the unique identifier fields of each duplicate record. Once this field is blank, all the records containing a blank in the field can be grouped and deleted at once.

Blanking Down Steps

  • Make sure that the Duplicates facet is still applied from the previous exercise.
  • From the CPSC_Case_Number drop down arrow, choose “Sort – Sort Cell values as text,” since we have not changed them to numeric format. The sort will group together all the rows that have the same CPSC_Case_Number and put them in order.

Screenshot showing the menu path to Sort in OpenRefine

Screenshot showing Sort options in OpenRefine

  • A new Sort option should appear near your top navigation. From that drop down, choose Reorder rows permanently. This enables the next step, when we work to blank down all the rows that are duplicates.

Screenshot showing an option that allows reordering rows permanently in OpenRefine

  • When the next step is employed an update will happen very quickly. From the CPSC_Case_Number column drop-down choose Edit cells – Blank down. Through this action, OpenRefine automatically sought the second occurrence of the duplicate CPSC_Case_Number values and made them blank. Since the Duplicates facet is still applied to the CPSC_Case_Number column, the facet has now updated itself to show that 8 rows have matches for duplicates in this field. Only now, these rows have duplicate values for CPSC_Case_Number because that duplicate value is “blank.”
  • Take the Duplicates filter off by x-ing out.

Screenshot showing the menu path to blank down cells in OpenRefine

Screenshot showing the result of blanking down, eight rows which match for having blank cells

  • In the next exercise, we'll learn a quick trick for deleting these blank rows using Facet by Blank.