Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Facet by Blank

After blanking down the duplicate CPSC_Case_Number values from the sample dataset, even if we take off the Duplicates facet, we know that within this table, the corresponding 8 records still have blank values for CPSC_Case_Number. Using OpenRefine's Facet by Blank feature, we can quickly identify the records that have blank values in the CPSC_Case_Number field and delete them.

Facet by Blank Steps

  • From the CPSC_Case_Number column drop-down choose Facet – Customized facets – Facet by blank (null or empty string)

Screenshot showing the menu path to Facet by Blank

  • In the resulting facet window on the left, you’ll see that there are 8 records that are “true” for having a blank value in the column CPSC_Case_Number. Choose “true” in this facet to isolate the 8 rows.
  • From the “All” column drop-down towards the left, choose Edit rows – Remove matching rows.

Screenshot showing the menu path from the All column to remove matching rows

  • With the Facet by Blank still in place, you can see that OpenRefine just updated to show that now there are 0 rows that meet the criteria for being blank in the field CPSC_Case_Number. These rows have been removed, and our total number of rows has changed from 1511 to 1503.

Screenshot showing that the Facet by Blank shows zero rows being True for having blank cells after the matching blank rows were removed