Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Duplicates and Blanks

Duplicates in a dataset can be easily detected through use of OpenRefine's Duplicates Facet. Correct application of the facet will require thorough knowledge of the dataset, including which field in your spreadsheet contains values that should have unique instances for each record. The values in such a column then function as unique record identifiers. A case or record number is usually a good example of this type of field. The CPSC_Case_Number column in the sample dataset is an example of such a column, since each NEISS incident should have its own case number.

Steps to Search for Duplicate Records

  •  From the CPSC_Case_Number drop down arrow, choose Facet – Customized facets – Duplicates facet.

  • The results will be a listing of true or false instances of duplicates for this field (referred to as a Boolean).

  • Select true on the left to narrow your view down to only the rows in the spreadsheet that have duplicates. Scroll across the 16 full records to ensure that OpenRefine correctly identified duplicate records.
  • From the CPSC_Case_Number drop down arrow, choose “Sort – Sort Cell values as text,” since we have not changed them to numeric format. The sort will group together all the rows that have the same CPSC_Case_Number and put them in order.

In the previous exercise we identified duplicate records in the sample dataset. With these records identified, we can employ a quick technique to delete the duplicate records. The technique, called "blanking down," will be used to create blanks in the unique identifier fields of each duplicate record. Once this field is blank, all the records containing a blank in the field can be grouped and deleted at once.

Blanking Down Steps

  • Make sure that the Duplicates facet is still applied from the previous exercise.
  • From the CPSC_Case_Number drop down arrow, choose “Sort – Sort Cell values as text,” since we have not changed them to numeric format. The sort will group together all the rows that have the same CPSC_Case_Number and put them in order.

  • A new Sort option should appear near your top navigation. From that drop down, choose Reorder rows permanently. This enables the next step, when we work to blank down all the rows that are duplicates.

  • When the next step is employed an update will happen very quickly. From the CPSC_Case_Number column drop-down choose Edit cells – Blank down. Through this action, OpenRefine automatically sought the second occurrence of the duplicate CPSC_Case_Number values and made them blank. Since the Duplicates facet is still applied to the CPSC_Case_Number column, the facet has now updated itself to show that 8 rows have matches for duplicates in this field. Only now, these rows have duplicate values for CPSC_Case_Number because that duplicate value is “blank.”
  • Take the Duplicates filter off by x-ing out.

  • In the next exercise, we'll learn a quick trick for deleting these blank rows using Facet by Blank.

After blanking down the duplicate CPSC_Case_Number values from the sample dataset, even if we take off the Duplicates facet, we know that within this table, the corresponding 8 records still have blank values for CPSC_Case_Number. Using OpenRefine's Facet by Blank feature, we can quickly identify the records that have blank values in the CPSC_Case_Number field and delete them.

Facet by Blank Steps

  • From the CPSC_Case_Number column drop-down choose Facet – Customized facets – Facet by blank (null or empty string)

  • In the resulting facet window on the left, you’ll see that there are 8 records that are “true” for having a blank value in the column CPSC_Case_Number. Choose “true” in this facet to isolate the 8 rows.
  • From the “All” column drop-down towards the left, choose Edit rows – Remove matching rows.

  • With the Facet by Blank still in place, you can see that OpenRefine just updated to show that now there are 0 rows that meet the criteria for being blank in the field CPSC_Case_Number. These rows have been removed, and our total number of rows has changed from 1511 to 1503.