Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Duplicates Facet

Duplicates in a dataset can be easily detected through use of OpenRefine's Duplicates Facet. Correct application of the facet will require thorough knowledge of the dataset, including which field in your spreadsheet contains values that should have unique instances for each record. The values in such a column then function as unique record identifiers. A case or record number is usually a good example of this type of field. The CPSC_Case_Number column in the sample dataset is an example of such a column, since each NEISS incident should have its own case number.

Steps to Search for Duplicate Records

  •  From the CPSC_Case_Number drop down arrow, choose Facet – Customized facets – Duplicates facet.

Screenshot showing the menu path to the Duplicates facet in OpenRefine

  • The results will be a listing of true or false instances of duplicates for this field (referred to as a Boolean).

Screenshot showing the sixteen rows that match for having duplicate rows

  • Select true on the left to narrow your view down to only the rows in the spreadsheet that have duplicates. Scroll across the 16 full records to ensure that OpenRefine correctly identified duplicate records.
  • From the CPSC_Case_Number drop down arrow, choose “Sort – Sort Cell values as text,” since we have not changed them to numeric format. The sort will group together all the rows that have the same CPSC_Case_Number and put them in order.

Screenshot showing the menu path to Sort in OpenRefine

Screenshot showing Sort options in OpenRefine