GalterGuides: Essential Spreadsheet Data Cleaning with OpenRefine: Duplicates Facet

Duplicates Facet

Duplicates in a dataset can be easily detected through use of OpenRefine's Duplicates Facet. Correct application of the facet will require thorough knowledge of the dataset, including which field in your spreadsheet contains values that should have unique instances for each record. The values in such a column then function as unique record identifiers. A case or record number is usually a good example of this type of field. The CPSC_Case_Number column in the sample dataset is an example of such a column, since each NEISS incident should have its own case number.

Steps to Search for Duplicate Records

From the CPSC_Case_Number drop down arrow, choose Facet – Customized facets – Duplicates facet.

The results will be a listing of true or false instances of duplicates for this field (referred to as a Boolean).

Select true on the left to narrow your view down to only the rows in the spreadsheet that have duplicates. Scroll across the 16 full records to ensure that OpenRefine correctly identified duplicate records.
From the CPSC_Case_Number drop down arrow, choose “Sort – Sort Cell values as text,” since we have not changed them to numeric format. The sort will group together all the rows that have the same CPSC_Case_Number and put them in order.

Essential Spreadsheet Data Cleaning with OpenRefine

Duplicates Facet

Steps to Search for Duplicate Records

Northwestern University Feinberg School of Medicine

Northwestern University
Feinberg School of Medicine