Duplicates in a dataset can be easily detected with OpenRefine's Duplicates facet. Applying the facet correctly requires thorough knowledge of the dataset, in particular which column in your spreadsheet should hold a distinct value for every record. The values in such a column function as unique record identifiers. A case or record number is a typical example of this type of field. In the sample dataset, the CPSC_Case_Number column plays this role, since each NEISS incident should have its own case number.
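To make the idea behind the facet concrete, here is a minimal Python sketch of the same logic: a record is flagged when its identifier value occurs more than once in the column. It only illustrates the concept and is not how OpenRefine itself is implemented. The rows, the case numbers, and the function name `flag_duplicates` are all invented for this example.

```python
from collections import Counter

# Hypothetical rows; these case numbers are made up for illustration
# and do not come from the NEISS sample dataset.
rows = [
    {"CPSC_Case_Number": "1001", "Age": "34"},
    {"CPSC_Case_Number": "1002", "Age": "7"},
    {"CPSC_Case_Number": "1001", "Age": "34"},
    {"CPSC_Case_Number": "1003", "Age": "52"},
]


def flag_duplicates(rows, key):
    """Return one boolean per row: True if the row's key value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] > 1 for row in rows]


flags = flag_duplicates(rows, "CPSC_Case_Number")
print(flags)  # [True, False, True, False]
```

Note that every copy of a repeated value is flagged, including the first one, which is why flagging alone is not enough to decide which rows to delete.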
In the previous exercise we identified duplicate records in the sample dataset. With these records identified, we can use a quick technique called "blanking down" to delete them. Blanking down empties the unique identifier field of each repeated record while leaving the first occurrence intact. Because OpenRefine blanks a cell only when it repeats the value directly above it, the rows should first be sorted on the identifier column so that duplicates sit next to each other. Once the field is blank, all the records containing a blank in that field can be grouped and deleted at once.
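The blanking-down step can be sketched in a few lines of Python. This is an illustration of the idea under simple assumptions, not OpenRefine's own code: the rows are first sorted on the identifier, and a cell is then emptied whenever it repeats the value above it. The sample rows and the function name `blank_down` are hypothetical.

```python
# Hypothetical rows; the case numbers are invented for illustration.
rows = [
    {"CPSC_Case_Number": "1002", "Age": "7"},
    {"CPSC_Case_Number": "1001", "Age": "34"},
    {"CPSC_Case_Number": "1003", "Age": "52"},
    {"CPSC_Case_Number": "1001", "Age": "34"},
]


def blank_down(rows, key):
    """Sort rows on `key`, then blank `key` in every row that repeats the value above it."""
    # Sorting places duplicate identifiers on adjacent rows.
    ordered = sorted(rows, key=lambda row: row[key])
    result = []
    previous = None
    for row in ordered:
        new_row = dict(row)  # copy so the input rows are left unchanged
        if row[key] == previous:
            new_row[key] = ""  # repeated value: blank it
        else:
            previous = row[key]  # first occurrence: keep it
        result.append(new_row)
    return result


blanked = blank_down(rows, "CPSC_Case_Number")
print([row["CPSC_Case_Number"] for row in blanked])  # ['1001', '', '1002', '1003']
```

Only the second copy of case number 1001 loses its identifier, so exactly one record per case number survives the later deletion.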
After blanking down the duplicate CPSC_Case_Number values in the sample dataset, the 8 corresponding records keep their blank CPSC_Case_Number values even after we remove the Duplicates facet. Using OpenRefine's Facet by blank feature, we can quickly isolate the records that have a blank value in the CPSC_Case_Number field and delete them.
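The final step, selecting the blank records and removing them, can be sketched as follows. As before, this is a hedged illustration rather than OpenRefine's implementation. The rows and the names `is_blank` and `drop_blank_rows` are hypothetical, and the sketch assumes that a blank cell is either missing (`None`) or an empty string.

```python
# Hypothetical rows as they might look after blanking down;
# the case numbers are invented for illustration.
rows = [
    {"CPSC_Case_Number": "1001", "Age": "34"},
    {"CPSC_Case_Number": "", "Age": "34"},
    {"CPSC_Case_Number": "1002", "Age": "7"},
    {"CPSC_Case_Number": "1003", "Age": "52"},
]


def is_blank(value):
    """Assumption for this sketch: blank means None or an empty string."""
    return value is None or value == ""


def drop_blank_rows(rows, key):
    """Keep only the rows whose `key` field is not blank."""
    return [row for row in rows if not is_blank(row[key])]


cleaned = drop_blank_rows(rows, "CPSC_Case_Number")
removed = len(rows) - len(cleaned)
print(removed)  # 1
```

One caution follows from the logic: any record whose identifier was already blank before blanking down would be removed as well, so it is worth checking that the number of blank rows matches the number of duplicates you expected to delete.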