
Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Faceting and Clustering

Faceting is one of the most powerful features of OpenRefine. A facet counts the unique values in a column and displays them in a box on the left side of the screen. This helps users quickly find inconsistencies in the data. In this example we'll facet the Other_Diagnosis column from the sample dataset. It can help to first go to the All column, then choose Edit columns - Re-order / remove columns, and drag the Other_Diagnosis column towards the top. This brings the column farther left on the screen.

Faceting Steps

  • From the Other_Diagnosis column's drop-down arrow, choose Facet, then Text facet.

  • Notice that the column name Other_Diagnosis and all its values have populated in the Facet area on the left side of the screen.

  • The faceted view of this data offers various options for data cleaning. First, the terms can be sorted either alphabetically or by how frequently they occur in the column.
  • Clicking on any term will select it and allow you to work with only the rows in which that term appears in the Other_Diagnosis column.
  • After selecting terms, you can edit them either in their individual cells, or through the small blue edit link in the faceting area. If you fix a misspelling, be aware that the misspelled term will then disappear from the facet list, because its rows now count toward the correctly spelled term.
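
For readers who like to see the idea in code, the faceting steps above can be sketched in a few lines of Python. This is only an illustration of the concept, not how OpenRefine is implemented. The list `other_diagnosis` and its values are made up for the example, and `text_facet` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample values standing in for the Other_Diagnosis column.
other_diagnosis = [
    "Asthma", "asthma", "Diabetes", "Asthma", "Diabetis", "Diabetes", "Asthma ",
]

def text_facet(values):
    """Count how many rows hold each distinct value, as a text facet does."""
    return Counter(values)

facet = text_facet(other_diagnosis)

# A facet can be listed by name or by count (most frequent first).
by_name = sorted(facet.items(), key=lambda item: item[0])
by_count = sorted(facet.items(), key=lambda item: (-item[1], item[0]))

# Selecting a term keeps only the rows that hold exactly that value.
selected_rows = [i for i, v in enumerate(other_diagnosis) if v == "Diabetis"]

# Editing the term in the facet rewrites every matching row at once.
corrected = ["Diabetes" if v == "Diabetis" else v for v in other_diagnosis]
corrected_facet = text_facet(corrected)
```

After the correction, "Diabetis" no longer appears in `corrected_facet`, and its row is counted under "Diabetes" instead. Note that "Asthma", "asthma" and "Asthma " (with a trailing space) are still counted as three separate values, which is exactly the kind of inconsistency a facet makes visible.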

The previous exercise, Faceting, showed how to isolate individual data values for the purpose of examining and cleaning them. Faceting is an extremely useful feature of OpenRefine, but cleaning individual data points this way can be time consuming in large datasets. This is where OpenRefine's Clustering feature can be extremely helpful.

To begin, choose the Other_Diagnosis column from the sample dataset. From the drop-down arrow at the top of the column, choose Edit Cells, then Cluster and edit.

In the results screen, OpenRefine has automatically brought together the terms that seem most related into clusters. Here you can quickly review the groupings and select the term that should be used according to your data standards or conventions. As you make your selections, tick the Merge checkbox next to each one. When you've made all your selections, click 'Merge Selected & Re-cluster' at the bottom of the screen to search for any additional matches.
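
To give a feel for how terms can end up in the same cluster, here is a minimal Python sketch of a key-collision approach: each value is reduced to a normalized key, and values that share a key are grouped. The `fingerprint` function below is a simplified stand-in written for this guide (trim, lowercase, remove punctuation, sort the unique words). It is not OpenRefine's own code, and the sample values are made up.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group distinct values that share a key; keep groups of two or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(members) for members in groups.values() if len(members) > 1]

# Hypothetical values standing in for the Other_Diagnosis column.
sample = ["Heart Disease", "heart disease", "Disease, Heart", "Asthma", "Diabetis"]
clusters = cluster(sample)
```

Here the three spellings of "heart disease" share the key `disease heart` and form one cluster. A key like this ignores case, punctuation and word order, but it does not catch a spelling error such as "Diabetis", which is one reason it is worth trying more than one clustering method.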

  • It can be helpful to experiment with the Method and Keying function (algorithm) options at the top of the Cluster & Edit screen. Different algorithms for generating the matches can work better or worse, depending on the data type.
  • The New Cell Value field on the right is pre-filled with the value that appears most often in the cluster. This is only a suggestion, and is not necessarily the correct format for the data. You can replace it with one of the other related values by clicking that value, or you can manually enter your own New Cell Value.
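
The idea of a suggested New Cell Value can also be sketched in Python. This is an illustration under assumptions rather than OpenRefine's actual logic: `suggest_value` and `merge` are hypothetical helper names, and the column values are invented for the example.

```python
from collections import Counter

def suggest_value(cluster_members, column_values):
    """Return the cluster member that occurs most often in the column."""
    counts = Counter(v for v in column_values if v in cluster_members)
    return counts.most_common(1)[0][0]

def merge(column_values, cluster_members, new_value):
    """Replace every value in the cluster with the chosen new value."""
    return [new_value if v in cluster_members else v for v in column_values]

# Hypothetical column and one cluster found in it.
column = ["heart disease", "Heart Disease", "heart disease", "Disease, Heart", "Asthma"]
members = {"heart disease", "Heart Disease", "Disease, Heart"}

# The most frequent spelling is offered as the default.
suggested = suggest_value(members, column)

# The default can be overridden with the form your data standard requires.
merged = merge(column, members, "Heart Disease")
```

In this made-up column the most frequent spelling is the lowercase "heart disease", so that is what `suggested` holds. If your convention is title case, you would override it with "Heart Disease" before merging, as the last line does.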