Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Facet on Multiple Columns - Shortcut

Steps for employing facets to identify and clean data-entry errors are outlined in Galter Library's introductory OpenRefine class, Cleaning Spreadsheet Data with OpenRefine. While much data cleaning can be done by faceting on one column at a time, it’s also possible to facet on two columns or more at once. You can apply these multiple facets manually to hone in on highly specified subsets of your data. You can even combine text-based and numeric facets at the same time. Using a GREL formula, you can achieve this in one step.

Facet on Multiple columns with a GREL Expression

  • From the drop down arrow of a column in the sample dataset, choose Facet - Custom text facet. In this example we will use the Sex column and combine with a facet on Age, to facet the dataset on sex and age at once.

Screenshot showing the menu path to a Custom Text Facet in OpenRefine

  • In the Expression dialog box that pops up, enter the formula: join([ cells[“Col 1"].value, cells[“Col 2"].value ], ",") in the Expression box. Substitute the columns you want to facet together in the Col 1 and Col 2 spaces - in this case, Sex and Age. Inside the last quote marks after value, use any separator you would like to keep the two columns’ values visually separated.

Screenshot showing a Custom Text Facet GREL formula for Faceting on multiple columns in OpenRefine

  • As the results show, using this custom facet formula allows us to hone in on male or female patients of particular ages.