Skip to Main Content

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Transformations

Transformations are a powerful aspect of OpenRefine. Some common transformations are pre-loaded into the system such as trimming leading and training whitespace, changing all text in a cell to Title case (capitalizing the first letter of all words), Upper case, or Lower case, or transforming the format of the values in cells to number, date, or text.

To demonstrate how the trimming whitespace transformation works, go to the Other_Diagnosis column in the sample dataset and, from the drop down arrow, choose Facet, then Text Facet. There should be 938 choices. Scroll down the list until LOW BACK PAIN is near the top,

Notice that there are two seemingly identical entries for LOW BACK PAIN. The only difference is that the one with only two occurrences has an extra space at the end. By running the Trim leading and trailing whitespace transformation outlined above, the extra space will automatically be deleted and the incorrect entries will be joined up with the others, to make a total of 120 instances of LOW BACK PAIN.

Although common transformations are built in to OpenRefine, you can write a wide variety of different transformations using GREL, or General Refine Expression Language. Transformations can range from simple to very complex. It is not necessary to memorize them as long as you have access to a sample of GREL 'recipe books.' Links can be found at the bottom of this page.

Writing a Simple Transformation

Below is the GREL recipe version of the Title case transformation (changing the text of a value so that each word starts with a capital letter)

  • From the top of a column choose Edit Cells, then Transform. A dialog box like that seen above will pop up.
  • In the expression box type value.toTitlecase()   The expressions are case-sensitive, and the word "value" should appear only once.
  • Notice that in the Preview window, you can see the effect that your transformation will have. If it looks correct, click OK,

Writing a Complex Transformation

In the Product_1 field of the sample dataset, we see product numbers with the prefix "NEISS" before every entry. To work with those numbers, we might want to strip away the prefix first.

  • From the top of the Product_1 column, choose Edit Cells, then Transform
  • In the dialog box, within the Expression box, type the function seen below: value.partition(smartSplit(value,"-")[1])[1]

As the preview box shows, this split transformation works by using the hyphen in the value as the separator on which to split, and strips away everything before that separator, plus the separator itself. It is not necessary to memorize such transformations - users can search for specific transformation recipes online, or look through some of the recipe collections linked below.


OpenRefine Recipes

GREL Functions

In the project setup steps recommended in this guide for the sample dataset, it was recommended to direct OpenRefine not to automatically parse out dates and numbers. Therefore any columns that appear to contain dates or numbers will consider them as text. For the Treatment_Date field, we can do a simple transformation to change it from text format to date.

Common Transform: To Date

  • From the drop down arrow from the Treatment Date column, choose Edit cells, Common Transforms, then To date

  • When this tranformation is applied, the date formatted as text changes to an ISO-formatted datestamp, complete with timestamp:

Several different GREL formulas for additional transformations that can be done starting with data in the date format can be found here. In the example below we will transform this date back to a text format in a different style, while still maintaining the original date column.

Transform Date to Text in New Style

  • From the Date column drop down arrow, select Edit column, then Add column based on this column

  • Give the new column the name Formatted_Date
  • In the Expression box type: value.toString("dd MMMM yyyy")  Make sure the quotation marks are straight up-and-down, not curled.

  • Click OK. This will change the date back to text format, in a different style.

If you would like to perform mathematical calculations or transformations directly in OpenRefine, these functions are available. A list of functions with their GREL recipes can be found at the official OpenRefine GitHub page's Math Functions list. A few of these functions and their formulas are outlined below.

Math Functions

  • From the Weight column drop down arrow in the sample dataset, choose Edit cells, then Transform, then To number (recall that we did not parse numeric cells into number format in the project setup, so that change must be manually made now).

  • With the values in the Weight column now in number format, math functions can be applied. To begin, from the Weight column drop down arrow choose Edit cells, then Transform.

  • In the Expression box, type round(value). This rounds the value to the nearest integer.

  • Similar math functions include:
    • log(value)
    • pow(value,x) - where the value is raised to the x-th power.