Skip to Main Content

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Booleans and Arrays

A 'Boolean' in a spreadsheet wrangling program refers to the ability to tag cell values with either TRUE or FALSE on the basis of meeting criteria that you specify. In the next exercise we'll set up a Boolean argument for the purpose of transforming the data that meets one of the criteria.

In the Splitting and Joining Cells part of this guide you learned that some of the cells in the Narrative_1 column of the sample dataset contain two values, a narrative portion and a diagnosis portion, which are separated by a double carat symbol (>>). We'll set up a Boolean argument to find exactly which cells throughout the spreadsheet contain this phenomenon.

Setting up a Boolean argument

  • The Boolean argument will be set up as part of a custom text facet, so we can facet on the TRUE/FALSE outcome of the argument. To begin, from the Narrative_1 column's drop down arrow select Facet, then Custom text facet

  • In the Expression box type: value.contains(">>").toString()  Make sure the quotation marks are straight up-and-down, not curled.

  • This Boolean GREL expression is searching for the presence of the double carat (>>) symbol in the Narrative_1 cells. When the >> is present it assigns TRUE, and when it is absent it assigns FALSE. TRUE/FALSE does not appear anywhere in the spreadsheet, but is only available for the purpose of faceting. If you scroll down in the Preview box, you will see that TRUE has been correctly assigned. Click OK.

  • Take a look at the box that has populated for Narrative_1 in the Facet area. It shows that for most rows of the spreadsheet, the specified condition was false - most Narrative_1 cells did not contain the >> symbol. But for 124 cells, the condition is true. Click true so the spreadsheet will be faceted on that condition.
  • In the next exercise, we'll complete a transformation only on the "TRUE" cells.

 

Transforming Arrays

As stated above, the Narrative_1 cells that have been isolated by the Boolean argument have two values. In this exercise, we'll format them as an array. An array is simply a list of values in a certain format. OpenRefine's array format is to list the values inside square brackets, with each value enclosed in quotation marks and separated by a comma. The result looks like this: ["dog","cat","fish","bird"].

Why would we want to list the two values in the faceted Narrative_1 cells in an array format? Because if we do, then the two values will be considered separate entities that we can make changes to. In the example below, we'll use a two-step process to flip the values, so the diagnosis will appear first.

  • To create the array, leave the true facet selection on from the exercise above, and from Narrative_1's drop down arrow select Edit cells, then Transform
  • In the Expression box, type the GREL expression: value.match(/(.*)>>(.*)/)  Look at the result in the Preview box, but DON'T click OK yet!

The expression used above is a combination of a regular expression (contained within the / /, based on Java syntax) and a GREL expression (contained within the outer parentheses). It is not necessary to memorize these expressions. For hybrid regular-GREL expressions commonly used in OpenRefine, see the many OpenRefine recipes available on their official GitHub page.

  • We can see that the two values inside the Narrative_1 cells are now formatted as array values - they are enclosed in double quotes and separated by a comma.
  • To flip these two values, append this expression to the end of your expression above: .reverse().join(" ")

  • The result is that the two array values have been flipped and rejoined. Now the diagnosis appears first. Click OK to apply the change.