Skip to Main Content

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Splitting and Joining Cells

OpenRefine displays data in a tabular way, much like a spreadsheet. When a project is first loaded, a row is equivalent to a record, and columns represent the types of data within the records, or the variables. Any individual data entry in the interface is called a value, and values are contained in individual cells.

When a project is first loaded, only 5 or 10 rows of data may be shown. In the top left corner of the project you can update the view to show up to 50 rows of data. Use the 'next' and 'last' links at the top right to scroll through more of your rows of data.

Notice that in the top left there is a number displaying how many rows of data you have. You can also toggle this to show how many records you have. When a project is first loaded, these numbers are the same. However, if we do a split operation, this will change.

The Narrative_1 column in the sample dataset has instances of two types of data stored in one column. There is narrative text describing the patient's experience, and in some cases there is a preliminary diagnosis. The diagnosis is always listed after a double carat symbol (>>).

To split the two types of data out into different rows, first click on the drop down arrow at the top of the Narrative_1 column. These drop down lists are the starting point of almost all functions that can be completed in OpenRefine. Take some time to look over the options, some of which are explored in other exercises in this guide. For this exercise, choose Edit Cells, then Split Multi-Valued Cells.

The dialog box that pops up will ask if you would like to do the split using a separator or a certain character length. We know that the data has a separator, therefore click this button and enter '>>' into the box for specifying the separator. Click OK.

Notice that new rows have populated in the project showing the diagnoses, which had appeared after the >>, in their own rows. Also look at the Rows/Records links at the top of the screen and toggle between them. OpenRefine should display the same amount of records as before, but now shows more rows than before. The program recognizes that the new rows belong to their original records, but through the split operation it has created new rows for us to work with. This function can be handy for cleaning up this granular data, and perhaps re-joining it to the original cells later.

Just as we can split up cells that have two types of values, or two different types of data, we can also join up rows into multi-valued cells. If you completed the cell-splitting exercise on the Narrative_1 column of the sample dataset, we will build on that and re-join the cells we split.

From the drop down menu at the top of the Narrative_1 column, choose Edit Cells, then Join Multi-Valued Cells.

 

The dialog box that pops up will ask what separator you would like to add between the two distinct values you are about to join. You can choose any number, letter, or symbol you would like. Using the double carat (>>) will restore the Narrative_1 cells to the way they originally appeared.

This exercise works because there was pre-existing data in multiple pre-separated rows, which could be joined back up to the parent records.