GalterGuides: Cleaning Spreadsheet Data with OpenRefine: Splitting Cells

Splitting Multi-Valued Cells

OpenRefine displays data in a tabular way, much like a spreadsheet. When a project is first loaded, a row is equivalent to a record, and columns represent the types of data within the records, or the variables. Any individual data entry in the interface is called a value, and values are contained in individual cells.

When a project is first loaded, only 5 or 10 rows of data may be shown. In the top left corner of the project you can update the view to show up to 50 rows of data. Use the 'next' and 'last' links at the top right to scroll through more of your rows of data.

Notice that in the top left there is a number displaying how many rows of data you have. You can also toggle this to show how many records you have. When a project is first loaded, these numbers are the same. However, if we do a split operation, this will change.

The Narrative_1 column in the sample dataset has instances of two types of data stored in one column. There is narrative text describing the patient's experience, and in some cases there is a preliminary diagnosis. The diagnosis is always listed after a double carat symbol (>>).

To split the two types of data out into different rows, first click on the drop down arrow at the top of the Narrative_1 column. These drop down lists are the starting point of almost all functions that can be completed in OpenRefine. Take some time to look over the options, some of which are explored in other exercises in this guide. For this exercise, choose Edit Cells, then Split Multi-Valued Cells.

The dialog box that pops up will ask if you would like to do the split using a separator or a certain character length. We know that the data has a separator, therefore click this button and enter '>>' into the box for specifying the separator. Click OK.

Notice that new rows have populated in the project showing the diagnoses, which had appeared after the >>, in their own rows. Also look at the Rows/Records links at the top of the screen and toggle between them. OpenRefine should display the same amount of records as before, but now shows more rows than before. The program recognizes that the new rows belong to their original records, but through the split operation it has created new rows for us to work with. This function can be handy for cleaning up this granular data, and perhaps re-joining it to the original cells later.

Cleaning Spreadsheet Data with OpenRefine

Splitting Multi-Valued Cells

Northwestern University Feinberg School of Medicine

Northwestern University
Feinberg School of Medicine