Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Transposing Columns into Rows

Occasionally in data wrangling a column must be redefined or reconfigured in order to store data in a more logical way. In this exercise we’ll demonstrate how to do that by transposing columns. In the sample dataset there are three dimension-based columns, P_Height, P_Width, and P_Depth. They could be reconfigured to result in two columns, Dimension and Measurement, with the specific dimension referred to spread out over new rows. Although this may seem confusing, the new rows will remain associated with their original records.

Steps to Transpose Columns into Rows

  • From the All drop down arrow choose Edit Columns – Reorder/remove columns and drag P_Height, P_Width, and P_Depth to the top

  • In the OpenRefine main screen, toggle over to records mode so that OpenRefine understands that we will be applying these changes to entire records, not individual rows.
  • Make sure that the P_Height, P_Width, and P_Depth columns are all next to each other, as this is necessary for Transposition to work.
  • From the drop down arrow at the top of the left-most column (in this case, P_Height), choose Transpose, then Transpose cells across columns into rows.

  • In the Transposition dialog box, choose the beginning and ending columns on the left (From and To). In our example From will be the P_Height column, and To will be P_Depth.
  • On the right, name the two new columns Dimension and Measurement. Dimension is a key column, or, the one that contains the original columns’ names. Measurement is a value column, containing the original column’s values
  • Transposing the original three columns into one column is also possible, but this results in multi-valued cells which would need to be broken out. Therefore leave the One column button unchecked.
  • Ignore blank cells is an option to not have new rows created for P_Height, P_Width, or P_Depth if the measurement value was blank for any instance. If you would like to always see the three new rows in the new, transposed cells, even if the Measurement value is blank, then uncheck Ignore blank cells.
  • When your screen looks like the one below, click Transpose.

  • The measurement-related columns in the project have now been reduced, from three to two.

  • Although new rows have been added, they are still associated with their original records, as can be seen by the shading cues OpenRefine provides in record mode. The record numbers that OpenRefine assigns are also a cue that no new records have been created.
  • The results can also be viewed in rows mode. However be aware that this will make the dataset appear to expand, as rows are created and all the non-dimensional data is repeated to go with each instance of height, width, and depth. When transposing, it’s best to keep the project in record mode.
  • Note: Transposing in the opposite direction, from rows into columns, is also possible. See the questions, answers, and examples available on Stack Overflow: