Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Setting up a Project

When you download the program from OpenRefine.org, a Zip file will go to your Downloads folder. Unpack or Unzip this file to the new location on your machine where you'd like the program to reside. When the Zip file has been unpacked, you should see something like the following:

The application is the file with the diamond icon. You can create a shortcut to the application and place it on your desktop for easy access.

Once OpenRefine is downloaded, you can click on the blue diamond icon to launch it. First a small Java window will appear, which can be minimized and kept running in the background. Next the program itself will load in your default browser window.

You can open files stored on your computer in many different formats, including CSV, Excel, JSON, and XML. If you've downloaded the CSV file for these exercises from Google Drive, you can browse and open that file here.

You can also type in Web addresses of files. When your files have been selected or typed in, click Next.

After loading your file, you will see a preview of the first few rows of data. At the top of this preview screen you have the option to rename the data file on which you're working. Keep in mind that this file is a copy of your original data file, and that the original will not be changed in any way. Re-naming the file in OpenRefine can keep it from being confused with the original data file.

At the bottom of this screen are additional options for setting up your project. For the NEISS dataset the Character encoding works best set on UTF-8; other options are available in the drop down menu if the characters are populating incorrectly in this preview screen. To the right, check the box to parse the text of the first line of your dataset as column headers. For the purpose of these exercises, make sure that "Parse cell text into number, dates, etc." is unchecked, so OpenRefine won't automatically try to identify numbers, dates, etc. In other scenarios, however, you may wish to check this box.

When your configuration screen looks like the one below, click "Create Project" in the upper right corner.

Most operations in OpenRefine start with the menu options that can be seen by clicking the drop down arrow at the top of each column. Generally in OpenRefine you'll transform data in one column at a time. However the drop down arrow next to 'All' on the far left of the screen allows you to perform operations affecting all rows or all columns.

One of the most helpful features to know about from the All column menu is under Edit Columns and is called 'Reorder/remove columns.' By selecting this, you will be presented with a list of all the columns in the spreadsheet, listed in the same order from top to bottom that they appear left to right on the screen. Use this feature to drag columns toward the top if you'd like them farther left on the screen, since columns appearing on the right may be difficult to view.