
Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, and can also be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Configure a Project

After loading your file, you will see a preview of the first few rows of data. At the top of this preview screen, you have the option to rename the project you are working on. Keep in mind that the project is a copy of your original data file, and that the original will not be changed in any way. Renaming the project in OpenRefine keeps it from being confused with the original data file.

At the bottom of this screen are additional options for setting up your project. For the NEISS dataset, the Character encoding works best set to UTF-8; if characters display incorrectly in the preview, try one of the other encodings in the drop-down menu. To the right, check the box to parse the text of the first line of your dataset as column headers. For the purpose of these exercises, make sure that "Parse cell text into number, dates, etc." is unchecked, so that OpenRefine does not automatically try to identify numbers, dates, etc., and every cell is left as text. In other scenarios, however, you may wish to check this box.
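For readers who find it helpful to see these three settings outside the OpenRefine interface, the following is a rough Python analogy, not a description of how OpenRefine itself works. It assumes a small made-up CSV sample (the column names and values are hypothetical, not taken from the NEISS dataset) and a hypothetical helper called load_rows. It decodes the bytes as UTF-8, treats the first line as column headers, and leaves every cell as text.

```python
import csv
import io

# Hypothetical sample standing in for a small CSV file saved as UTF-8.
# The column names and values are invented for illustration only.
SAMPLE_BYTES = (
    "case_id,age,narrative\n"
    "001,34,fell off ladder\n"
    "002,07,patient fell at café\n"
).encode("utf-8")


def load_rows(raw_bytes, encoding="utf-8"):
    """Decode the bytes, use line 1 as headers, and keep every cell as text."""
    text = raw_bytes.decode(encoding)
    # csv.DictReader takes the first line as the column headers and
    # returns every value as a string; nothing is converted to a number.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


# Correct encoding: the accented character survives and values stay text.
rows = load_rows(SAMPLE_BYTES)
headers = list(rows[0].keys())

# Wrong encoding (Latin-1 chosen as an example): the same bytes decode
# without an error, but the accented character comes out garbled.
garbled = load_rows(SAMPLE_BYTES, encoding="latin-1")

print(headers)
print(rows[1])
print(garbled[1]["narrative"])
```

Because nothing is converted, a value such as "001" keeps its leading zeros. The last line printed shows the kind of garbled text that suggests a different encoding should be chosen.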

When your configuration screen looks like the one below, click "Create Project" in the upper right corner.

Screenshot showing the OpenRefine project preview page