Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Introduction

OpenRefine is an open-source, Java-based tool originally developed by Google in 2010. It follows the general spreadsheet format utilizing cells, columns, and rows. Unlike average spreadsheet programs, OpenRefine functions as a powerful data cleaning machine, enabling transformations of large amounts of data quickly.

OpenRefine is a freely downloadable tool available at openrefine.org. It works by running a small Java server on the user's machine, hence Java must be installed in order to run it. No coding knowledge is needed to use OpenRefine; however for maximum benefit it can help to learn some GREL (General Refine Expression Language) expressions. More information on GREL expressions is provided throughout this guide.

The interface for OpenRefine is the user's default browser window. However OpenRefine does not share your data online, and the program can even be used offline.

The exercises in this guide build on those in the Galter Library class Cleaning Spreadsheet Data with OpenRefine and outlined in the associated Guide. The dataset used for examples in this guide is a small subset of the 2019 National Electronic Injury Surveillance System dataset. This subset is available for download on Google Drive. Also necessary for one exercise is the NEISS Product Codes and Product Names table available on Google Drive.


References:

1. OpenRefine: Download. OpenRefine.org. Available at: https://openrefine.org/download.html

2. US Consumer Product Safety Commission. NEISS Highlights, Data and Query Builder. Available at: https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx