Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, and can also be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019.

Introduction

OpenRefine is an open-source, Java-based tool first released by Google in 2010 as Google Refine (it began at Metaweb as Freebase Gridworks). It follows the general spreadsheet format of cells, columns, and rows. Unlike typical spreadsheet programs, OpenRefine is designed for data cleaning and can transform large amounts of data quickly.

OpenRefine is a freely downloadable tool available at openrefine.org. It works by running a small Java server on the user's machine, so Java must be installed in order to run it. No coding knowledge is needed to use OpenRefine; however, learning some GREL (General Refine Expression Language) expressions helps you get the most out of it. More information on GREL expressions is provided throughout this guide.
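
For readers who know a little Python, the idea behind a GREL expression can be sketched as a small function applied to every cell in a column. The sketch below is only an analogy for a GREL expression such as value.trim().toUppercase(); it is not how OpenRefine works internally, and the function names and sample values are made up for illustration.

```python
# Python analogy for a per-cell GREL transform such as value.trim().toUppercase().
# The column values below are made-up sample data, not taken from the NEISS subset.

def clean_cell(value):
    """Strip surrounding whitespace and upper-case one cell; blank cells stay None."""
    if value is None:
        return None
    return value.strip().upper()


def transform_column(cells):
    """Apply the same cleaning step to every cell in a column."""
    return [clean_cell(cell) for cell in cells]


column = ["  male", "Female ", "FEMALE", None]
cleaned = transform_column(column)
print(cleaned)  # ['MALE', 'FEMALE', 'FEMALE', None]
```

In OpenRefine itself, no function needs to be written: the expression is typed once and applied to the whole column.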

OpenRefine's interface opens in the user's default web browser. However, OpenRefine does not share your data online, and the program can be used offline.

The dataset used for the examples in this guide is a small subset of the 2018 National Electronic Injury Surveillance System dataset. This subset is available for download on Google Drive. The exercises in this guide are adapted from Library Carpentry's OpenRefine online training.


References:

1. OpenRefine: Download. OpenRefine.org. Available at: https://openrefine.org/download.html

2. US Consumer Product Safety Commission. NEISS Highlights, Data and Query Builder. Available at: https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx

3. Erin Carillo (Ed.), Owen Stephens (Ed.), Juliane Schneider (Ed.), Paul R. Pival (Ed.), Kristin Lee (Ed.), Carmi Cronje (Ed.), James Baker, Christopher Erdmann, Tim Dennis, mhidas, Daniel Bangert, Evan Williamson, … Jeffrey Oliver. (2019, July). LibraryCarpentry/lc-open-refine: Library Carpentry: OpenRefine, June 2019 (Version v2019.06.1). Zenodo. http://doi.org/10.5281/zenodo.3266144