Skip to Main Content

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Retrieving Data from URLs

A powerful feature of OpenRefine is the ability to pull external data from URLs. This exercise will apply this feature to the Other_Diagnosis column of the sample dataset. Pulling in data from outside sources can be time-consuming as OpenRefine must query them, therefore we'll work with just one row.

Steps for Fetching Data from a URL

  • In the left-most column there is a blank star. Click this to apply a star to row 4, which has a value of Asthma for Other_Diagnosis. Then from the All column on the far left, select Facet, then Facet by star. This will result in Boolean-style true/false choices in the Facet area.

Screenshot showing how to facet by star in the All column in OpenRefine

  • Select true from the choices that come up in the Facet area. Now only row 4 should be selected.
  • From the Other_Diagnosis column drop down arrow, select Edit column, then Add column by fetching URL.

Screenshot showing the menu path for adding a column by fetching a URL

  • For the new column's title, use Wikipedia_Info. In the Expression box, type the following, including the spaces:

"https://en.wikipedia.org/w/api.php?action=opensearch&action=query&generator=prefixsearch&gpssearch="

+ value.escape('url')

+ "&prop=extracts&exintro=1&explaintext=1&redirects=1&limit=10&namespace=0&format=xml"

Screenshot showing a dialog box with a formula for webcsraping Wikipedia

  • The result will be that a long string of HTML encoded text will be added in the new column next to Other_Diagnosis

Screenshot showing the results of Wikipedia webscraping in OpenRefine

See the next section to learn how to parse useful bits from this data.