Skip to main content
It looks like you're using Internet Explorer 11 or older. This website works best with modern browsers such as the latest versions of Chrome, Firefox, Safari, and Edge. If you continue with this browser, you may see unexpected results.

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Advanced: Webscraping

A powerful feature of OpenRefine is the ability to pull external data from URLs. This exercise will apply this feature to the Other_Diagnosis column of the sample dataset. Pulling in data from outside sources can be time-consuming as OpenRefine must query them, therefore we'll work with just one row.

Steps for Fetching Data from a URL

  • In the left-most column there is a blank star. Click this to apply a star to row 4, which has a value of Asthma for Other_Diagnosis. Then from the All column on the far left, select Facet, then Facet by star. This will result in Boolean-style true/false choices in the Facet area.

  • Select true from the choices that come up in the Facet area. Now only row 4 should be selected.
  • From the Other_Diagnosis column drop down arrow, select Edit column, then Add column by fetching URL

  • For the new column's title, use Wikipedia_Info. In the Expression box, type the following, including the spaces:

"https://en.wikipedia.org/w/api.php?action=opensearch&action=query&generator=prefixsearch&gpssearch="

+ value.escape('url')

+ "&prop=extracts&exintro=1&explaintext=1&redirects=1&limit=10&namespace=0&format=xml"

  • The result will be that a long string of HTML encoded text will be added in the new column next to Other_Diagnosis

See the next tab to learn how to parse useful bits from this data.

To parse the description of Asthma from the HTML code block retrieved in the previous exercise, a parsing expression can be used.

Parsing HTML

  • Click on the Wikipedia_Info column's drop down arrow and select Edit column, then Add column based on this column

  • Give this new column the title 'Wikipedia_Info_Parsed'
  • In the Expression box, type the expression: value.parseHtml().select("page")[0].htmlText()

  • The result will be that in your new Wikipedia_Info_Parsed column, only the description of Asthma will appear, parsed out from the rest of the HTML code.
  • Various GREL parsing expressions can be found through sources like Library Carpentry and Stack Overflow.