Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name and can also be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Parsing HTML in OpenRefine

To extract the description of Asthma from the HTML code block retrieved in the previous exercise, use a GREL parsing expression.

Parsing HTML

  • Click the drop-down arrow on the Wikipedia_Info column and select Edit column, then Add column based on this column

Screenshot showing how to add a column based on another column in OpenRefine

  • Give this new column the title 'Wikipedia_Info_Parsed'
  • In the Expression box, type the expression: value.parseHtml().select("page")[0].htmlText()

Screenshot showing a GREL formula to parse the Description information from a column with webscraped Wikipedia data

  • The new Wikipedia_Info_Parsed column will contain only the description of Asthma, separated from the rest of the HTML code.
  • Various GREL parsing expressions can be found through sources like Library Carpentry and Stack Overflow.
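
The expression in the steps above chains three operations: parseHtml() turns the cell text into a parsed document, select("page")[0] picks the first element matching the selector "page", and htmlText() returns that element's text without the surrounding tags. For readers who want to see that logic outside OpenRefine, here is a minimal Python sketch of the same idea using only the standard library. It is not how OpenRefine implements GREL; the class name, the helper function, and the sample markup are all hypothetical, and the real response retrieved in the previous exercise may be structured differently.

```python
from html.parser import HTMLParser


class FirstElementText(HTMLParser):
    """Collect the text inside the first element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # greater than 0 while inside the first match
        self.done = False   # True once the first match has been closed
        self.parts = []     # text fragments found inside the match

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if not self.done and tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # ignore any later matching elements

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def first_element_text(html, tag):
    """Rough analogue of value.parseHtml().select(tag)[0].htmlText()."""
    parser = FirstElementText(tag)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads as plain text.
    return " ".join("".join(parser.parts).split())


# Hypothetical sample markup, invented for illustration only.
sample = (
    "<api><query><pages>"
    "<page title='Asthma'><extract>Asthma is a long-term "
    "inflammatory disease of the airways.</extract></page>"
    "<page title='Other'><extract>Ignored.</extract></page>"
    "</pages></query></api>"
)

print(first_element_text(sample, "page"))
```

Unlike the GREL expression, which raises an error when select() finds no match and [0] has nothing to index, this sketch returns an empty string in that case.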