Skip to Main Content

Cleaning Spreadsheet Data with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn the basic functions of OpenRefine. The class and guide are adapted from Library Carpentry OpenRefine, Copyright 2016-2019

Setting up Reconciliation in OpenRefine

Reconciliation services in OpenRefine allow you to look up items from your data in other sources online that maintain controlled lists of terms, such as the VIAF (Virtual International Authority File) or Wikidata. This allows you to update the data in your source using the controlled terms from these sources, which they can also link out to for maximum interoperability. This example will show how to reconcile values from the Other_Diagnosis column against Wikidata.

Reconciliation steps

  • As in the webscraping example, star only one row in the sample dataset and from the All drop down arrow on the left, choose Facet by Star and choose only the starred row. Reconciling using just one row at first saves time, since reconciling a full dataset can be very time consuming as OpenRefine makes API queries for thousands of rows of data.
  • From the Other_Diagnosis drop down arrow choose Reconcile, then Start reconciling

Screenshot showing the menu path to Reconciliation in OpenRefine

  • In the next screen are options to add standard reconciliation services. In this example we will use Wikidata

Screenshot showing the Wikidata Reconciliation service in OpenRefine

  • After choosing Wikidata, options will appear on the left as to which types of data you'd like to reconcile your chosen term against. Choose 'disease' as the most appropriate type of data. The options on the right allow you to choose which other columns from your data might help in the reconciliation process, as OpenRefine queries the Wikidata source for relevant terms. Choose the Wikipedia_Info_Parsed (or, Wikipedia Definition Parsed as in the image) column created in the previous webscraping exercise, since this will be the most relevant column to help in term matching.
  • Click Start Reconciling

Screenshot showing options to improve Reconciliation searches in OpenRefine

  • See the next box for the reconciliation results

Reconciliation Results

Screenshot showing Reconciliation results screens in OpenRefine

After reconciling, two new facets appear: Other_Diagnosis: judgment and Other_Diagnosis: best candidate’s score. The judgment facet shows which values have been matched. Near “matched,” there may also be a value, “none,” if you ran the Reconciliation over multiple rows and some items failed to match. As you make matches between your values and the reconciled terms, the number in ‘none’ will go down and the ‘matched’ number will go up. The best candidate’s score facet has to do with how well the values matched against those from the online authority file, based on fuzzy matching.

Notice the check-marks next to the choices for matched values back in the original Other_Diagnosis column. Two related terms besides “asthma” were found in Wikidata through the reconciliation process. To complete the reconciliation, any of these terms can be selected. Clicking the single-check box will accept the answer for one cell. Clicking the multi-check box will accept the new value for all cells in which the original value appeared.

Keep in mind that reconciliation will take much longer to run if it is run over the entire spreadsheet instead of just one row of data. For more on OpenRefine Reconciliation services, see the official OpenRefine GitHub page’s guidance on Reconcilable Data Sources.