Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Reconciliation using the MeSH SPARQL Endpoint

Reconciliation services allow you to look up terms from your data source in other sources online that maintain controlled vocabularies, such as Wikidata or MeSH. This allows you to add to your project metadata values that are expressed in controlled ways, which can help with interoperability. In this example, we will reconcile one value from the Other_Diagnosis column in the sample dataset against the Medical Subject Headings (MeSH) in order to find the MeSH-controlled way of expressing it.

Steps for Reconciliation using the MeSH SPARQL Endpoint

  • Star a row which contains a term in the Other_Diagnosis field that you’d like to reconcile against MeSH. In the example we’ll use the row with ‘hypothermia’ in Other_Diagnosis. Facet by Star from the All column so that only this row is selected. We include this step so that reconciliation will work over only one term. Since a website must be queried in the process, reconciling against thousands of terms can take a few hours.

Screenshot showing how to Facet by Star from the All column in OpenRefine

  • In the Facet results, choose "true" to work with only the starred row.
  • Click the Other_Diagnosis column drop-down and choose: Reconcile - Start reconciling.

Screenshot showing the menu path to start reconciling a column of data in OpenRefine

  • In the next screen you will see a list of the installed reconciliation services you can choose from. Choose MeSH Reconciliation, which we installed in the previous step, from the available Services. This will start the query to the MeSH SPARQL endpoint, which can take up to five minutes.
  • When the Reconciliation screen appears, choose the entity which ends in “TopicalDescriptor” on the left in order to query the MeSH topical subject headings. On the right you can choose other columns from your project that might be helpful to the reconciliation service as it queries MeSH’s available definitions for the chosen term. If none of the other columns are related to your chosen term, leave this area blank. Click “Start Reconciling.” Be prepared to wait a few minutes while OpenRefine queries the Reconciliation service.

Screenshot showing the reconciliation option screens in OpenRefine

  • After reconciling, two new facets appear: Other_Diagnosis: judgment and Other_Diagnosis: best candidate’s score. The judgment facet shows which values have been matched. Near “matched,” there may also be a value, “none,” if you ran the Reconciliation over multiple rows and some items failed to match. As you make matches between your values and the reconciled terms, the number in ‘none’ will go down and the ‘matched’ number will go up. The best candidate’s score facet has to do with how well the values matched against those from the online authority file, based on fuzzy matching.

Screenshot showing the results of reconciliation in OpenRefine

  • Notice the check-marks next to the choices for matched values back in our original Other_Diagnosis column. As you can see, you can choose to match your original term to the term “Hypothermia” or to two other related terms. To complete the Reconciliation, we can choose one or the other. Clicking the single-check box will accept the answer for one cell. Clicking the multi-check box will accept the new value for all cells in which it appears.
  • The Reconciliation function likely worked quickly because we starred and faceted on one row, and therefore were only reconciling one term. The more terms you reconcile at once, the longer it will take for OpenRefine to complete the function. For more on OpenRefine Reconciliation services, see: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources