Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Installing Extensions and Reconciling with API Endpoints

Installing extensions is an important first step to taking advantage of some of the online services with which OpenRefine can interact. A list of currently supported extensions is available on the OpenRefine downloads page. In this exercise we'll install the RDF extension. RDF stands for “Resource Description Framework,” which is a framework for representing information on the Internet. With this extension in place, we can query APIs and SPARQL endpoints, which is a way of bringing information from outside websites into our OpenRefine projects.

Steps to Install the RDF Extension

  • On the RDF extension GitHub page, download the latest release of the extension: RDF Extension v1.4.0
  • Navigate to the folder on your machine where you have stored your downloaded OpenRefine files
  • Navigate to the folder openrefine-3.7.7/webapp/extensions
  • In extensions, create a new folder titled rdf-extension

  • Find your downloaded RDF extension zip file. Click to Extract All, and when asked for a destination folder navigate to the new rdf-extension folder you created in your OpenRefine extensions. Extract to this location.

  • Close the OpenRefine application and re-start it to load the new extension.

If you've completed installation of the RDF extension as outlined in the previous exercise, you'll notice a new box in the upper right corner of your OpenRefine project screen, by Extensions, labeled "RDF." In this exercise, we will add a reconciliation service using this resource description framework extension. A reconciliation service works by querying websites and bringing controlled lists of terms over to your OpenRefine project. In this case we will add a service for MeSH, the Medical Subject Headings list, so we can compare terms from the Other_Diagnosis field in the sample dataset to the controlled MeSH terms used in PubMed and many biomedical repositories.

Steps to Add the MeSH SPARQL Endpoint for Reconciliation

  • From the RDF extension drop down menu at the top right of your OpenRefine project screen, choose the option to add a reconciliation service.

  • From the drop down list of options, choose to add a service based on a SPARQL endpoint. A SPARQL endpoint is a kind of port on an HTML website or web service that is capable of receiving queries in a specialized format. When you add such an endpoint to OpenRefine, it allows the program to query the web service.

  • In the SPARQL Endpoint dialog box, provide a human-readable name for your SPARQL-based reconciliation service. Then enter the Medical Subject Headings (MeSH) SPARQL endpoint: http://id.nlm.nih.gov/mesh/sparql  Then click OK.

  • The SPARQL endpoint will now be available under the name provided, in this case "MeSH Reconciliation," whenever you would like to reconcile OpenRefine columns against Medical Subject Heading terms. Learn how to do this in the next section.

Reconciliation services allow you to look up terms from your data source in other sources online that maintain controlled vocabularies, such as Wikidata or MeSH. This allows you to add to your project metadata values that are expressed in controlled ways, which can help with interoperability. In this example, we will reconcile one value from the Other_Diagnosis column in the sample dataset against the Medical Subject Headings (MeSH) in order to find the MeSH-controlled way of expressing it.

Steps for Reconciliation using the MeSH SPARQL Endpoint

  • Star a row which contains a term in the Other_Diagnosis field that you’d like to reconcile against MeSH. In the example we’ll use the row with ‘hypothermia’ in Other_Diagnosis. Facet by Star from the All column so that only this row is selected. We include this step so that reconciliation will work over only one term. Since a website must be queried in the process, reconciling against thousands of terms can take a few hours.

  • In the Facet results, choose "true" to work with only the starred row.
  • Click the Other_Diagnosis column drop-down and choose: Reconcile - Start reconciling.

  • In the next screen you will see a list of the installed reconciliation services you can choose from. Choose MeSH Reconciliation, which we installed in the previous step, from the available Services. This will start the query to the MeSH SPARQL endpoint, which can take up to five minutes.
  • When the Reconciliation screen appears, choose the entity which ends in “TopicalDescriptor” on the left in order to query the MeSH topical subject headings. On the right you can choose other columns from your project that might be helpful to the reconciliation service as it queries MeSH’s available definitions for the chosen term. If none of the other columns are related to your chosen term, leave this area blank. Click “Start Reconciling.” Be prepared to wait a few minutes while OpenRefine queries the Reconciliation service.

  • After reconciling, two new facets appear: Other_Diagnosis: judgment and Other_Diagnosis: best candidate’s score. The judgment facet shows which values have been matched. Near “matched,” there may also be a value, “none,” if you ran the Reconciliation over multiple rows and some items failed to match. As you make matches between your values and the reconciled terms, the number in ‘none’ will go down and the ‘matched’ number will go up. The best candidate’s score facet has to do with how well the values matched against those from the online authority file, based on fuzzy matching.

  • Notice the check-marks next to the choices for matched values back in our original Other_Diagnosis column. As you can see, you can choose to match your original term to the term “Hypothermia” or to two other related terms. To complete the Reconciliation, we can choose one or the other. Clicking the single-check box will accept the answer for one cell. Clicking the multi-check box will accept the new value for all cells in which it appears.
  • The Reconciliation function likely worked quickly because we starred and faceted on one row, and therefore were only reconciling one term. The more terms you reconcile at once, the longer it will take for OpenRefine to complete the function. For more on OpenRefine Reconciliation services, see: https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources