Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Add the MeSH SPARQL Endpoint

If you've completed installation of the RDF extension as outlined in the previous exercise, you'll notice a new box in the upper right corner of your OpenRefine project screen, by Extensions, labeled "RDF." In this exercise, we will add a reconciliation service using this resource description framework extension. A reconciliation service works by querying websites and bringing controlled lists of terms over to your OpenRefine project. In this case we will add a service for MeSH, the Medical Subject Headings list, so we can compare terms from the Other_Diagnosis field in the sample dataset to the controlled MeSH terms used in PubMed and many biomedical repositories.

Steps to Add the MeSH SPARQL Endpoint for Reconciliation

  • From the RDF extension drop down menu at the top right of your OpenRefine project screen, choose the option to add a reconciliation service.

Screenshot showing the newly added RDF extension drop-down menu in OpenRefine

  • From the drop down list of options, choose to add a service based on a SPARQL endpoint. A SPARQL endpoint is a kind of port on an HTML website or web service that is capable of receiving queries in a specialized format. When you add such an endpoint to OpenRefine, it allows the program to query the web service.

Screenshot showing the options of types of reconciliation services that can be added in OpenRefine

  • In the SPARQL Endpoint dialog box, provide a human-readable name for your SPARQL-based reconciliation service. Then enter the Medical Subject Headings (MeSH) SPARQL endpoint: http://id.nlm.nih.gov/mesh/sparql  Then click OK.

Screenshot showing how to add a SPARQL-based reconciliation service in OpenRefine

  • The SPARQL endpoint will now be available under the name provided, in this case "MeSH Reconciliation," whenever you would like to reconcile OpenRefine columns against Medical Subject Heading terms. Learn how to do this in the next section.