Installing extensions is the first step toward using some of the online services that OpenRefine can interact with. A list of currently supported extensions is available on the OpenRefine downloads page. In this exercise we'll install the RDF extension. RDF stands for “Resource Description Framework,” a framework for representing information on the web. With this extension in place, we can query APIs and SPARQL endpoints, which lets us bring information from outside websites into our OpenRefine projects.
Navigate to the OpenRefine downloads page and, under the List of Extensions, click RDF Extension Latest.
Find the downloaded RDF extension zip file. Click Extract All and, when asked for a destination folder, navigate to the new rdf-extension folder you created in your OpenRefine extensions folder. Extract to this location.
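If you prefer to script the extraction rather than click through it, the same step can be sketched in Python. This is a minimal sketch, not part of the OpenRefine instructions: the function name extract_extension and the example paths in the comment are hypothetical, and you would substitute the actual location of your download and of your OpenRefine extensions folder.

```python
import zipfile
from pathlib import Path


def extract_extension(zip_path, extensions_dir, folder_name="rdf-extension"):
    """Unpack an extension archive into <extensions_dir>/<folder_name>.

    The folder is created if it does not exist yet, which mirrors
    creating the rdf-extension folder by hand before extracting.
    """
    target = Path(extensions_dir) / folder_name
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Example call (both paths are assumptions; adjust them to your machine):
# extract_extension("Downloads/rdf-extension.zip", "OpenRefine/extensions")
```

OpenRefine typically needs to be restarted before a newly extracted extension shows up.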
If you've completed installation of the RDF extension as outlined in the previous exercise, you'll notice a new box labeled "RDF" in the upper right corner of your OpenRefine project screen, next to Extensions. In this exercise, we will add a reconciliation service using the RDF extension. A reconciliation service works by querying websites and bringing controlled lists of terms over to your OpenRefine project. In this case we will add a service for MeSH, the Medical Subject Headings list, so we can compare terms from the Other_Diagnosis field in the sample dataset to the controlled MeSH terms used in PubMed and many biomedical repositories.
Reconciliation services allow you to look up terms from your data in online sources that maintain controlled vocabularies, such as Wikidata or MeSH. This lets you add metadata values to your project that are expressed in controlled ways, which can help with interoperability. In this example, we will reconcile one value from the Other_Diagnosis column in the sample dataset against the Medical Subject Headings (MeSH) in order to find the MeSH-controlled way of expressing it.
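To make the lookup more concrete, here is a rough sketch of what a reconciliation request and response can look like, assuming a service that follows the common Reconciliation Service API convention: a form field named queries holding a JSON object keyed q0, q1, and so on, with candidates returned under result. The function names are hypothetical, the sample response is made-up illustration data rather than real MeSH output, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_reconciliation_body(terms, limit=3):
    """Encode one query per term as the form body sent to a service."""
    queries = {
        f"q{i}": {"query": term, "limit": limit}
        for i, term in enumerate(terms)
    }
    return urlencode({"queries": json.dumps(queries)})


def best_candidates(response):
    """Pick the highest-scoring candidate for each query key."""
    best = {}
    for key, payload in response.items():
        candidates = payload.get("result", [])
        best[key] = max(candidates, key=lambda c: c["score"]) if candidates else None
    return best


body = build_reconciliation_body(["hypothermia"])

# Illustrative response only: the ids and scores below are invented.
sample_response = {
    "q0": {
        "result": [
            {"id": "example-id-1", "name": "Hypothermia", "score": 100, "match": True},
            {"id": "example-id-2", "name": "Hypothermia, Induced", "score": 60, "match": False},
        ]
    }
}
top = best_candidates(sample_response)
print(top["q0"]["name"])
```

OpenRefine handles this exchange for you; the sketch only shows why each term in the column costs a query against the remote service.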
Star a row that contains a term in the Other_Diagnosis field that you’d like to reconcile against MeSH. In the example we’ll use the row with ‘hypothermia’ in Other_Diagnosis. Facet by Star from the All column so that only this row is selected. We include this step so that reconciliation runs on only one term. Because a website must be queried in the process, reconciling thousands of terms can take a few hours.
After reconciling, two new facets appear: Other_Diagnosis: judgment and Other_Diagnosis: best candidate’s score. The judgment facet shows which values have been matched. Alongside ‘matched,’ there may also be a value, ‘none,’ if you ran the reconciliation over multiple rows and some items failed to match. As you make matches between your values and the reconciled terms, the number under ‘none’ will go down and the number under ‘matched’ will go up. The best candidate’s score facet indicates how closely each value matched its top candidate from the online authority file, based on fuzzy matching.
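The relationship between the two facets can be illustrated with a small sketch. It assumes a stand-in similarity measure (Python's difflib.SequenceMatcher) and a hypothetical threshold of 90; the score that OpenRefine displays is computed by the reconciliation service and will not match these numbers. The vocabulary and values are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive fuzzy similarity on a 0-100 scale (stand-in metric)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio(), 1)


def judge(value, vocabulary, threshold=90.0):
    """Return (judgment, best candidate, best candidate's score) for one value."""
    best_term = max(vocabulary, key=lambda term: similarity(value, term))
    score = similarity(value, best_term)
    judgment = "matched" if score >= threshold else "none"
    return judgment, best_term, score


# Invented controlled terms and column values, for illustration only.
vocabulary = ["Hypothermia", "Hyperthermia"]
values = ["hypothermia", "cold exposure"]

results = [judge(value, vocabulary) for value in values]
judgment_facet = Counter(judgment for judgment, _, _ in results)
print(dict(judgment_facet))
```

Values that land under ‘none’ are the ones to review by hand, and choosing a candidate for one of them moves it into the ‘matched’ count.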