A powerful feature of OpenRefine is the ability to pull external data from URLs. This exercise will apply this feature to the Other_Diagnosis column of the sample dataset. Pulling in data from outside sources can be time-consuming as OpenRefine must query them, therefore we'll work with just one row.
Steps for Fetching Data from a URL
- In the left-most column there is a blank star. Click this to apply a star to row 4, which has a value of Asthma for Other_Diagnosis. Then from the All column on the far left, select Facet, then Facet by star. This will result in Boolean-style true/false choices in the Facet area.
- Select true from the choices that come up in the Facet area. Now only row 4 should be selected.
- From the Other_Diagnosis column drop down arrow, select Edit column, then Add column by fetching URL.
- For the new column's title, use Wikipedia_Info. In the Expression box, type the following, including the spaces:
"https://en.wikipedia.org/w/api.php?action=opensearch&action=query&generator=prefixsearch&gpssearch="
+ value.escape('url')
+ "&prop=extracts&exintro=1&explaintext=1&redirects=1&limit=10&namespace=0&format=xml"
- The result will be that a long string of HTML encoded text will be added in the new column next to Other_Diagnosis
See the next section to learn how to parse useful bits from this data.