GalterGuides: Cleaning Spreadsheet Data with OpenRefine: Advanced: Webscraping

Retrieving Data from URLs

A powerful feature of OpenRefine is the ability to pull external data from URLs. This exercise will apply this feature to the Other_Diagnosis column of the sample dataset. Pulling in data from outside sources can be time-consuming as OpenRefine must query them, therefore we'll work with just one row.

Steps for Fetching Data from a URL

In the left-most column there is a blank star. Click this to apply a star to row 4, which has a value of Asthma for Other_Diagnosis. Then from the All column on the far left, select Facet, then Facet by star. This will result in Boolean-style true/false choices in the Facet area.

Select true from the choices that come up in the Facet area. Now only row 4 should be selected.
From the Other_Diagnosis column drop down arrow, select Edit column, then Add column by fetching URL.

For the new column's title, use Wikipedia_Info. In the Expression box, type the following, including the spaces:

"https://en.wikipedia.org/w/api.php?action=opensearch&action=query&generator=prefixsearch&gpssearch="

+ value.escape('url')

+ "&prop=extracts&exintro=1&explaintext=1&redirects=1&limit=10&namespace=0&format=xml"

The result will be that a long string of HTML encoded text will be added in the new column next to Other_Diagnosis

See the next section to learn how to parse useful bits from this data.

Cleaning Spreadsheet Data with OpenRefine

Retrieving Data from URLs

Steps for Fetching Data from a URL

Northwestern University Feinberg School of Medicine

Northwestern University
Feinberg School of Medicine