Skip to Main Content

Essential Spreadsheet Data Cleaning with OpenRefine

This guide accompanies the Galter Health Sciences Library class of the same name, or can be used on its own to learn a few essential data cleaning functions of the open source application OpenRefine.

Join Data from Two OpenRefine Projects

In many data projects you may find that you need to combine data from two tables or spreadsheets. In the case of the sample dataset, NEISS product numbers are referenced, but we don’t know which products these numbers refer to. We need to bring over the names of the products from another table. This table is the other CSV file mentioned in the Getting Started section of this Guide, entitled: “NEISS_ProductCodes_ProductNames.csv”.

Steps to Join Data from Two OpenRefine Projects

  • From your main project screen, select “Open,” a button on the far right.

  • This brings you back to the Open or Create Project screen. If you have not done so previously, download CSV file “NEISS_ProductCodes_ProductNames.csv" and save it in an accessible place. From OpenRefine, Browse to this downloaded file and open it as a new project.

  • A second OpenRefine window will now open for the NEISS_ProductCodes_ProductNames spreadsheet. Set this new project up as you would any other. Choose UTF-8 under character encoding, and make sure to parse the first line of data as the column headers.

  • A screenshot with the resulting project can be seen below. Take note of the project name at the top of the screen, since it will be needed in the next step to join the projects. Instead of a dot there is a space between "ProductNames" and "csv" - this space will need to be maintained in the next step.

  • The NEISS_ProductCodes_ProductNames csv project has the key column (the one that contains the same unique values in both projects, “Product_1”), and the column that we want to bring over to our original project, “Product_Name.”

  • Go back to the window that has your original project, "2019_NEISS_SmallSubset csv".  From the Product_1 column's drop down arrow, choose Edit column - Add column based on this column.

  • In the Expression dialog box that pops up, first give your new column a name (e.g., "Product_Name"). Next we'll type a cell cross formula in order to bring over the data from the Product Codes project. The cell cross operation will always take the format: cell.cross('arg1','arg2').cells['arg3'].value[arg4], where arg1 is replaced by the name of the project that new data will be imported from, arg2 is replaced by the name of the key column in the ‘from’ project, arg3 is the name of the column you’re importing from the ‘from’ project, and arg4 is the position of the value to bring over from this column if it happens to be in the format of an array (most column values are not in array format, so this should almost always be zero).
  • In our example, the cell cross formula will read: cell.cross('NEISS_ProductCodes_ProductNames csv','Product_1').cells['Product_Name'].value[0]  Remember that the single space between "ProductNames" and "csv" must be preserved if this is how the project was named in OpenRefine.

  • The result will be that a new column will be added to your original project called "Product_Name,' which will contain the names of the products that accompany the product codes from the NEISS_ProductCodes_ProductNames spreadsheet.
  • Tip: Remember to start on the OpenRefine project that you’re importing data INTO. You may need to maintain the project that you’re importing data FROM open in a separate window.