Skip to Main Content

Data Cleaning with Excel

Through this guide you will learn tricks and tips to better clean and organize tabular data using Microsoft Excel.

Controlling Terms: Cell Validation and Custom Lists

With the cell validation feature you can set up your cells to only accept certain values, even from the beginning of data collection. Much like a validated field in REDCap, this prevents any terms other than those on the list from being entered as data values. For this example we'll work with the Other Diagnosis column in the NEISS small subset spreadsheet.

Setting up Cell Validation Instructions

  • Create a new column titled "Diagnosis Controlled Vocab" next to Other_Diagnosis. Fill it with a small sample of diagnoses
  • Select the cells you wish to validate, in this case all the cells in Other_Diagnosis (M2:M4531)
  • From the Data Tab choose Data Validation. In the resulting list of choices, choose Data Validation

Screen Clipping showing the Data Validation setup dialog box, accessible from the Data Validation option on the Data tab on the Excel ribbon. The source field is circled to show the format for entering the parameters of the source cells.

  • Under Validation Criteria you have many choices of the kinds of values you can allow such as numbers formatted in certain ways, dates, etc. Choose 'List.'
  • For the Source, click in the white box, then in the first cell of the values in the 'Diagnosis Controlled Vocab' column, type : , delete anything that auto-fills, and replace it with $(Letter)$(Number) for the last cell in the 'Diagnosis Controlled Vocab' column. The result will be something like $N$2:$N$8. Click OK.
  • Once setup is done, go back to the Other_Diagnosis column, choose a field, and try to enter a data value that is not on the controlled list

Screen Clipping showing the drop-down menu that can be accessed in the data-validated column after validation has been set up. A red arrow points to this menu for emphasis.

  • Notice that a drop-down list of your entire controlled vocabulary appears. If you try to enter a term that is not on the list, you should see this error:

Screen Clipping showing the error message that the user receives when they try to enter a value in a cell in a data-validated column that is not present in the source list. The incorrect value is circled for emphasis.

Custom lists are used to make sort orders appear in Excel in a specified way. For instance, if you had a column where the data values were the spelled-out months of the year and sorted it, the order would be alphabetical (April, August, etc.) rather than the order in which the months occur in the year. Here we'll demonstrate how to apply a custom list while also doing more data wrangling on the Months column from the Splitting Multi-Valued Cells exercise. If you completed that exercise with the NEISS small subset dataset, you should have a column called Month with the months represented by their numbers.

Steps to Apply a Custom Sort to the Month column

  • Insert a new column next to Month called 'Month Spelled Out'
  • In the first cell (row 2) enter the formula =TEXT(E2*29,"mmmm") This converts the number representing the month to the spelled out month name
  • Double click in the lower right corner to copy this formula down the spreadsheet

Screen Clipping showing how to find the Edit Custom Lists button within the Advanced Excel options, which is accessible by going to File from the ribbon, then Options, then Advanced. Edit Custom Lists is circled for emphasis.

  • Next, in the Ribbon go to File, then Options, then Advanced, then General, then click the 'Edit Custom Lists' button. This is simply to demonstrate how to find Excel's built-in collection of custom lists. A list already exists for the months of the year spelled out. Notice that there is an option to create a new list, which will be saved in the system for future use. Click OK.

Screen Clipping showing the Edit Custom Lists dialog box accessible from the path File, Options, Advanced, General in Excel. Each custom list is displayed on the left with an edit screen on the right.

  • Back in the worksheet, go to the Data Tab on the Ribbon, then Sort. In the dialog box that pops up, check that your data has headers, for Sort By choose the Months Spelled Out column, and in the order option choose Custom List. Choose the months option

Screen Clipping showing where to find the Custom List option in the Excel Sort menu. Custom List is circled for emphasis.

  • The result will be that your spreadsheet's data will now sort in the order in which the months occur in the year.