Skip to main content
It looks like you're using Internet Explorer 11 or older. This website works best with modern browsers such as the latest versions of Chrome, Firefox, Safari, and Edge. If you continue with this browser, you may see unexpected results.

Data Cleaning with Excel

Through this guide you will learn tricks and tips to better clean and organize tabular data using Microsoft Excel.

Cleaning Scenario Using COUNTIF, Advanced Filters, and Find & Replace

The example scenario in this section of the guide is drawn from the class 10 Simple Rules for Data Cleaning with Excel, and utilizes a small subset of the National Electronic Injury Surveillance System dataset from 2018. This dataset compiles information from patient visits for emergency treatment for injuries involving commercial products. In the dataset's "Narrative" column, the patient's experience is recorded. Data entry guidelines for the dataset request that age and gender of the patients be recorded in the format "45 YOM" for "year-old-male", "53 YOF" for "year-old-female," etc. However the YOM/YOF format was not always strictly followed, resulting in messy data.

As a first step in this multi-part cleaning task, we'll find out how many rows in the spreadsheet (a small sample version used in the class) correctly used the YOM/YOF format.

COUNTIF Instructions

  • Select a column (Narrative_1 in this example), then choose an empty cell beneath it
  • From the Formulas tab, choose "Insert Function" and search for "COUNTIF"
  • Enter a range of cells over which to count (this can also by done by clicking first in the Range box and then in the corresponding cells in the spreadsheet, but be careful of unintended autofills)
  • Enter the criteria that you would like Excel to count over (i.e., count a cell if the criteria is true). Use the format "*text*" to search for a string of characters no matter where it occurs in a cell. In our example, the criteria are either "*YOM*" or "*YOF*"
  • A total for your count appears in the field where you placed the formula
  • Do a second count and add the totals, then check if they match the total number of rows

Note: No instructions are provided in the NEISS coding manual for a non-binary gender abbreviation in the Narrative field.

In our COUNTIF exercise on a small subset of the NEISS data, the total numbers of rows that use the YOM/YOF format in the Narrative field are short of the total row number (4531), confirming that incorrect format was used. In the next exercise, we'll use advanced filters to filter out the rows with the correct format so we can focus on the incorrect ones that require data cleaning.


References:

1. US Consumer Product Safety Commission. NEISS Coding Manual. January 2019. Available at: https://www.cpsc.gov/s3fs-public/2019_NEISS_Coding_Manual.pdf

2. US Consumer Product Safety Commission. NEISS Highlights, Data and Query Builder. Available at: https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx

From what we learned in the COUNTIF exercise, we know we must filter out all rows containing YOM or YOF in the Narrative column of our dataset in order to clean the data in the Narrative fields that used an incorrect format. For this task we can use advanced filters.

 

Steps to create a custom text filter:

  • From the Data Tab, choose Filter
  • Choose your column on which to filter (Narrative) and click the drop-down arrow. Then choose Text Filters (just above the Search Bar)
  • Choose Custom Filter (the last choice in the menu)
  • The dialog box that pops up offers specific criteria from which you can choose to build your custom filter, such as whether the cells in the column contain, begin with, or end with a certain string of text. The Advanced Filter option assumes that you're searching for a string of text, so the "*text*" format is not necessary
  • For our example, choose "does not contain" from the criteria choices in each of the two criteria boxes on the left. In the blank text boxes enter YOM in one and YOM if the other. Join these two criteria with "and"

The result of this advanced filtering is that all the rows containing YOM and YOF in the Narrative field will be filtered out, leaving only the rows where incorrect formats were used to express this data. These rows are now ready to be cleaned.

Note: The above is a fairly simple example of applying Advanced Filter criteria. For instructions on applying more complicated criteria, see this example from Microsoft Support.

Building on work from the COUNTIF and Advanced Filter exercises, we can now use Find & Replace to correct the incorrect age and gender abbreviations in the Narrative column. If you completed the Advanced Filter exercise first and left the filter on, then all rows using the correct YOM/YOF format are temporarily filtered out. 

Find & Replace instructions:

  • From the Home tab on the Ribbon, choose Find & Select (this feature is at the end of the Ribbon, with a magnifying glass icon)
  • Find a string (e.g., YM) and replace it with the desired text (YOM)
  • Using the Filter arrow still at the top of the Narrative_1 column, go back to Text Filters, then Custom Filter. The filter criteria you entered before should still be present. Click OK, and this filter will run again, taking out the rows that now meet the criteria due to the Find & Replace change
  • Repeat the above steps as many times as necessary to replace all the varied phrases that were used (such as 'YR old female') with YOM or YOF as appropriate