The example scenario in this section of the guide is drawn from the class 10 Simple Rules for Data Cleaning with Excel, and utilizes a small subset of the National Electronic Injury Surveillance System dataset from 2018. This dataset compiles information from patient visits for emergency treatment for injuries involving commercial products. In the dataset's "Narrative" column, the patient's experience is recorded. Data entry guidelines for the dataset request that age and gender of the patients be recorded in the format "45 YOM" for "year-old-male", "53 YOF" for "year-old-female," etc. However the YOM/YOF format was not always strictly followed, resulting in messy data.
As a first step in this multi-part cleaning task, we'll find out how many rows in the spreadsheet (a small sample version used in the class) correctly used the YOM/YOF format.
Note: No instructions are provided in the NEISS coding manual for a non-binary gender abbreviation in the Narrative field.
In our COUNTIF exercise on a small subset of the NEISS data, the total numbers of rows that use the YOM/YOF format in the Narrative field are short of the total row number (4531), confirming that incorrect format was used. In the next exercise, we'll use advanced filters to filter out the rows with the correct format so we can focus on the incorrect ones that require data cleaning.
References:
1. US Consumer Product Safety Commission. NEISS Coding Manual. January 2019. Available at: https://www.cpsc.gov/s3fs-public/2019_NEISS_Coding_Manual.pdf
2. US Consumer Product Safety Commission. NEISS Highlights, Data and Query Builder. Available at: https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx
From what we learned in the COUNTIF exercise, we know we must filter out all rows containing YOM or YOF in the Narrative column of our dataset in order to clean the data in the Narrative fields that used an incorrect format. For this task we can use advanced filters.
The result of this advanced filtering is that all the rows containing YOM and YOF in the Narrative field will be filtered out, leaving only the rows where incorrect formats were used to express this data. These rows are now ready to be cleaned.
Note: The above is a fairly simple example of applying Advanced Filter criteria. For instructions on applying more complicated criteria, see this example from Microsoft Support.
Building on work from the COUNTIF and Advanced Filter exercises, we can now use Find & Replace to correct the incorrect age and gender abbreviations in the Narrative column. If you completed the Advanced Filter exercise first and left the filter on, then all rows using the correct YOM/YOF format are temporarily filtered out.