R Module 8: Handling Missing Values

Missing values can be a problem because they distort the results of the computer’s data analysis. In our case, we know that a missing value means there were 0 months in the dataset for a given crime type committed by female offenders. But the computer does not know that. Hence, when calculating the average number of months for all crimes with female offenders, the computer excludes the crime types with missing values from the calculation. As a result, the average is incorrect. We can fix this by replacing the missing values with zeros.
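The effect described above can be seen on a small made-up example. The sketch below is only an illustration: the vector months and its values are hypothetical and are not taken from the Altoona data. It compares the average computed over the non-missing values with the average obtained once each NA is treated as 0.

```r
# Hypothetical toy data: months recorded for five crime types (not the Altoona data).
months <- c(12, NA, 30, NA, 18)

# Average over the three non-missing values only (NA entries are dropped).
mean_ignoring_na <- mean(months, na.rm = TRUE)

# Treat each NA as 0 months, then average over all five crime types.
months_zero <- months
months_zero[is.na(months_zero)] <- 0
mean_with_zeros <- mean(months_zero)

print(mean_ignoring_na)  # 20
print(mean_with_zeros)   # 12
```

In this toy case, dropping the missing values divides the same total (60) by 3 instead of 5, which is why the two averages differ.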

R Script (copy the code below and paste it into RStudio; do not copy the output results)

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 8: Handle Missing Values

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# 'Altoona Crime Rates by Sex.csv'

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRatesbySex <- read.csv("Altoona Crime Rates by Sex.csv", sep = ",", header = TRUE)

OUTPUT (a new data frame is created with 35 observations of 4 variables):

# Checking: View the data set that was read in, including its missing values.

View(AltoonaCrimeRatesbySex)

OUTPUT (Notice variable F has missing values where it says “NA”):
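The viewer is not the only way to spot missing values; they can also be counted in code. The sketch below is only an illustration: it builds a small made-up CSV in memory (the column names Crime and M and all of the values are assumptions, not the lab data) and counts the NA entries with is.na().

```r
# Hypothetical stand-in for the lab file; only the column name F comes from the lab.
csv_text <- "Crime,F,M
Arson,,14
Burglary,24,36
Fraud,,20
Theft,6,9"

# read.csv() turns blank fields in a numeric column into NA.
toy <- read.csv(text = csv_text, header = TRUE)

# is.na() returns TRUE for each missing entry; sum() counts the TRUEs.
n_missing_F <- sum(is.na(toy$F))

# Number of missing values in every column at once.
missing_per_column <- colSums(is.na(toy))

print(n_missing_F)
print(missing_per_column)
```

Running the same two checks on AltoonaCrimeRatesbySex should report how many values are missing in F without scrolling through the viewer.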

# # # # STEP 2: REPLACE MISSING VALUES WITH ZERO

# The code below says "for column F in AltoonaCrimeRatesbySex,

# wherever the value is NA (is.na), replace it with 0".

AltoonaCrimeRatesbySex$F[is.na(AltoonaCrimeRatesbySex$F)] <- 0
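To see how this replacement works, it can help to break it into steps on a tiny, made-up data frame. In the sketch below, the name toy, the column Crime, and all of the values are hypothetical; only the pattern matches the lab code.

```r
# Hypothetical toy data frame with two missing values in column F.
toy <- data.frame(Crime = c("A", "B", "C", "D"), F = c(5, NA, 12, NA))

# Step 1: is.na() marks each missing entry with TRUE.
is_missing <- is.na(toy$F)
print(is_missing)  # FALSE TRUE FALSE TRUE

# Step 2: assign 0 only to the positions marked TRUE.
toy$F[is_missing] <- 0
print(toy$F)  # 5 0 12 0
```

The lab code does both steps in a single line by writing is.na(...) directly inside the square brackets.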

# Checking: Have the missing values been replaced with zeros?

View(AltoonaCrimeRatesbySex)

OUTPUT (Notice the missing values under F have been replaced with zeros):

Challenges:

1. One of the reasons we handled missing values was that, with the missing values present, the computer was calculating the wrong average number of months for all crimes with female offenders. That average was found to be around 30 (30.72) in Module 7. What is the actual average now, with no missing values? Use the summary command introduced in Module 1 to find out. Was the original average underestimated or overestimated?
2. Update the process design so that missing values are replaced by the average rather than by zero. What would be the answer to question 1 in this case? To answer, run the Module 8 code from the start, and use the alternative STEP 2 below.

# # # # STEP 2 (ALTERNATIVE): REPLACE MISSING VALUES WITH MEAN

AltoonaCrimeRatesbySex$F[is.na(AltoonaCrimeRatesbySex$F)] <-
  round(mean(AltoonaCrimeRatesbySex$F, na.rm = TRUE))
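The alternative step can be tried out on made-up numbers before applying it to the real data. The sketch below is only an illustration: the vector f and its values are hypothetical. It fills each NA with the rounded mean of the non-missing values, the same approach as the alternative STEP 2.

```r
# Hypothetical toy values with two missing entries (not the Altoona data).
f <- c(10, NA, 21, NA, 30)

# Mean of the non-missing values, rounded to a whole number of months.
fill_value <- round(mean(f, na.rm = TRUE))
print(fill_value)  # 20

# Replace each NA with that rounded mean.
f[is.na(f)] <- fill_value
print(f)        # 10 20 21 20 30
print(mean(f))  # 20.2
```

Here the average after filling (20.2) stays close to the average of the non-missing values (about 20.3); the small gap comes from the rounding.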