Once we have combined multiple datasets, we can gain further insights by creating new columns, and then sharpen the focus by removing old ones. A new column can hold a formula that describes the data in a new way, such as the ratio of two existing columns. An old column can be removed when its data is not of interest in the given analysis.
In this tutorial, we are going to create & remove columns in our combined dataset to answer the following questions:
1. What is Altoona’s crime rate, and has it really increased over the period, or not?
2. Looking at the proportion of different racial and ethnic groups in Altoona’s population, are any groups significantly overrepresented or underrepresented in recorded crimes?
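Before the full script, here is a minimal sketch of the two operations on a toy data frame. The data frame `toy`, its column names, and all numbers are made up for illustration and are not part of the Altoona data.

```r
# Toy data frame with made-up numbers (not the Altoona data).
toy <- data.frame(
  Month  = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
  Events = c(10, 12, 9),
  People = c(1000, 1000, 1000),
  Notes  = c("a", "b", "c")
)

# Create a new column from two existing ones (a proportion, not multiplied by 100).
toy$Rate <- toy$Events / toy$People

# Keep only the columns of interest; every other column is removed.
toy <- toy[, c("Month", "Rate")]

print(toy)
```

The script below follows the same pattern, first adding columns with `$` and `<-`, then keeping a short list of columns by name.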
R Script (copy below code & paste into RStudio; do not copy the output results)
# # # # # NSF Project “Big Data Education” (Penn State University)
# # # # # More info: http://sites.psu.edu/bigdata/
# # # # # Lab 1: Altoona Crime Rates – Module 4: Creating & Removing Columns
# # # # STEP 0: SET WORKING DIRECTORY
# Set working directory to a folder with the following file:
# ‘Altoona Combined Data.csv’
# # # # STEP 1: READ IN THE DATA
AltoonaCombinedData <- read.csv("Altoona Combined Data.csv", sep=",", header=TRUE)
OUTPUT: a new data frame is created with 36 observations of 58 variables.
# Change "Month" into a time variable that R understands.
# It is a good practice to always do this in R with time variables.
AltoonaCombinedData$Month <- as.Date(AltoonaCombinedData$Month, format="%Y-%m-%d")
# # # # STEP 2: CREATE NEW COLUMNS
# Define new variable '%Crimes.By.Black' in AltoonaCombinedData
# which equals Adult.Race.Black/Adult.Total, both from AltoonaCombinedData.
# (Despite the % in the name, the value is a proportion, not multiplied by 100.)
AltoonaCombinedData$'%Crimes.By.Black' <-
  AltoonaCombinedData$Adult.Race.Black/AltoonaCombinedData$Adult.Total
# Define new variable '%Pop.Black' in AltoonaCombinedData
# which equals Pop.Race.Black/Pop.Total, both from AltoonaCombinedData
AltoonaCombinedData$'%Pop.Black' <-
  AltoonaCombinedData$Pop.Race.Black/AltoonaCombinedData$Pop.Total
# Define new variable '%Crimes.By.White' in AltoonaCombinedData
# which equals Adult.Race.White/Adult.Total, both from AltoonaCombinedData
AltoonaCombinedData$'%Crimes.By.White' <-
  AltoonaCombinedData$Adult.Race.White/AltoonaCombinedData$Adult.Total
# Define new variable '%Pop.White' in AltoonaCombinedData
# which equals Pop.Race.White/Pop.Total, both from AltoonaCombinedData
AltoonaCombinedData$'%Pop.White' <-
  AltoonaCombinedData$Pop.Race.White/AltoonaCombinedData$Pop.Total
# Define new variable '%Crimes.in.Population' in AltoonaCombinedData
# which equals Adult.Total/Pop.Total, both from AltoonaCombinedData
AltoonaCombinedData$'%Crimes.in.Population' <-
  AltoonaCombinedData$Adult.Total/AltoonaCombinedData$Pop.Total
OUTPUT: the data frame increases from 58 to 63 variables.
# Checking: See the 5 new variables in a table
View(AltoonaCombinedData)
OUTPUT: scroll to the right end to see the 5 newly created variables.
# # # # STEP 3: USE NEW COLUMNS TO CREATE EVEN NEWER COLUMNS
# Define new variable 'Diff.Black' in AltoonaCombinedData
# which equals '%Crimes.By.Black' - '%Pop.Black', both from AltoonaCombinedData
# (note we must use 'quotation marks' or `backticks` when calling variables
# with % in their name; plain names such as Diff.Black do not need them)
AltoonaCombinedData$Diff.Black <-
  AltoonaCombinedData$'%Crimes.By.Black' - AltoonaCombinedData$'%Pop.Black'
# Define new variable 'Diff.White' in AltoonaCombinedData
# which equals '%Crimes.By.White' - '%Pop.White', both from AltoonaCombinedData
AltoonaCombinedData$Diff.White <-
  AltoonaCombinedData$'%Crimes.By.White' - AltoonaCombinedData$'%Pop.White'
OUTPUT: the data frame increases from 63 to 65 variables.
# Checking: See the 2 new variables in a table
View(AltoonaCombinedData)
OUTPUT: scroll to the right end to see the 2 newly created variables.
# # # # STEP 4: REMOVE UNIMPORTANT COLUMNS
# Remove all columns except those listed here, i.e. keep only the listed columns.
AltoonaCombinedData <- AltoonaCombinedData[ , c("Month",
                                                "Diff.Black", "Diff.White",
                                                "%Crimes.in.Population")]
OUTPUT: the data frame decreases from 65 to 4 variables.
# Checking: See the 4 remaining variables in a table
View(AltoonaCombinedData)
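The note in STEP 3 about quoting names that contain % can be tried out on a small example. The sketch below uses a made-up data frame `toy` with hypothetical columns `Part` and `Whole`; it shows three ways to reach such a column and a quick way to confirm the column count without opening the viewer.

```r
# Toy data frame with made-up numbers (not the Altoona data).
toy <- data.frame(Part = c(2, 3), Whole = c(10, 12))

# A name starting with "%" is not a regular R name, so it must be quoted.
toy$'%Part' <- toy$Part / toy$Whole

# Three equivalent ways to read the same column.
a  <- toy$'%Part'      # quotation marks after $
b  <- toy$`%Part`      # backticks after $
c2 <- toy[["%Part"]]   # double square brackets with a string

# TRUE when all three give the same values.
identical(a, b) && identical(b, c2)

# Confirm the column names and the number of columns.
names(toy)
ncol(toy)
```

On the real data, running `ncol(AltoonaCombinedData)` after each step should match the variable counts stated in the OUTPUT notes.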
Challenges:
1. What is Altoona’s crime rate (range), and has it really increased over the period, or not? To answer, you can use the functions summary and View, covered in Module 1, to look at the numerical values of the variable %Crimes.in.Population, and the function plot, covered in Module 3, to chart the variable over time. See the code in STEP 5 below.
2. Looking at the proportion of black and white racial groups in Altoona’s population, are they significantly overrepresented or underrepresented in recorded crimes? To answer, you can use the function plot, covered in Module 3. See STEP 6 below.
3. Looking instead at the proportion of Hispanic and non-Hispanic ethnic groups, do we see a similar pattern of over- and underrepresentation as with black and white race? To answer, you can adjust the Module 4 code above and the STEP 6 code below to the new variables of interest. Make sure you re-run the Module 4 code from the start so that you end up with the right variables!
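For Challenge 3, one way to avoid copying each formula by hand is to wrap the calculation in a small function. This is only a sketch: the function `share_diff`, the data frame `toy`, and every column name below are hypothetical, so replace them with the actual column names in your combined dataset.

```r
# Hypothetical helper: a group's share of recorded crimes minus its share
# of the population. A positive value means the group is overrepresented.
share_diff <- function(data, crime_group, crime_total, pop_group, pop_total) {
  data[[crime_group]] / data[[crime_total]] -
    data[[pop_group]] / data[[pop_total]]
}

# Toy data with made-up numbers and hypothetical column names.
toy <- data.frame(
  Crimes.GroupA = c(20, 30),
  Crimes.Total  = c(100, 100),
  Pop.GroupA    = c(100, 100),
  Pop.Total     = c(1000, 1000)
)

# Same calculation as Diff.Black and Diff.White, for an arbitrary group.
toy$Diff.GroupA <- share_diff(toy, "Crimes.GroupA", "Crimes.Total",
                              "Pop.GroupA", "Pop.Total")
toy$Diff.GroupA
```

Passing column names as strings means the same function can be reused for each group by changing only its arguments.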
# # # # STEP 5 (after STEPS 1-4 in this Module): INSPECT THE CRIME RATE VARIABLE
summary(AltoonaCombinedData$'%Crimes.in.Population')
View(AltoonaCombinedData$'%Crimes.in.Population')
plot(AltoonaCombinedData$Month,
     AltoonaCombinedData$'%Crimes.in.Population',
     xlab="Month", ylab="%Crimes.in.Population")
lines(AltoonaCombinedData$Month,
      AltoonaCombinedData$'%Crimes.in.Population')
# # # # STEP 6 (after STEPS 1-5 in this Module): COMPARE THE RACE VARIABLES
# Set plot's margins to have enough room for dual axes.
par(mar = c(4.1, 4.1, 4.1, 4.1))
# Plot variables Month and Diff.Black, with red dots,
# x axis labelled "Month", y axis with no label, no numbers, and ranging (-0.2,0.2).
# (This is because the 2 variables we are comparing have a (-0.2,0.2) range,
# Diff.Black goes up to 0.2, Diff.White goes down to -0.2).
plot(AltoonaCombinedData$Month, AltoonaCombinedData$Diff.Black,
     col="red",
     xlab="Month", ylab=NA, yaxt="n", ylim=c(-0.2,0.2))
# Add lines to the plot connecting the dots, color red.
lines(AltoonaCombinedData$Month,
      AltoonaCombinedData$Diff.Black, col="red")
# Add new y axis on the right.
axis(4)
# Add label "Diff.Black" to the new y axis.
mtext("Diff.Black", 4, line = 2)
# Announce new plot on the same graph.
par(new=TRUE)
# Plot variables Month and Diff.White, with green dots,
# label "Month" on x axis, label "Diff.White" on y axis,
# and the same (-0.2,0.2) range so both variables share one scale.
plot(AltoonaCombinedData$Month, AltoonaCombinedData$Diff.White,
     col="green",
     xlab="Month", ylab="Diff.White", ylim=c(-0.2,0.2))
# Add lines to the plot connecting the dots, color green.
lines(AltoonaCombinedData$Month, AltoonaCombinedData$Diff.White,
      col="green")
# Add legend in the bottom left of the graph.
# Both lines are drawn solid, so both legend entries use lty=1.
legend("bottomleft",
       legend=c("Diff.Black", "Diff.White"),
       col=c("red", "green"), lty=1, cex=0.8)
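Because both variables in STEP 6 are drawn on the same (-0.2,0.2) range, a single y axis can also be used. The sketch below is an alternative, not the method of this module: it draws two series on one set of axes and adds a dotted reference line at zero. The vectors `month`, `diff_one`, and `diff_two` are made-up stand-ins for Month, Diff.Black, and Diff.White.

```r
# Toy series with made-up numbers (not the Altoona results).
month    <- seq(as.Date("2020-01-01"), by = "month", length.out = 6)
diff_one <- c(0.10, 0.12, 0.08, 0.11, 0.09, 0.10)
diff_two <- c(-0.12, -0.15, -0.10, -0.13, -0.11, -0.12)

# One shared y range that covers both series.
y_range <- range(diff_one, diff_two)

# The first series sets up the axes; type = "b" draws both points and lines.
plot(month, diff_one, type = "b", col = "red",
     xlab = "Month", ylab = "Difference in shares", ylim = y_range)
# The second series is added to the same axes.
lines(month, diff_two, type = "b", col = "green")
# Reference line at zero: above means overrepresented, below underrepresented.
abline(h = 0, lty = 3)
# Legend placed on the left, at mid-height.
legend("left", legend = c("diff_one", "diff_two"),
       col = c("red", "green"), lty = 1, pch = 1, cex = 0.8)
```

Computing the range with `range()` avoids hard-coding the limits, which helps when the code is adapted to other variables.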