R Module 3: Merging & Grouping

[TO BE UPDATED SOON!]

Sometimes, we can get more insight by combining multiple data sources together. In this tutorial, we are going to join the Altoona crime dataset with an Altoona population dataset to compare crime statistics to general population statistics in the area.

R Script (copy below code & paste into RStudio; do not copy the output results)

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 3: Merging & Grouping

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following files:

# ‘Lab 1 Data Altoona Crime Rates.csv’

# ‘Lab 1 Data Altoona Population Estimates.csv’

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRates <- read.csv(“Lab 1 Data Altoona Crime Rates.csv”, sep=”,”, header=TRUE)

AltoonaPopStats <- read.csv(“Lab 1 Data Altoona Population Estimates.csv”,sep=”,”, header=TRUE)

OUTPUT (2 new data frames are created
in Global Environment):

# Checking: See the read in datasets

View(AltoonaCrimeRates)

View(AltoonaPopStats)

OUTPUT 1 (table pops out for AltoonaCrimeRates):

OUTPUT 2 (table pops out for
AltoonaPopStats):

# # # # STEP 2: AGGREGATE THE DATA

# Change “Month” into a time variable that R understands.

AltoonaCrimeRates$Month <- as.Date(AltoonaCrimeRates$Month, format=”%m/%d/%Y”)

AltoonaPopStats$Month <- as.Date(AltoonaPopStats$Month, format=”%m/%d/%Y”)

# Checking: Are there any changes in Month?

View(AltoonaCrimeRates)

View(AltoonaPopStats)

OUTPUT 1 (table pops out, Month has changed format):

OUTPUT 2 (table pops, Month has
changed format):

# Collapse AltoonaCrimeRates by Month down to one observation per month

# Below code excludes columns 1-3 (includes 4-39) because they don’t aggregate well,

# and are not needed for further steps.

AltoonaCrimeRates <- aggregate(. ~ AltoonaCrimeRates$Month, data = AltoonaCrimeRates[,4:39], FUN = sum)

OUTPUT (AltoonaCrimeRates has reduced from 2326 to 55 observations):

# Checking: Is the AltoonaCrimeRates table collapsed by Month?

View(AltoonaCrimeRates)

OUTPUT
(AltooaCrimeRates has only 1 row for each Month, now named AltoonaCrimeRates$Month):

# aggregate changed the aggregating variable’s name,

# so change that back into Month

names(AltoonaCrimeRates)[names(AltoonaCrimeRates) == ‘AltoonaCrimeRates$Month’] <- ‘Month’

# Checking: Has AltoonaCrimeRates$Month been renamed to just Month? View(AltoonaCrimeRates)

OUTPUT (AltooaCrimeRates$Month has indeed been renamed to just Month):

# # # # STEP 3: JOIN THE DATA

# Below line says “define AltoonaCombinedData as a join between AltoonaCrimeRates

# and AltoonaPopStats, using Month in both datasets to join them”.

AltoonaCombinedData <- merge(AltoonaCrimeRates, AltoonaPopStats, by = “Month”)

OUTPUT (AltoonaCombinedData is created, with 36 observations of 57 variables):

# Checking: Does the new data frame have columns from both old ones?

View(AltoonaCombinedData)

OUTPUT (AltoonaCombinedData table pops out, scrolling to the right confirms it contains columns from both AltoonaCrimeRates, and AltoonaPopStats):

# # # # STEP 4: STORE THE NEW COMBINED DATASET

write.csv(AltoonaCombinedData, file=”Altoona Combined Data.csv”)

OUTPUT (refreshing Files tab shows the newly created file, ‘Altoona Combined Data.csv’):

Note on R vs. RapidMiner:

You may have noticed that, compared to RapidMiner, in R we changed the order of steps (we first aggregate, then join the datasets). This order makes the code and execution simpler in R.

Challenges:

Use the outputs you just obtained to answer the following questions.

1.Let’s compare Altoona’s crime rates with the general population estimates over time. Use the code below to compare Adult.Ethnic.Hispanic to Pop.Ethnicity.Hispanic. Have the two series moved in similar directions over time, or not?

2.Repeat this for the other 6 corresponding pairs in ethnicity, race, and general total. As you repeat this
for pairs that represent the majority of our data (e.g. white criminals & white people, or total adult criminals & total population), do you see a stronger connection between the two series, or a weaker connection? In other words, as different population groups have gone down over time, have those groups’ crime numbers followed the downward trend or not? Have Altoona’s crime rates decreased, increased, or remained constant?

3. Repeat this for people in age groups 60-64, and 65+. For which of these 2 groups is it more visible
that the crimes series and population series seem to be moving in a similar direction?

# # # # STEP 5 (after STEPS 1-4 in this Module): COMPARE VARIABLES ON A PLOT

# Set plot’s margins to have enough room for dual axes.

par(mar = c(4.1, 4.1, 4.1, 4.1))

# Plot variables Month and Adult.Ethnic.Hispanic, with red dots,

# label “Month” on x axis, no label on y axis, and no numbers on y axis.

plot(AltoonaCombinedData$Month,

AltoonaCombinedData$Adult.Ethnic.Hispanic,

col=”red”,
xlab=”Month”, ylab=NA, yaxt=”n”)

# Add lines to the plot connecting the dots, color red.

lines(AltoonaCombinedData$Month,
AltoonaCombinedData$Adult.Ethnic.Hispanic,
col=”red”)

# Add new y axis on the right.

axis(4)

# Add label “Adult.Ethnic.Hispanic” to the new y axis.

mtext(“Adult.Ethnic.Hispanic”, 4, line = 2)

# Announce new plot on the same graph.

par(new=TRUE)

# Plot variables Month and Pop.Ethnicity.Hispanic, with green dots,

# label “Month” on x axis, label “Pop.Ethnicity.Hispanic on y axis.

plot(AltoonaCombinedData$Month,

AltoonaCombinedData$Pop.Ethnicity.Hispanic,

col=”green”,
xlab=”Month”, ylab=”Pop.Ethnicity.Hispanic“)

# Add lines to the plot connecting the dots, color green.

lines(AltoonaCombinedData$Month,

AltoonaCombinedData$Pop.Ethnicity.Hispanic,col=”green”)

# Add legend in the bottom left of the graph, size cex (can be 0.0-1.0).

legend(“bottomleft”,
legend=c(“Adult.Ethnic.Hispanic”, “Pop.Ethnicity.Hispanic”),

col=c(“red”, “green”), lty=1:2, cex=0.6)