R Module 7: Pivoting

[TO BE UPDATED SOON!]

In this tutorial, we pivot the Altoona crime data from long table to wide table format. Pivoting rotates the data from a long table format (one column with a lot of rows) into a wide table format (lots of columns with a single row). This transformation is especially useful after an aggregation since the aggregated information is always stored in a long format, while machine learning models often need the data to be stored in the wide format.

For our example, we use pivoting to show how many months in our dataset each sex committed each type of crime, and whether both sexes commit all types of crime or not.

R Script (copy below code & paste into RStudio; do not copy the output results)

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 7: Pivoting & Advanced Renaming

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# ‘Lab 1 Data Altoona Crime Rates.csv’

# # # # STEP 1: READ IN THE DATA

# Read in the data set we will be using.

AltoonaCrimeRates <- read.csv(“Lab 1 Data Altoona Crime Rates.csv”)

OUTPUT (a new data frame is created with 2326 observations of 39 variables):

# # # # STEP 2: AGGREGATE
THE DATA BY COLUMNS OF INTEREST

# Load package “plyr” from the library.

# If package is not loading, first install it using the following command:

# install.packages(“plyr”)

library(plyr)

# Aggregate AltoonaCrimeRates by columns Offense.Code and Sex, counting rows (nrow).

AltoonaCrimeRates <- ddply(AltoonaCrimeRates, .(Offense.Code, Sex), nrow)

OUTPUT (data frame reduces from 2326 observations of 39 variables to 67 observations of 3 variables):

# Checking: See the aggregated data set.

View(AltoonaCrimeRates)

OUTPUT (Aggregation kept only our 3 variables of interest, Offense.Code, Sex, and new variable V1 which counts the number of rows for each combination of Offense.Code and Sex):

# # # # STEP 3: PIVOT THE DATA FROM LONG TO WIDE FORMAT

# Load package “reshape” from the library.

# If package is not loading, first install it using the following command:

# install.packages(“reshape”)

library(reshape)

# Pivot AltoonaCrimeRates as “Offense.Code” by “Sex”, use column V1 as value column.

# (Value column gives values for new columns to be formed in the pivot table.)

AltoonaCrimeRates <- cast(AltoonaCrimeRates, Offense.Code ~ Sex, value=’V1′)

OUTPUT (data frame reduces from 67 to 35 observations of 3 variables):

# Checking: See the pivot table.

View(AltoonaCrimeRates)

OUTPUT (Pivoting created 2 new columns, one for each value of Sex, and filled their rows with values from the value column V1):

# # # # STEP 4: STORE THE NEW PIVOT TABLE

write.csv(AltoonaCrimeRates, file=”Altoona Crime
Rates by Sex.csv”)

Challenges:

Looking at the resulting table, is there a crime that was never committed by females? What about the males? A missing value (“NA”) means there were no data points there before pivoting.
What crime(s) were committed least number of months by males? To answer, you can sort the table, or use the code below.

head(AltoonaCrimeRates

[order(AltoonaCrimeRates$M,
decreasing=FALSE), ],

n=5)

Who committed crimes more months on average, females or males? To answer, you can use the summary command covered in Module 1, as follows:

summary(AltoonaCrimeRates)