R Module 2: Filtering & Sorting

[TO BE UPDATED SOON!]

In the last tutorial, we learned how to retrieve your data, now it is time to learn how to use that dataset and manipulate it so you can actually understand what it is saying. In this tutorial, you we will apply a filter to
the Altoona crime rates data to only look at the most common crimes committed by juvenile offenders.

R Script (copy below code & paste into RStudio; do not copy the output results)

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 2: Filtering & Sorting

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# ‘Lab 1 Data Altoona Crime Rates.csv’

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRates <- read.csv(“Lab 1 Data Altoona Crime Rates.csv”, sep=”,”, header=TRUE)

OUTPUT (a new data frame
AltoonaCrimeRates is created, with 2326 observations of 39 variables):

# Checking: See the read in data

View(AltoonaCrimeRates)

OUTPUT (the table pops out):

# # # # STEP 2: FILTER DOWN TO THE DATA OF INTEREST

# Keep only rows where Juvenile.Total>0

# Below code says “define AltoonaCrimeRates as AltoonaCrimeRates where

# the rows satisfy condition that Juvenile.Total is greater than 0″

# (because with [,] the condition before comma applies to rows,

# and condition after comma would have applied to columns, but there is none here).

AltoonaCrimeRates <- AltoonaCrimeRates[AltoonaCrimeRates$Juvenile.Total > 0, ]

OUTPUT (data frame
AltoonaCrimeRates reduces from 2326 to 798 observations for which variable
Juvenile.Total is greater than zero):

# Checking: Have the zeros under Juvenile.Total been filtered out?

View(AltoonaCrimeRates)

OUTPUT (the table pops out, column Juvenile.Total now has no
rows with zeros):

# # # # STEP 3: SORT DATA BY THE COLUMN OF INTEREST

# Sort data by Juvenile.Total

# Below code says “show first 5 rows (head & n=5) of AltoonaCrimeRates, where

# its rows are ordered by Juvenile.Total in a decreasing order, and

# its columns to keep are only Offense.Code, Juvenile.Total, Sex, and Month”.

head(AltoonaCrimeRates

[order(AltoonaCrimeRates$Juvenile.Total,
decreasing=TRUE),

c(“Offense.Code”, “Juvenile.Total”,
“Sex”, “Month”)],

n=5)

OUTPUT:

Offense.Code Juvenile.Total Sex Month

1201 140-Vandalism 26 M 5/1/2015

1202 140-Vandalism 17 F 5/1/2015

1452 290-Runaways 14 F 10/1/2015

295 060-Larceny-Theft 12 M 8/1/2013

710 290-Runaways 12 M 5/1/2014

508 060-Larceny-Theft 11 M 1/1/2014

# # # # STEP 4: AGGREGATE DATA TO GET MORE INFORMATION OF INTEREST

# Collapse data by Offense.Code (one row per crime type)

# Below code says “aggregate AltoonaCrimeRates by Offense.Code

# and apply summation”.

AltoonaCrimeRates <-

aggregate(. ~ AltoonaCrimeRates$Offense.Code,

data = AltoonaCrimeRates,

FUN = sum)

OUTPUT (data frame AltoonaCrimeRates reduces from 798 to 32 observations, one for each different value of Offense.Code):

# Checking: Is the table collapsed by Offense.Code?

View(AltoonaCrimeRates)

OUTPUT (the table pops out, with a new column
AltoonaCrimeRates$Offense.Code under which each offense shows up only once,
and with columns Month, Offense.Code, and Sex unnecessarily aggregated):

# Remove columns which no longer make sense

# (Columns Month, Offense.Code, and Sex are respectively columns 2-4)

AltoonaCrimeRates <- AltoonaCrimeRates[ ,-c(2:4)]

# Checking: Have the columns been removed?

View(AltoonaCrimeRates)

OUTPUT (the table pops out, with the 3 columns removed):

# Rename’AltoonaCrimeRates$Offense.Code’ into ‘Offense.Code’.

names(AltoonaCrimeRates)[names(AltoonaCrimeRates)==’AltoonaCrimeRates$Offense.Code’]
<- ‘Offense.Code’

# Checking: Has thecolumn been renamed?

View(AltoonaCrimeRates)

OUTPUT (the table pops out, with first column renamed to
“Offense.Code”):

# Look at the resulting dataset, again ordered by Total.Juvenile

head(AltoonaCrimeRates

[order(AltoonaCrimeRates$Juvenile.Total,
decreasing=TRUE),

c(“Offense.Code”,”Juvenile.Total”)],

n=5)

OUTPUT (data frame
AltoonaCrimeRates reduces from 2326 to 798 observations for which variable
Juvenile.Total is greater than zero):

Offense.Code Juvenile.Total

32 290-Runaways 455

8 080-Other Assaults – Not Aggravated 340

6 060-Larceny-Theft 258

26 220-Liquor Law 199

28 240-Disorderly Conduct 189

30 260-All Other Offenses (Except Traffic) 137

Challenges:

Use the outputs you just obtained to answer the following questions.

Looking at the resulting data, what is the most common crime that male juvenile offenders committed in one month? What about the female juvenile offenders? Hint: to answer for female offenders, remember you can change in the “head” command how many top rows you see (from n=5 to e.g. n=50 or n=100).
In what month did both of the above common crimes occur?
Looking at the resulting data, what is the most common crime that juvenile offenders, regardless of sex, committed across all months? How many times was it committed?
Change the code so it looks at adult offenders rather than juvenile offenders, and answer questions 1-3 for them.