• Log In
 Visit the Pennsylvania State University Home Page

Data Science Tools

  • Home
  • About
  • Data Exploration
    • Lab 1 RapidMiner Modules
      • RM Module 1: Accessing Data
      • RM Module 2: Filtering & Sorting
      • RM Module 3: Merging & Grouping
      • RM Module 4: Creating & Removing Columns
      • RM Module 5: Changing Types & Roles for Modeling
      • RM Module 6: Normalization & Detecting Outliers
      • RM Module 7: Pivoting & Advanced Renaming
      • RM Module 8: Handling Missing Values
      • RM Module 9: Macros & Sampling
      • RM Module 10: Looping & Branching
    • Lab 1 Tableau Modules
      • T Module 1: Accessing Data
      • T Module 2: Filtering & Sorting
      • T Module 3: Merging & Grouping
      • T Module 4: Creating & Hiding Columns
      • T Module 5: Predictive Modeling
    • Lab 1 R Modules
      • R Module 1: Accessing Data
      • R Module 2: Filtering & Sorting
      • R Module 3: Merging & Grouping
      • R Module 4: Creating & Removing Columns
      • R Module 5: Predictive Modeling
      • R Module 6: Normalization & Detecting Outliers
      • R Module 7: Pivoting
      • R Module 8: Handling Missing Values
      • R Module 9: Sampling
      • R Module 10: Looping
  • Machine Learning
  • Data Cleaning
  • Text Analysis
  • Help

R Module 2: Filtering & Sorting

[TO BE UPDATED SOON!]

In the last tutorial, we learned how to retrieve your data, now it is time to learn how to use that dataset and manipulate it so you can actually understand what it is saying. In this tutorial, you we will apply a filter to
the Altoona crime rates data to only look at the most common crimes committed by juvenile offenders.

 

R Script (copy below code & paste into RStudio; do not copy the output results)

 

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 2: Filtering & Sorting 

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# ‘Lab 1 Data Altoona Crime Rates.csv’ 

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRates <- read.csv(“Lab 1 Data Altoona Crime Rates.csv”, sep=”,”, header=TRUE)

 

OUTPUT (a new data frame
AltoonaCrimeRates is created, with 2326 observations of 39 variables):

 

 

 

# Checking: See the read in data

View(AltoonaCrimeRates)

 

OUTPUT (the table pops out):

 

 

 

# # # # STEP 2: FILTER DOWN TO THE DATA OF INTEREST

# Keep only rows where Juvenile.Total>0

# Below code says “define AltoonaCrimeRates as AltoonaCrimeRates where

# the rows satisfy condition that Juvenile.Total is greater than 0″

# (because with [,] the condition before comma applies to rows,

# and condition after comma would have applied to columns, but there is none here).

AltoonaCrimeRates <- AltoonaCrimeRates[AltoonaCrimeRates$Juvenile.Total > 0, ]

 

OUTPUT (data frame
AltoonaCrimeRates reduces from 2326 to 798 observations for which variable
Juvenile.Total is greater than zero):

 

 

 

# Checking: Have the zeros under Juvenile.Total been filtered out?

View(AltoonaCrimeRates)

 

OUTPUT (the table pops out, column Juvenile.Total now has no
rows with zeros):

 

 

 

# # # # STEP 3: SORT DATA BY THE COLUMN OF INTEREST

# Sort data by Juvenile.Total

# Below code says “show first 5 rows (head & n=5) of AltoonaCrimeRates, where

# its rows are ordered by Juvenile.Total in a decreasing order, and

# its columns to keep are only Offense.Code, Juvenile.Total, Sex, and Month”.

head(AltoonaCrimeRates

     [order(AltoonaCrimeRates$Juvenile.Total,
decreasing=TRUE),

       c(“Offense.Code”, “Juvenile.Total”,
“Sex”, “Month”)],

     n=5)

 

OUTPUT:

 


Offense.Code Juvenile.Total Sex     Month

1201    140-Vandalism            26   M  5/1/2015

1202    140-Vandalism            17   F  5/1/2015

1452     290-Runaways            14   F 10/1/2015

295 060-Larceny-Theft            12   M  8/1/2013

710      290-Runaways            12   M  5/1/2014

508 060-Larceny-Theft            11   M  1/1/2014

 

 

# # # # STEP 4: AGGREGATE DATA TO GET MORE INFORMATION OF INTEREST

# Collapse data by Offense.Code (one row per crime type)

# Below code says “aggregate AltoonaCrimeRates by Offense.Code

# and apply summation”.

AltoonaCrimeRates <-


aggregate(. ~ AltoonaCrimeRates$Offense.Code,


data = AltoonaCrimeRates,


FUN = sum)

 

OUTPUT (data frame AltoonaCrimeRates reduces from 798 to 32 observations, one for each different value of Offense.Code):

 

 

 

# Checking: Is the table collapsed by Offense.Code?

View(AltoonaCrimeRates)

 

OUTPUT (the table pops out, with a new column
AltoonaCrimeRates$Offense.Code under which each offense shows up only once,
and with columns Month, Offense.Code, and Sex unnecessarily aggregated):

 

 

 

# Remove columns which no longer make sense

# (Columns Month, Offense.Code, and Sex are respectively columns 2-4)

AltoonaCrimeRates <- AltoonaCrimeRates[ ,-c(2:4)]

 

# Checking: Have the columns been removed?

View(AltoonaCrimeRates)

 

OUTPUT (the table pops out, with the 3 columns removed):

 

 

 

# Rename’AltoonaCrimeRates$Offense.Code’ into ‘Offense.Code’.

names(AltoonaCrimeRates)[names(AltoonaCrimeRates)==’AltoonaCrimeRates$Offense.Code’]
<- ‘Offense.Code’

 

# Checking: Has thecolumn been renamed?

View(AltoonaCrimeRates)

 

OUTPUT (the table pops out, with first column renamed to
“Offense.Code”):

 

 

 

# Look at the resulting dataset, again ordered by Total.Juvenile

head(AltoonaCrimeRates

     [order(AltoonaCrimeRates$Juvenile.Total,
decreasing=TRUE),

       c(“Offense.Code”,”Juvenile.Total”)],

     n=5)

 

OUTPUT (data frame
AltoonaCrimeRates reduces from 2326 to 798 observations for which variable
Juvenile.Total is greater than zero):


Offense.Code Juvenile.Total

32                           290-Runaways           455

8     080-Other Assaults – Not Aggravated           340

6                       060-Larceny-Theft           258

26                        220-Liquor Law           199

28                 240-Disorderly Conduct           189

30 260-All Other Offenses (Except Traffic)            137

 

 

 

Challenges:

 

Use the outputs you just obtained to answer the following questions.

 

  1. Looking at the resulting data, what is the most common crime that male juvenile offenders committed in one month? What about the female juvenile offenders? Hint: to answer for female offenders, remember you can change in the “head” command how many top rows you see (from n=5 to e.g. n=50 or n=100).
  2. In what month did both of the above common crimes occur?
  3. Looking at the resulting data, what is the most common crime that juvenile offenders, regardless of sex, committed across all months? How many times was it committed?
  4. Change the code so it looks at adult offenders rather than juvenile offenders, and answer questions 1-3 for them.

 

 

 

Next Page: R Module 3: Merging & GroupingPrevious Page: R Module 1: Accessing Data

Follow me on Twitter

My Tweets
 Visit the Pennsylvania State University Home Page
Copyright 2025 © The Pennsylvania State University Privacy Non-Discrimination Equal Opportunity Accessibility Legal