In this tutorial, we prepare data for modeling by discretizing the dependent variable, i.e., putting its values into bins. We then predict the profile of Altoona offenders most likely to commit a crime based on sex, age, race, and
ethnicity. In our modeling, we use a classic machine learning method called a decision tree.
R Script (copy the code below and paste it into RStudio; do not copy the output results)
# # # # # NSF Project "Big Data Education" (Penn State University)
# # # # # More info: http://sites.psu.edu/bigdata/
# # # # # Lab 1: Altoona Crime Rates – Module 5: Modeling
# # # # STEP 0: SET WORKING DIRECTORY
# Set working directory to a folder with the following file:
# 'Lab 1 Data Altoona Crime Rates.csv'
# # # # STEP 1: READ IN THE DATA
AltoonaCrimeRates <- read.csv("Lab 1 Data Altoona Crime Rates.csv")
OUTPUT (a new data frame is created in Global Environment with 2326 observations of 39 variables):
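Before moving on, it can help to confirm the import worked. The following optional check (not part of the original script) prints the dimensions and the variable types of the new data frame:

```r
# Optional check: confirm the data frame loaded correctly.
dim(AltoonaCrimeRates)   # number of rows and columns
str(AltoonaCrimeRates)   # lists each variable with its type and first values
```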
# # # # STEP 2: DISCRETIZE THE DEPENDENT VARIABLE BY BINNING
# Load package "rattle" from the library:
# Packages are like mini-programs developed by other R users.
# They can turn numerous lines of code into a single command line,
# as you will see below:
# Package "rattle" is needed to use the "binning" function in the next command.
library(rattle)
# If the package does not load (shows an error), first install it using the following command:
# install.packages("rattle") # Remove the first hash sign (#) in this line to use the code.
# If installing, make sure to run the above "library" command afterward.
# Discretize the variable Adult.Total into 2 bins (2 intervals covering all its values).
# The code below says: "create a new column in AltoonaCrimeRates called
# Adult.Total.Binned, which puts each value of Adult.Total into one of 2 bins".
AltoonaCrimeRates$Adult.Total.Binned <- binning(AltoonaCrimeRates$Adult.Total,
2)
OUTPUT (data frame AltoonaCrimeRates increases from 39 to 40 variables):
# See the range of your 2 created bins.
summary(AltoonaCrimeRates$Adult.Total.Binned)
OUTPUT:
 [0,3] (3,73]
  1195   1131
INTERPRETATION: Adult.Total.Binned has 2 bins: bin 1, covering values 0 to 3, has 1195 observations; bin 2, covering values above 3 up to 73, has 1131 observations.
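For comparison, base R can create bins without any extra package using cut(). This is only an optional illustration; note that cut() produces equal-width intervals by default, whereas rattle's binning() defaults to quantile-based bins, so the resulting bin boundaries and counts may differ from the output above:

```r
# Optional illustration: equal-width binning with base R's cut().
# table() then counts how many observations land in each bin.
table(cut(AltoonaCrimeRates$Adult.Total, breaks=2))
```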
# # # # STEP 3: MODEL A DECISION TREE
# Load package "rpart" from the library:
library(rpart)
# If the package does not load (shows an error), first install it using the following command:
# install.packages("rpart")
# Remove the first hash sign (#) in this line to use the code.
# If installing, make sure to run the above "library" command afterward.
# Create a decision tree with Adult.Total.Binned as the dependent variable.
# The code below says: "Define a decision tree called Tree using rpart so that
# variable Adult.Total.Binned is the dependent variable,
# other adult variables (from Age.18 to Adult.Ethnic.Non.Hispanic) are independent,
# the method of the decision tree is a regression tree ("anova"),
# the dataset used is AltoonaCrimeRates,
# and in other control settings the minimum split is 4,
# the minimum bucket (i.e., bin) size is 1 observation, and the complexity parameter (cp) is 0.002".
# (cp is the minimum improvement needed for the tree to add another node/branch.)
Tree <- rpart(Adult.Total.Binned ~
                Age.18 + Age.19 + Age.20 + Age.21 +
                Age.22 + Age.23 + Age.24 + Age.25.29 +
                Age.30.34 + Age.35.39 + Age.40.44 + Age.45.49 +
                Age.50.54 + Age.55.59 + Age.60.64 + Age.65. +
                Adult.Race.White + Adult.Race.Black + Adult.Race.Native.American +
                Adult.Race.Asian.Pacific + Adult.Ethnic.Hispanic + Adult.Ethnic.Non.Hispanic,
              method="anova", data=AltoonaCrimeRates,
              control=rpart.control(minsplit=4, minbucket=1, cp=0.002))
OUTPUT (a new decision tree called “Tree” is created in Global Environment):
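Before plotting, you can optionally examine the fitted tree in the console (this step is not in the original script; print and printcp are standard rpart functions):

```r
# Optional: inspect the fitted tree as text before plotting it.
print(Tree)     # each split, with the number of observations per node
printcp(Tree)   # complexity parameter table: error at each tree size
```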
# # # # STEP 4: VISUALIZE THE DECISION TREE (BASIC)
# Plot decision tree (basic).
plot(Tree, uniform=TRUE,
     main="Decision Tree for Adult Total")
OUTPUT (a new decision tree is plotted in the Plots tab, without labels):
# Add text to the decision tree (basic).
text(Tree, use.n=TRUE,
all=TRUE, cex=.8)
OUTPUT (the plotted decision tree gets labels, but they are partly illegible):
# # # # STEP 5: VISUALIZE THE DECISION TREE (FANCY USING fancyRpartPlot)
# Load package “rpart.plot” from the library:
library(rpart.plot)
# If the package does not load (shows an error), first install it using the following command:
# install.packages("rpart.plot")
# Remove the first hash sign (#) in this line to use the code.
# If installing, make sure to run the above "library" command afterward.
# Load package “RColorBrewer” from the library:
library(RColorBrewer)
# If the package does not load (shows an error), first install it using the following command:
# install.packages("RColorBrewer")
# Remove the first hash sign (#) in this line to use the code.
# If installing, make sure to run the above "library" command afterward.
# Plot decision tree (fancy).
fancyRpartPlot(Tree)
OUTPUT (a new decision tree is plotted in the Plots tab, much more legible than the basic plot):
INTERPRETATION: Starting from the top, each node shows the predicted value of the dependent variable for the observations reaching it, together with how many observations fall into that node. Each split is labeled with the condition used, and following the branches downward leads to the bottom leaves, which give the model's final predictions. The dependent variable's predicted bin for any given profile can be read off by tracing a path from the root to a leaf.
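As an optional extension beyond the original script, the tree can be pruned back to the complexity parameter (cp) with the lowest cross-validated error and then re-plotted. This is a standard rpart idiom for simplifying a tree that has grown too detailed:

```r
# Optional: prune the tree at the cp value with the lowest
# cross-validated error (column "xerror" of the cp table), then re-plot.
BestCP <- Tree$cptable[which.min(Tree$cptable[, "xerror"]), "CP"]
PrunedTree <- prune(Tree, cp=BestCP)
fancyRpartPlot(PrunedTree)
```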
Challenges:
- What happens to the model if you discretize the dependent variable into 3 bins instead of 2? Does the decision tree get more branches, or fewer, than the current one?
- What happens to the model if you keep discretization at 2 bins, but when selecting independent variables for the decision tree you drop from the selection all those starting with Adult…, so that the selected variables are just the different adult age ranges? Does the decision tree get more branches, or fewer, than the current one? Important: of course, do not delete the dependent variable (Adult.Total.Binned) from the model.
- What if instead of predicting adult crimes, you want to predict juvenile crimes, using Juvenile.Total as the dependent variable? How would you change the whole model? What do you find to be the main determinants of juvenile crime in Altoona? Is the conclusion similar to adult crimes or not? Once you have changed the code, keep executing all commands even if basic plotting (STEP 4) gives you a warning message.