What if you want a specific sample of your dataset, e.g. up to 500 observations for each value of the variable Sex (500 for Female and 500 for Male)? You can loop over the 2 sexes and sample the observations for each sex individually down to 500, or keep all of them if a sex has fewer than 500.
R Script (copy below code & paste into RStudio; do not copy the output results)
# # # # # NSF Project “Big Data Education” (Penn State University)
# # # # # More info: http://sites.psu.edu/bigdata/
# # # # # Lab 1: Altoona Crime Rates – Module 10: Looping
# # # # STEP 0: SET WORKING DIRECTORY
# Set working directory to a folder with the following file:
# ‘Lab 1 Data Altoona Crime Rates.csv’
# # # # STEP 1: READ IN THE DATA
AltoonaCrimeRates <- read.csv("Lab 1 Data Altoona Crime Rates.csv")
OUTPUT (a new data frame is created with 2326 observations of 39 variables):
# # # # STEP 2: SAMPLE DATA USING A LOOP
# Loop: For each value of Sex, create a separate dataframe, sample 500; then append.
# # # STEP 2 APPROACH 1: Using just basic R, no packages
# Create empty data frame (0 rows) ACRsample1 with same columns as AltoonaCrimeRates
ACRsample1 <- AltoonaCrimeRates[0,]
OUTPUT (a new data frame is created, also with 39 variables, but 0 observations):
# Loop says: "For i going through each unique value of variable Sex in AltoonaCrimeRates,
# (1) make x the subset of AltoonaCrimeRates for which Sex == i (e.g. Sex == "M"),
# (2) sample the rows of x down to 500 rows, or keep all nrow(x) rows
# if x has fewer than 500 rows,
# (3) append the current sample in x to ACRsample1".
for (i in unique(AltoonaCrimeRates$Sex)) {
  x <- subset(AltoonaCrimeRates, Sex == i)
  x <- x[sample(nrow(x), min(nrow(x), 500)), , drop = FALSE]
  ACRsample1 <- rbind(ACRsample1, x)
}
OUTPUT (Data frame ACRsample1 is created with 1000 observations of 39 variables, 500 observations for each of the 2 unique values of variable Sex – M, F.):
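(Also created in the loop are the loop variable i and the temporary data frame x, each holding its value from the last pass through the loop.)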
# Checking: does ACRsample1 have max 500 observations for each value of variable Sex?
table(ACRsample1$Sex)
OUTPUT (variable Sex in ACRsample1 indeed has 500 values F, and 500 values M):
  F   M
500 500
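To see what the min() in step (2) does, here is a minimal sketch of the same loop on a small made-up data frame. The names toy, cap and toySample are hypothetical and not part of the lab; the cap of 5 stands in for 500, and set.seed() is added only so that the random draw can be repeated.

```r
# Toy data (hypothetical): 7 rows with Sex = "F", 3 rows with Sex = "M"
toy <- data.frame(id = 1:10,
                  Sex = c(rep("F", 7), rep("M", 3)))

set.seed(42)  # fix the random draw so the sample is reproducible
cap <- 5      # stand-in for the 500 used in the lab

toySample <- toy[0, ]  # empty data frame with the same columns as toy
for (i in unique(toy$Sex)) {
  x <- toy[toy$Sex == i, , drop = FALSE]
  # min() keeps sample() from asking for more rows than the group has
  x <- x[sample(nrow(x), min(nrow(x), cap)), , drop = FALSE]
  toySample <- rbind(toySample, x)
}

table(toySample$Sex)  # F is capped at 5 rows, M keeps all 3 of its rows
```

Because M has only 3 rows, min(nrow(x), cap) asks for 3 rows rather than 5, so the loop does not fail on small groups.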
# # # STEP 2 APPROACH 2: Using package "dplyr"
# Load package "dplyr" from the library.
# If package is not loading, first install it using the following command:
# install.packages("dplyr")
# If package is loading, but with red messages (e.g. about masked objects),
# these are usually notes rather than errors; keep going.
library(dplyr)
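# Pipeline says: "Create ACRsample2 from AltoonaCrimeRates, where
# (1st %>%) you group AltoonaCrimeRates by the different values of variable Sex,
# (2nd %>%) and then within each group sample 500 observations".
# Note: each %>% must end a line, not start the next one; otherwise R treats
# the first line as a complete statement and the second line fails.
ACRsample2 <- AltoonaCrimeRates %>%
  group_by(Sex) %>%
  sample_n(size = 500)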
OUTPUT (Data frame ACRsample2 is created with 1000 observations of 39 variables, 500 observations for each of the 2 unique values of variable Sex – M, F.):
(Nothing else is created by the pipeline.)
# Checking: does ACRsample2 have max 500 observations for each value of variable Sex?
table(ACRsample2$Sex)
OUTPUT (variable Sex in ACRsample2 indeed has 500 values F, and 500 values M):
  F   M
500 500
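The group-then-sample idea behind Approach 2 can also be sketched in basic R without an explicit for loop, using split(), lapply() and do.call(rbind, ...). This is only an illustration on made-up data, not the lab's method; toy, cap, groups, sampled and toySample2 are hypothetical names.

```r
# Toy data (hypothetical): 7 rows with Sex = "F", 3 rows with Sex = "M"
toy <- data.frame(id = 1:10,
                  Sex = c(rep("F", 7), rep("M", 3)))

set.seed(1)  # fix the random draw so the sample is reproducible
cap <- 3     # stand-in for the 500 used in the lab

# Split: one data frame per value of Sex
groups <- split(toy, toy$Sex)

# Apply: sample up to 'cap' rows within each group
sampled <- lapply(groups, function(g) {
  g[sample(nrow(g), min(nrow(g), cap)), , drop = FALSE]
})

# Combine: stack the per-group samples back into one data frame
toySample2 <- do.call(rbind, sampled)

table(toySample2$Sex)  # 3 rows for F and 3 rows for M
```

As in the dplyr pipeline, no helper variables such as i or x are left behind, apart from the intermediate lists created here for readability.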
# Compare the samples obtained from 2 approaches – ACRsample1, and ACRsample2
View(ACRsample1)
OUTPUT (ACRsample1 table shows row numbers from the original data set, AltoonaCrimeRates, and starts with 500 observations where Sex=M):
View(ACRsample2)
OUTPUT (ACRsample2 table shows newly defined row numbers, and starts with 500 observations where Sex=F):
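The difference in row numbers comes from how basic R subsetting treats row names: rows picked out of a data frame keep the row names they had in the original. A minimal sketch on made-up data (toy, picked and renumbered are hypothetical names) shows this, and how the row names could be reset if newly defined numbering is preferred:

```r
# Toy data (hypothetical) with default row names "1" to "10"
toy <- data.frame(id = 101:110,
                  Sex = c(rep("F", 7), rep("M", 3)))

set.seed(7)  # fix the random draw so the sample is reproducible
picked <- toy[sample(nrow(toy), 4), , drop = FALSE]

# Row names still point back to positions in the original data frame
rownames(picked)

# Setting row names to NULL renumbers the rows from 1
renumbered <- picked
rownames(renumbered) <- NULL
rownames(renumbered)
```

Keeping the original row names can be handy for tracing a sampled row back to AltoonaCrimeRates, so resetting them is optional.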
Challenges:
- How big is your resulting dataset? How many observations does it have? Does variable Sex have more than 500 observations for either its F or M value?
- How should you change the code so that 1000 observations, instead of 500, are kept from each sex?
Hint #1: make sure to clear your RStudio memory before running the code again. To clear memory, click the little broom symbol in the Environment tab and confirm that you want to clear all objects from the environment.
Hint #2: one of the 2 approaches will not work when keeping 1000 observations from each sex. Which one works, and which one does not?
- How should you change the code so that, instead of sampling by variable Sex, you keep 20 observations from each value of variable Offense Code? How many observations does the resulting dataset have? Hints #1 and #2 from above apply again.