R Module 9: Sampling

What if you wanted to send someone a smaller version of your dataset, e.g. 50% of the original rows? You can randomly sample the original dataset down to this new size.

R Script (copy the code below and paste it into RStudio; do not copy the output results)

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 9: Sampling

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# 'Lab 1 Data Altoona Crime Rates.csv'

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRates <- read.csv("Lab 1 Data Altoona Crime Rates.csv", sep=",", header=TRUE)

OUTPUT (a new data frame is created with 2326 observations of 39 variables):

# # # # STEP 2: GET 50% OF THE DATA SET

# Define number as 0.5 times the number of rows (nrow) in AltoonaCrimeRates,

# rounded to a whole number so that it is a valid sample size.

number <- round(0.5*nrow(AltoonaCrimeRates))

OUTPUT (a new value “number” is created, equal to 1163):

# # # # STEP 3: RANDOMLY SAMPLE THE DATA SET DOWN TO 50%

AltoonaCrimeRates <- AltoonaCrimeRates[sample(nrow(AltoonaCrimeRates), number), ]

OUTPUT (the AltoonaCrimeRates data frame is reduced to its new sample size of 1163 observations):

# Checking: View the new sampled data set

View(AltoonaCrimeRates)

OUTPUT (the row names in the sample still show the row numbers from the original data set, e.g. row 264, then 701, then 1729, etc.):
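
Because sample() draws rows at random, every run of the script produces a different sample, and STEP 3 overwrites the original data frame. If you need a sample that someone else can reproduce, or want to keep the full data set in memory, one possible variation is sketched below. It is not part of the lab script. The data frame CrimeData is a made-up stand-in with the same number of rows as the lab data, the name CrimeSample is hypothetical, and the seed value 123 is arbitrary.

```r
# Stand-in for the lab data (hypothetical): 2326 rows, like AltoonaCrimeRates.
# In the lab itself, use the data frame read in STEP 1 instead.
CrimeData <- data.frame(id = 1:2326, value = sqrt(1:2326))

# Fix the random seed so that the same rows are drawn on every run.
# The seed value itself is arbitrary.
set.seed(123)

# Sample size: 50% of the rows, rounded to a whole number.
number <- round(0.5 * nrow(CrimeData))

# Store the sample under a new name so that the original data frame is kept.
CrimeSample <- CrimeData[sample(nrow(CrimeData), number), ]

nrow(CrimeData)    # the original still has 2326 rows
nrow(CrimeSample)  # the sample has 1163 rows
```

Running the block again with the same seed draws the same rows. Changing or removing set.seed() gives a different random sample.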

Challenges:

Try to build samples of 30% and 80% of the original size. What do you need to change? What are the resulting sample sizes?
Replace the input data set with a different dataset. Do you need to change anything else, or will the script run as is? Try it!
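
One way to approach both challenges is to wrap STEP 2 and STEP 3 in a small function that takes any data frame and any fraction. The sketch below is only an illustration. The function name sample_fraction is hypothetical, and it uses iris, a 150-row data set that ships with R, rather than the lab file, so the sizes for the Altoona data are still left for you to work out.

```r
# Hypothetical helper: return a random sample of the rows of a data frame.
# 'fraction' is the share of rows to keep, e.g. 0.3 for 30%.
sample_fraction <- function(data, fraction) {
  # Round so that the sample size is a whole number of rows.
  number <- round(fraction * nrow(data))
  # drop = FALSE keeps the result a data frame even with a single column.
  data[sample(nrow(data), number), , drop = FALSE]
}

# iris ships with R and has 150 rows; it stands in for "a different dataset".
set.seed(42)  # arbitrary seed, only to make the example repeatable
iris30 <- sample_fraction(iris, 0.3)
iris80 <- sample_fraction(iris, 0.8)

nrow(iris30)  # 45
nrow(iris80)  # 120
```

Only the data frame and the fraction change between calls; nothing else in the function depends on the Altoona data.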