 Visit the Pennsylvania State University Home Page

Data Science Tools


R Module 9: Sampling

[TO BE UPDATED SOON!]

Suppose you want to send someone a smaller version of your dataset, e.g. 50% of the original. You can draw a random sample of its rows to reduce the dataset to that size.

 

R Script (copy the code below and paste it into RStudio; do not copy the output results)

 

# # # # # NSF Project “Big Data Education” (Penn State University)

# # # # # More info: http://sites.psu.edu/bigdata/

# # # # # Lab 1: Altoona Crime Rates – Module 9: Sampling

 

# # # # STEP 0: SET WORKING DIRECTORY

# Set working directory to a folder with the following file:

# ‘Lab 1 Data Altoona Crime Rates.csv’

 

# # # # STEP 1: READ IN THE DATA

AltoonaCrimeRates <- read.csv("Lab 1 Data Altoona Crime Rates.csv", sep=",", header=TRUE)

 

OUTPUT (a new data frame is created with 2326 observations of 39 variables):

 

 

 

# # # # STEP 2: GET 50% OF THE DATA SET

# Define number as 0.5 times the number of rows (nrow) in AltoonaCrimeRates.

number <- 0.5*nrow(AltoonaCrimeRates)

 

OUTPUT (a new value “number” is created, equal to 1163):
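The 50% case gives a whole number here because the data set has an even number of rows. For other fractions the product is usually not a whole number, so it may be clearer to round it explicitly before sampling. The following is a minimal sketch, assuming R's built-in mtcars data set (32 rows) as a stand-in for the crime data; the names fraction, raw_size and n_sample are illustrative and not part of the original script.

```r
# Stand-in data: mtcars ships with R and has 32 rows.
fraction <- 0.3
raw_size <- fraction * nrow(mtcars)   # 9.6, not a whole number
n_sample <- round(raw_size)           # rounded to 10 rows

# Draw n_sample distinct row positions and keep only those rows.
sampled <- mtcars[sample(nrow(mtcars), n_sample), ]
nrow(sampled)                         # 10
```

Rounding up front makes the resulting sample size explicit rather than leaving it to how the sampling function handles a fractional size.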

 

 

 

# # # # STEP 3: RANDOMLY SAMPLE THE DATA SET DOWN TO 50%

AltoonaCrimeRates <- AltoonaCrimeRates[sample(nrow(AltoonaCrimeRates), number), ]

 

OUTPUT (the AltoonaCrimeRates data frame is reduced to its new sample size of 1163 observations):
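Because sample() draws rows at random, each run of Step 3 returns a different subset, and assigning the result back to AltoonaCrimeRates replaces the full data set in memory. If a repeatable sample is needed, or the original should stay available, something like the following could be used. This is a sketch on the built-in iris data (150 rows) rather than the crime data; the seed value 42 and the names iris_sample, rows_first and rows_second are arbitrary choices for illustration.

```r
set.seed(42)                          # fix the random number generator so the draw is repeatable
n_sample <- round(0.5 * nrow(iris))   # 75
rows_first <- sample(nrow(iris), n_sample)
iris_sample <- iris[rows_first, ]     # stored under a new name; iris itself is unchanged

set.seed(42)                          # same seed, so the same rows are drawn again
rows_second <- sample(nrow(iris), n_sample)

identical(rows_first, rows_second)    # TRUE
nrow(iris)                            # still 150
nrow(iris_sample)                     # 75
```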

 

 

 

# Checking: View the new sampled data set

View(AltoonaCrimeRates)

 

OUTPUT (the sample keeps the row numbers from the original data set, so they appear out of order, e.g. row 264, then 701, then 1729, etc.):
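The row labels shown in the viewer are the row names carried over from the original data frame. They can be used to trace each sampled row back to its source, or reset if a clean 1 to n numbering is preferred. A small sketch on the built-in iris data, with an arbitrary seed and illustrative variable names:

```r
set.seed(1)
iris_sample <- iris[sample(nrow(iris), 5), ]

# Row names still refer to positions in the original data frame.
original_rows <- as.integer(rownames(iris_sample))
print(original_rows)

# The sampled rows match the original rows they were drawn from.
all.equal(iris_sample, iris[original_rows, ])   # TRUE

# Optional: renumber the sample 1..5 (this drops the link to the original rows).
renumbered <- iris_sample
rownames(renumbered) <- NULL
rownames(renumbered)                            # "1" "2" "3" "4" "5"
```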

 

 

 

 

 

Challenges:

 

  1. Try to build samples of 30% and 80% of the original size. What do you need to change? What are the resulting sample sizes?
  2. Replace the input data set with a different one. Do you need to change anything else, or will the script run just fine? Try it!

 

 

 


Copyright 2025 © The Pennsylvania State University Privacy Non-Discrimination Equal Opportunity Accessibility Legal