 Visit the Pennsylvania State University Home Page

Data Science Tools


RapidMiner Module 8: Handling Missing Values

1/4

Check the metadata for attributes with missing values.

 

Perform data cleansing to achieve higher data quality.

 

 

ACTIVITY

 

 

 

 

  1. Drag the newly stored Altoona Crime Rates by Sex into the Process.
  2. Click the newly created Retrieve Altoona Crime Rates by Sex operator and hover the mouse over its output port. Wait for a small window to pop up showing some metadata about the dataset, then press F3 to pop out the window. Note that the metadata table reports 3 missing values for Sex F.

                                                                                     

 

 

EXPLANATION

 

 

 

 

Missing values can be a problem because they distort the computer’s data analysis. In our case, we know that missing values mean there were 0 months in the dataset for a given crime type committed by female offenders. But the computer does not know that. Hence, when calculating the average number of months for all crimes with female offenders, the computer excludes these crime types with missing values from the calculation. As a result, the average is incorrect. We can fix this by replacing missing values with zeros.
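The effect described above can be reproduced outside RapidMiner with a few lines of Python. This is only a sketch: the numbers are made up for illustration and are not taken from the Altoona dataset, and `None` stands in for a missing value.

```python
from statistics import mean

# Hypothetical values for illustration only; NOT the Altoona dataset.
# None marks a missing value in a column such as "Sex F".
sex_f = [12, None, 30, None, 18, None]

# Average when rows with missing values are excluded from the calculation.
observed = [v for v in sex_f if v is not None]
avg_skipping_missing = mean(observed)

# Average after each missing value has been replaced by zero.
filled = [0 if v is None else v for v in sex_f]
avg_with_zeros = mean(filled)

print("skipping missing values:", avg_skipping_missing)
print("missing replaced by 0:  ", avg_with_zeros)
```

The two results differ because the first calculation divides by the number of non-missing values only, while the second divides by the total number of rows.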

 

 

2/4

Replace missing values.

 

 

ACTIVITY

 

 

 

 

  1. Add the Replace Missing Values operator. Connect it.
  2. Click on the Replace Missing Values operator, then in the Parameters panel set attribute filter to single, attribute to Sex F, and default to zero.
  3. Click Run to execute the process.

 

 

 

EXPLANATION

 

 

 

 

Step 2 says “replace missing values for a single attribute called Sex F with zeros”.

 

In the Results view we see that the missing values for Sex F have indeed been replaced by zeros. You can sort the Sex F column in ascending order to verify: all three missing values have been replaced by zeros.
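For readers who want to see the same idea in code, here is a minimal Python sketch of replacing missing values in a single attribute and then sorting to verify. The rows, the values, and the helper name `replace_missing` are hypothetical and are not part of RapidMiner; `None` stands in for a missing value.

```python
# Hypothetical rows for illustration only; NOT the Altoona dataset.
rows = [
    {"Offense Code": "A", "Sex F": 12, "Sex M": 40},
    {"Offense Code": "B", "Sex F": None, "Sex M": 25},
    {"Offense Code": "C", "Sex F": 30, "Sex M": 36},
    {"Offense Code": "D", "Sex F": None, "Sex M": 60},
]


def replace_missing(rows, attribute, default=0):
    """Return new rows where None in `attribute` is replaced by `default`.

    Only the named attribute is touched; other columns are left as they are.
    """
    return [
        {**row, attribute: default if row[attribute] is None else row[attribute]}
        for row in rows
    ]


cleaned = replace_missing(rows, "Sex F", default=0)

# Sorting in ascending order brings the replaced zeros to the top,
# which mirrors the manual check in the Results view.
for row in sorted(cleaned, key=lambda r: r["Sex F"]):
    print(row)
```

Unlike the RapidMiner operator, this sketch leaves the column order unchanged.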

 

However, the Replace Missing Values operator has also changed the order of our columns: the ones affected by the operator (in this case, just the Sex F column) have been moved to the beginning of the table. We would like to restore the table's original order, where the Offense Code column came first. We do that next with the Reorder Attributes operator.

 

 

3/4

Return attributes to their original order in the table.

 

 

ACTIVITY

 

 

 

 

  1. Return to Design view, and disconnect the Replace Missing Values operator from the “res” port.
  2. Add the Reorder Attributes operator. Connect it.
  3. Click on the Reorder Attributes operator, then in the Parameters panel set sort mode to user specified, and attribute ordering to Offense Code, Sex F, Sex M (in that order).
  4. Click Run to execute the process.

 

 

 

EXPLANATION

 

 

 

 

In step 3 above the user specifies that the order of attributes should be Offense Code, Sex F, Sex M.
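The same user-specified ordering can be sketched in Python. The rows and the helper name `reorder_attributes` are hypothetical, chosen only to mirror the operator's name; the sketch assumes every row contains all of the listed attributes.

```python
# Hypothetical rows for illustration only; NOT the Altoona dataset.
# "Sex F" comes first here, as it does after Replace Missing Values runs.
rows = [
    {"Sex F": 12, "Offense Code": "A", "Sex M": 40},
    {"Sex F": 0, "Offense Code": "B", "Sex M": 25},
]


def reorder_attributes(rows, ordering):
    """Return new rows whose columns follow the user-specified `ordering`."""
    return [{name: row[name] for name in ordering} for row in rows]


ordering = ["Offense Code", "Sex F", "Sex M"]
reordered = reorder_attributes(rows, ordering)

for row in reordered:
    print(list(row.items()))
```

Python dictionaries preserve insertion order, so building each row in the requested order is enough to fix the column order.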

 

 

4/4

Inspect the changes in data.

 

Congratulations! By handling the dataset's missing values, you performed data cleansing and thereby achieved higher data quality. Note: Replacing missing values works when we know what those values should be. When we do not know them either, handling missing values can instead take the form of removing the rows (examples) or columns (attributes) that contain them.
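The two removal strategies mentioned in the note can be sketched in Python as follows. The rows and both helper names are hypothetical, `None` stands in for a missing value, and the sketch assumes all rows share the same attributes.

```python
# Hypothetical rows for illustration only; NOT the Altoona dataset.
rows = [
    {"Offense Code": "A", "Sex F": 12, "Sex M": 40},
    {"Offense Code": "B", "Sex F": None, "Sex M": 25},
    {"Offense Code": "C", "Sex F": 30, "Sex M": 36},
]


def drop_rows_with_missing(rows):
    """Keep only the rows (examples) that have no missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def drop_columns_with_missing(rows):
    """Keep only the columns (attributes) that have no missing value in any row."""
    keep = [name for name in rows[0] if all(row[name] is not None for row in rows)]
    return [{name: row[name] for name in keep} for row in rows]


print(drop_rows_with_missing(rows))
print(drop_columns_with_missing(rows))
```

Removing rows keeps every attribute but loses examples, while removing columns keeps every example but loses an attribute; which trade-off is acceptable depends on the analysis.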

 

 

CHALLENGE

 

 

 

 

  1. One reason we handled missing values was that, with them present, the computer calculated the wrong average number of months for all crimes with female offenders. That average was found to be around 30 (30.719) in Module 7. What is the actual average now, with no missing values? Was the original average underestimated or overestimated?
  2. Update the process design so that missing values are replaced by the average rather than by zero. What would be the answer to question 1 in this case?

 

 

 

 

Previous Page: RapidMiner Module 7: Pivoting & Advanced Renaming
Next Page: RapidMiner Module 9: Macros & Sampling

Copyright 2025 © The Pennsylvania State University Privacy Non-Discrimination Equal Opportunity Accessibility Legal