
Data Science Tools


RapidMiner Module 4: Creating & Removing Columns

1/4

Expand the new combined dataset with new attributes for even more insights.

 

Once we have combined multiple datasets, we can gain further insight by creating new columns, and then sharpen our focus by removing some old ones. New columns can hold formulas that describe our data in a new way. Old columns may hold data that is not of interest in the given analysis.

 

In this tutorial, we are going to create & remove columns in our combined dataset to answer the following questions:

  1. What is Altoona’s crime rate, and has it really increased over the period, or not?
  2. Looking at the proportion of different racial and ethnic groups in Altoona’s population, are any groups significantly overrepresented, or underrepresented in recorded crimes?

 

 

ACTIVITY

 

 

 

 

  1. Drag the stored Altoona Combined Data into the Process.
  2. Add the Generate Attributes operator.
  3. Connect all operators.
  4. Click the Generate Attributes operator, then in the Parameters panel under function descriptions click on Edit List. A new window opens. Add the following attribute names and function expressions:

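attribute name: % Crimes by Black
function expression: [sum(Adult Race Black)]/[sum(Adult Total)]

attribute name: % Pop Black
function expression: [mode(Pop Race Black)]/[mode(Pop Total)]

attribute name: % Crimes by White
function expression: [sum(Adult Race White)]/[sum(Adult Total)]

attribute name: % Pop White
function expression: [mode(Pop Race White)]/[mode(Pop Total)]

attribute name: % Crimes in Population
function expression: [sum(Adult Total)]/[mode(Pop Total)]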

 

You can either copy and paste the above directly into the appropriate fields, or build each function expression yourself: click the little calculator symbol at the side, and when another window opens, find the attributes you need in the Inputs list. For example, for the new attribute % Crimes by Black, first look up sum(Adult Race Black), then type the division symbol /, and then look up sum(Adult Total). This gives you the first function expression in the list above.

 

 

 

EXPLANATION

 

 

 

 

Remember, attributes are RapidMiner lingo for columns, so the operator Generate Attributes literally means “create new columns”.

 

The 5 new attributes (columns) we created above measure, respectively: the proportion of recorded crimes committed by black people, the proportion of black people in the total population, the proportion of recorded crimes committed by white people, the proportion of white people in the total population, and the proportion of total crimes to total population, which is essentially the crime rate. Although the attribute names start with %, each expression yields a plain ratio (for example 0.2), not a value multiplied by 100. The last attribute directly addresses one of the questions we had: Altoona’s crime rate over time.
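For readers who think in code, the five function expressions above can be sketched in plain Python. This is only an illustration of the arithmetic, not how RapidMiner works internally: the two rows and all their numbers are made up, and the helper name generate_attributes is hypothetical. Only the column names come from the tutorial.

```python
# Hypothetical monthly rows; every number here is made up for illustration.
rows = [
    {"Month": "2015-01", "sum(Adult Race Black)": 20, "sum(Adult Race White)": 80,
     "sum(Adult Total)": 100, "mode(Pop Race Black)": 2000,
     "mode(Pop Race White)": 42000, "mode(Pop Total)": 46000},
    {"Month": "2015-02", "sum(Adult Race Black)": 30, "sum(Adult Race White)": 90,
     "sum(Adult Total)": 120, "mode(Pop Race Black)": 2000,
     "mode(Pop Race White)": 42000, "mode(Pop Total)": 46000},
]


def generate_attributes(row):
    """Return a copy of the row with the five ratio columns added."""
    out = dict(row)
    out["% Crimes by Black"] = row["sum(Adult Race Black)"] / row["sum(Adult Total)"]
    out["% Pop Black"] = row["mode(Pop Race Black)"] / row["mode(Pop Total)"]
    out["% Crimes by White"] = row["sum(Adult Race White)"] / row["sum(Adult Total)"]
    out["% Pop White"] = row["mode(Pop Race White)"] / row["mode(Pop Total)"]
    out["% Crimes in Population"] = row["sum(Adult Total)"] / row["mode(Pop Total)"]
    return out


expanded = [generate_attributes(r) for r in rows]
for r in expanded:
    print(r["Month"], r["% Crimes by Black"], r["% Crimes in Population"])
```

Each new column is computed row by row from existing columns, which is the same idea as one entry in the Edit List window.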

 

 

2/4

Use new attributes to create even newer attributes.

 

 

ACTIVITY

 

 

 

 

  1. Add another Generate Attributes operator – it will show in the process as Generate Attributes (2).
  2. Connect Generate Attributes (2) to Generate Attributes behind it.
  3. As before, click the Generate Attributes (2) operator, then in the Parameters panel under function descriptions click on Edit List, and add the following attribute names and function expressions:

attribute name: Diff Black
function expression: [% Crimes by Black]-[% Pop Black]

attribute name: Diff White
function expression: [% Crimes by White]-[% Pop White]

As before, you can either copy and paste the above directly, or you can get the same expressions by using the little calculator.

 

 

 

EXPLANATION

 

 

 

 

Notice how we are able to use the attributes created by the first Generate Attributes to create new attributes with Generate Attributes (2). This approach can be repeated as many times as needed.

 

The 2 new attributes (columns) we created above measure, first, the difference between the proportion of crimes committed by black people and the proportion of the population composed of black people, and second, the difference between the proportion of crimes committed by white people and the proportion of the population composed of white people. These 2 attributes will help us answer the other question we had at the start: whether certain groups are overrepresented or underrepresented in recorded crimes. For example, if in a given month the proportion of crimes committed by black people is higher than the proportion of black people in the population, the difference will be positive, meaning that black people are overrepresented in recorded crimes compared to their proportion in the overall population.
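The sign logic described above can be sketched in a few lines of Python. The rows and their values are made up, and the helper name representation is hypothetical; only the attribute names are taken from the tutorial.

```python
# Hypothetical rows that already contain the ratio columns (made-up values).
rows = [
    {"Month": "2015-01", "% Crimes by Black": 0.20, "% Pop Black": 0.04,
     "% Crimes by White": 0.80, "% Pop White": 0.91},
    {"Month": "2015-02", "% Crimes by Black": 0.03, "% Pop Black": 0.04,
     "% Crimes by White": 0.97, "% Pop White": 0.91},
]

# Second Generate Attributes step: new columns built from earlier new columns.
for row in rows:
    row["Diff Black"] = row["% Crimes by Black"] - row["% Pop Black"]
    row["Diff White"] = row["% Crimes by White"] - row["% Pop White"]


def representation(diff):
    """Interpret the sign of a difference column."""
    if diff > 0:
        return "overrepresented"
    if diff < 0:
        return "underrepresented"
    return "proportional"


for row in rows:
    print(row["Month"],
          "Black:", representation(row["Diff Black"]),
          "White:", representation(row["Diff White"]))
```

A positive difference means the group's share of recorded crimes exceeds its share of the population; a negative difference means the opposite.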

 

 

3/4

Remove unimportant columns to focus on the questions at hand.

 

 

ACTIVITY

 

 

 

 

  1. Add the Select Attributes operator.
  2. Connect Select Attributes to Generate Attributes (2) behind it, and the result port in front.
  3. Click the Select Attributes operator, then in the Parameters panel set attribute filter type to subset, and under attributes choose the following: % Crimes in Population, Diff Black, Diff White, and Month.
  4. Click Run to execute the process.

 

 

 

EXPLANATION

 

 

 

 

The above steps keep only the 4 attributes (columns) we need to answer our questions from the start of this Module. Attribute % Crimes in Population tells us Altoona’s crime rate over time. Attributes Diff Black and Diff White tell us whether black and white people, respectively, have been underrepresented (if negative) or overrepresented (if positive) in the crimes on record when compared to their overall proportion in the population. Finally, attribute Month provides the dimension of time for the other 3 attributes.
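Keeping a subset of columns can be pictured in Python as follows. This is a rough sketch rather than RapidMiner's implementation: the single row and its values are made up, and select_attributes is a hypothetical helper named after the operator.

```python
# The four attributes chosen in the Select Attributes step.
KEEP = ["Month", "% Crimes in Population", "Diff Black", "Diff White"]

# One hypothetical row with both old and new columns (made-up values).
rows = [
    {"Month": "2015-01", "sum(Adult Total)": 100, "mode(Pop Total)": 46000,
     "% Crimes in Population": 100 / 46000,
     "Diff Black": 0.16, "Diff White": -0.11},
]


def select_attributes(rows, keep):
    """Return new rows that contain only the named columns, in that order."""
    return [{name: row[name] for name in keep} for row in rows]


selected = select_attributes(rows, KEEP)
print(selected[0])
```

The original rows are left untouched; only the returned copies are narrowed to the four columns.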

 

 

4/4

Answer more complex questions with ease.

 

Congratulations! You just successfully created new attributes, and removed old ones. Now you can easily answer the two questions from the start, as well as many more:

 

 

CHALLENGE

 

 

 

 

  1. What is Altoona’s crime rate (range), and has it really increased over the period, or not? Use both the Statistics tab, and the Charts tab in Results view to answer. 
  2. Looking at the proportion of black and white racial groups in Altoona’s population, are they significantly overrepresented, or underrepresented in recorded crimes? Use the Charts tab in Results view to answer. Hint: in Charts tab use the Series option (not the Series Multiple!) to show both variables on the same graph with the same scale.
  3. Looking instead at the proportion of Hispanic and non-Hispanic ethnic groups, do we see a similar pattern of over- and underrepresentation as with black and white race?
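If you want to double-check your reading of the Statistics and Charts tabs for question 1, the same idea can be sketched in Python: take the minimum and maximum for the range, and fit a simple least-squares slope to judge the direction of the trend. The four crime rate values below are made up and are not the answer to the challenge; trend_slope is a hypothetical helper, and a slope is only one possible way to summarize a trend.

```python
from statistics import mean

# Hypothetical monthly crime rates in time order (made-up numbers).
crime_rate = [0.0020, 0.0024, 0.0022, 0.0028]

# Range of the crime rate, as read from min and max in a statistics summary.
low, high = min(crime_rate), max(crime_rate)


def trend_slope(values):
    """Least-squares slope of values against their position 0, 1, 2, ..."""
    xs = range(len(values))
    mean_x, mean_y = mean(xs), mean(values)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


slope = trend_slope(crime_rate)
print(f"range: {low} to {high}")
print("rising" if slope > 0 else "flat or falling")
```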

 

 

 

 

Next Page: RapidMiner Module 5: Changing Types & Roles for Modeling
Previous Page: RapidMiner Module 3: Merging & Grouping

Copyright 2025 © The Pennsylvania State University Privacy Non-Discrimination Equal Opportunity Accessibility Legal