
Data Science Tools


RapidMiner Module 5: Changing Types & Roles for Modeling

1/6

Retrieve the data, and note its different types and roles.

In this tutorial, we prepare data for modeling by changing column types and roles, i.e. determining which column (or variable) should be predicted by which other columns (or variables) in our dataset. We then predict the profile of Altoona offenders most likely to commit a crime based on sex, age, race, and ethnicity. In our modeling, we use a classic machine learning method called Decision Tree.

ACTIVITY

  1. Drag the stored Altoona Crime Rates (not the combined dataset) into the Process.
  2. Click on the newly created Retrieve Altoona Crime Rates operator, and hover the mouse over its output port: a small window pops up and displays some metadata about the dataset. Note the metadata table and two of its columns, Role and Type.

EXPLANATION

Type defines the kind of values an attribute (column) can take, e.g. date_time (Jan 1, 2013 12:00:00 AM EST), polynominal (01ABC-4DEF), or integer (0, 1, 2, 3, -10).

 

Role describes how the attribute will be used by machine learning operators. Attributes without an assigned role are by default regular attributes – these are used as inputs by machine learning operators. All of our attributes currently have a regular role. Attributes which we want to predict need to be assigned the role of label attributes (sometimes also called target or class attributes) – these are used as outputs by machine learning operators.

 

Say we want to predict the total number of crimes committed by adults. In this case, we would choose the attribute Adult Total, and change its role to label. This is what we do next.
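
For readers who like to see the same step outside RapidMiner, here is a minimal Python/pandas sketch. The file name altoona_crime_rates.csv is an assumption (in RapidMiner the data comes from the stored repository entry, not a CSV file), and pandas has no notion of a role, so we simply record which column we intend to predict.

  import pandas as pd

  # Hypothetical file name; in RapidMiner the data is retrieved from the
  # stored "Altoona Crime Rates" repository entry instead of a CSV file.
  df = pd.read_csv("altoona_crime_rates.csv")

  # Rough analogue of the metadata pop-up: one type per column.
  print(df.dtypes)

  # pandas has no "role" concept; we just note which column is the label
  # (the one to predict) and treat the remaining columns as regular attributes.
  label_column = "Adult Total"
  feature_columns = [c for c in df.columns if c != label_column]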

 

 

2/6

Change the target attribute’s type for predictive modeling by discretizing it.

ACTIVITY

  1. Add the Discretize by Binning operator. Connect it.
  2. Click on the Discretize by Binning operator, then in the Parameters panel set attribute filter type to single (i.e. you only work on a single attribute), attribute to Adult Total, and number of bins to 2.

EXPLANATION

Discretization is a common technique for transforming an attribute’s type from numerical to polynominal (a nominal type with more than 2 values), the type needed by the machine learning method called Decision Tree, which we will be using later on.

 

“Binning” assigns each Adult Total example (row) to one of two “bins” (groups) covering equal parts of the overall range of values. Discretization then replaces the original numerical value in each example (row) with the name of the “bin” the value belongs to.
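
As a rough Python analogue (assuming the df and column name from the earlier sketch), pd.cut performs the same kind of equal-width binning; the bin labels below are illustrative, not RapidMiner’s exact range names.

  import pandas as pd

  # Equal-width binning into 2 bins, analogous to Discretize by Binning with
  # number of bins = 2: pd.cut splits the overall value range into two
  # equal-width intervals and replaces each value with its bin label.
  df["Adult Total"] = pd.cut(df["Adult Total"], bins=2, labels=["range1", "range2"])

  # How many rows fell into each bin.
  print(df["Adult Total"].value_counts())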

 

 

3/6

Change the target attribute’s role for predictive modeling to label.

ACTIVITY

  1. Add the Set Role operator. Connect it.
  2. Click on the Set Role operator, then in the Parameters panel set attribute name to Adult Total and target role to label.
  3. Connect the Set Role operator to the res port, and click Run to execute the process.

EXPLANATION

Step 2 above essentially tells the machine (i.e. computer) that attribute Adult Total is the one that needs to be predicted; all other attributes can be used to help predict it.

 

In Results view, Data tab, you can see that the column of attribute Adult Total is now shown in a different color (designating it as the label that needs to be predicted), and that the values of individual examples have been replaced with the “bins” covering the two halves of the attribute’s value range, namely range1 [-∞ – 36.500] and range2 [36.500 – ∞]. Switch to the Statistics tab, and you will also see that above the statistics of Adult Total there is now a marker saying “label”.
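
Outside RapidMiner there is no explicit role setting; in a scikit-learn-style workflow the same intent is expressed by separating the label column from the regular attributes, as in this minimal sketch (continuing from the assumed df above).

  # The "label" role in code: Adult Total is the column to be predicted,
  # everything else remains a regular (input) attribute.
  y = df["Adult Total"]                  # label / target attribute
  X = df.drop(columns=["Adult Total"])   # regular attributes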

 

 

4/6

Keep only attributes relevant to predicting your target attribute.

ACTIVITY

  1. Return to Design view, and disconnect the Set Role operator from the “res” port.
  2. Add the Select Attributes operator. Connect it.
  3. Click on the Select Attributes operator, then in the Parameters panel set attribute filter type to subset and attributes to all non-juvenile ones (i.e. exclude all Age attributes younger than 18 and all attributes that start with Juvenile…; include all Age attributes 18 and older, all attributes that start with Adult…, and also Month, Offense Code, and Sex).

EXPLANATION

We excluded all juvenile attributes because we are focusing on adult offenders only, and keeping the details of juvenile offenses would not help us better predict adult offenses.
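
A hedged pandas sketch of the same column selection, continuing from the X defined above; the exact column names are assumptions based on the description in the activity, so the prefix filter is illustrative rather than exact.

  # Keep only non-juvenile attributes: drop every column whose name starts
  # with "Juvenile" (in the real dataset the under-18 age-range columns
  # would be dropped here as well).
  juvenile_cols = [c for c in X.columns if c.startswith("Juvenile")]
  X = X.drop(columns=juvenile_cols)

  print(X.columns.tolist())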

 

 

5/6

Predict the target attribute using a decision tree model.

ACTIVITY

  1. Add the Decision Tree operator. Connect it on both ends (to the previous operator, and to the “res” port).
  2. Click Run to execute the process.
  3. Inspect the resulting decision tree.

EXPLANATION

In Design view, notice how the connections between operators are blue up to the Decision Tree operator (so-called data connections) and green after that operator (so-called model connections, because this is where the data has been modeled).

 

In Results view, you can see the resulting decision tree. How do we interpret it? Start from the rightmost tree “branch” and work leftward. This branch has the thickest arrows, meaning the majority of the data is explained by it. So, the majority of Adult Total crimes falls within the “bin” called range1 [-∞ – 36.500] (meaning the majority of months had fewer than 37 adult crimes), and this is connected with Adult Ethnic Non-Hispanic being smaller than 35.500 (meaning there were fewer than 36 crimes committed by ethnically non-Hispanic offenders in those months) and Age 65+ being smaller than 3.500 (meaning there were fewer than 4 crimes committed by offenders aged 65 or older in those months). On the other hand, when were there more than 36 adult crimes? A look at all the “branches” where range2 dominates tells us the answer. The leftmost branch says there were more than 36 adult crimes when there were 36 or more crimes committed by ethnically non-Hispanic offenders. This is not particularly surprising. What is surprising is the other range2 “branch”: it states that there were more than 36 adult crimes even with fewer than 36 crimes by non-Hispanic offenders, provided the date was before June 2015. We had not really noticed this trend before, so this is an interesting new insight for this subgroup of offenders. Of course, the good news is that, even for that subgroup, adult crimes fell after June 2015, as evidenced by the other “branch” splitting under Month. So, in conclusion, what have we learned about the predictability of Altoona’s crime rate? If law enforcement assumes crimes will continue to follow a similar pattern, they might want to focus their attention more closely on the ethnically non-Hispanic population in Altoona.
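
For comparison, here is a compact scikit-learn sketch of the modeling step, continuing from the assumed X and y above. The one-hot encoding and the max_depth setting are illustrative choices, not a reproduction of RapidMiner’s Decision Tree defaults, and any missing values would need handling first.

  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier, export_text

  # Categorical inputs (e.g. Month, Offense Code, Sex) must be encoded as
  # numbers; one-hot encoding via get_dummies is a simple, assumed choice.
  X_encoded = pd.get_dummies(X)

  # Fit a classification tree on the binned (range1/range2) label.
  tree = DecisionTreeClassifier(max_depth=3, random_state=0)
  tree.fit(X_encoded, y)

  # A text rendering of the fitted tree, roughly comparable to the tree
  # shown in RapidMiner's Results view.
  print(export_text(tree, feature_names=list(X_encoded.columns)))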

 

 

6/6

Practice machine learning a bit more.

 

Congratulations! We just built our first machine learning model – a simple decision tree. As datasets become bigger and more complex, machine learning models like the decision tree become more and more useful in quickly giving us insights from the data that we would not have found on our own, or would only have found with significantly more time.

CHALLENGE

  1. What happens to the model if you discretize the dataset into 3 bins instead of 2? Does the decision tree get more branches, or fewer, than the current one?
  2. What happens to the model if you keep discretization at 2 bins, but when selecting attributes drop all those starting with Adult… from the selection, so that most selected attributes are just different adult age ranges? Does the decision tree get more branches, or fewer, than the current one?
  3. What if instead of predicting adult crimes, you want to predict juvenile crimes, using Juvenile Total as the label attribute? How would you change the whole model? What do you find to be the main determinants of juvenile crime in Altoona? Is the conclusion similar to adult crimes or not?

Next Page: RapidMiner Module 6: Normalization & Detecting Outliers
Previous Page: RapidMiner Module 4: Creating & Removing Columns
