RapidMiner Module 7: Pivoting & Advanced Renaming

1/4

Aggregate the data by attributes of interest.

In this tutorial, we pivot the Altoona crime data from long table to wide table format, then use advanced renaming of the attributes to make the final outcome neater. Pivoting rotates the data from a long table format (one attribute with a lot of examples) into a wide table format (lots of attributes with a single example). This transformation is especially useful after an aggregation since the aggregated information is always stored in a long format, while machine learning models often need the data to be stored in the wide
format.

For example, we can use pivoting to show how many months in our dataset each sex committed each type of crime, and whether both sexes commit all types of crime or not.

ACTIVITY

Drag the stored Altoona Crime Rates into the Process.
Add the Aggregate operator. Connect it.
Click on the Aggregate operator, then in the Parameters panel set: aggregation attributes to Sex (for aggregation attribute) and count (for aggregation functions); group by attributes to Offense Code and Sex.
Connect Aggregate operator to res port, and click Run to execute the process.

EXPLANATION

The Results view shows you a table with three columns: Offense Code, Sex, and count(Sex), which counts how many months in the dataset had this offense code committed by this sex. For instance, 01A-Murder and Non negligent Manslaughter has been committed by F (female offenders) in exactly 2 months in the dataset. All
combinations of the offense code and sex are represented as rows (examples) which is why we call this a long
table format (because rows are “long”).

2/4

Pivot the data from long to wide format to allow
side-by-side comparison.

Next, we transform the long table format into wide so that we can compare male and female offenders side-by-side.

ACTIVITY

Return to Design view, and disconnect operator Aggregate from the “res” port.
Add the Pivot operator. Connect it.
Click on the Pivot operator, then in the Parameters panel set: group attribute to Offense Code, and index attribute to Sex.
Connect Pivot operator to res port, and click Run to execute the process.

EXPLANATION

The Results view shows you a table with less rows than before. Now each single row represents one crime from the Offense Code attribute (the group attribute parameter). On the other hand, F and M, the two different values from the Sex attribute (the index attribute) are now the two other columns – and the actual values in these two columns are the counts from aggregation. It is now easy to compare side-by-side: For instance, 01A-Murder and Non negligent Manslaughter has been committed in 2 months by females, and 3 months by males. Even more interesting, we see that the next crime, 01B-Manslaughter by Negligence has never been committed by females in our dataset!

3/4

Rename all new columns for a neater presentation.

The names of the new columns reflect how they were created. If we want to simplify their names, we can use operator Rename to rename each individually, or we can use Rename by Replacing which is handier for replacing multiple attributes simultaneously, and which we demonstrate below:

ACTIVITY

Return to Design view, and disconnect operator Pivot from the “res” port.
Add the Rename by Replacing operator. Connect it.
Click on the Rename by Replacing operator, then in the Parameters panel set: replace what to count((.*)\)_(.*), and replace by to $1 $2.
Add the Store operator. Connect it.
Click the Store operator, then in the Parameters panel under repository entry store this new dataset as Altoona Crime Rates by Sex in your own Local Repository.
Connect to res port, and click Run to execute the process.

EXPLANATION

Your Results view should now have columns that say Sex F, and Sex M.

You might already be familiar with regular expressions (this is the name for the strange parameters we have used for the renaming). The expression you have used for replace what means that you look for something, designated by (.*), between count( and )_ and then something else, again designated by (.*), after the underscore. Each time you use round brackets, you define a new so-called capturing group which you can refer to in the definition of the replacement. Since the round brackets have a special meaning here in the original expression, we need to quote the brackets in the name itself with a backslash. Finally, we can use the capturing groups in the replace by parameter with the dollar sign and the number identifying the group. $1 is for the content of the first group, which always happens to be “Sex”, and $2 is for identifying the second group. Those are the two different sexes “F”, and “M”.

We are using Store because we will need the newly formatted output in the next Module.

4/4

Analyze female and male offenders side-by-side with the new
pivoted table.

Congratulations! You just successfully performed a pivot of the data, and renamed it, making it easy to compare female and male monthly crimes side-by-side.

CHALLENGE

Looking at the resulting table, is there a crime that was never committed by females? What about the males? You can sort the table by columns for a quicker answer. A missing value (“?”) means there were no data points there before pivoting.
What crime(s) were committed the least number of months by males?
Who committed crimes more months on average, females or males?