RapidMiner Module 3: Merging & Grouping

1/5

Combining datasets for more insight.

Sometimes, we can get more insight by combining multiple data sources. In this tutorial, we are going to join the Altoona crime dataset with an Altoona population dataset to compare crime statistics with general population statistics in the area.

ACTIVITY

Drag the already imported Altoona Crime Rates data from the Repository panel into the Process panel.
Import the new, second dataset: Just like before, click Add Data in the Repository panel, navigate to and select the file Lab 1 Data Altoona Population Estimates.csv for import, and store the data as Altoona Population Estimates in your own Local Repository.
Once imported, drag the Altoona Population Estimates data from the Repository panel into the Process panel.

EXPLANATION

Remember, RapidMiner transforms the data into operators (Retrieve Altoona Crime Rates and Retrieve Altoona Population Estimates), but doesn’t load the data until you execute (run) the process.

2/5

Join the data.

ACTIVITY

Search for the Join operator in the Operators panel. Drag Join into the Process panel.
Connect the output port of each data operator (Retrieve Altoona Crime Rates and Retrieve Altoona Population Estimates) to a separate input port of Join (it doesn’t matter which operator you connect to which input port).
Click the Join operator, then in the Parameters panel deselect use id attribute as key. The key attributes field appears. Click Edit List. Select Month as both the left and the right key attribute. Click Apply.

EXPLANATION

The Join operator combines the two datasets into a single one by matching on the Month attribute (attribute is RapidMiner's term for a column), which appears in both datasets. When you click on Join, the Parameters panel says “join type: inner”. This means that the join keeps only those examples (RapidMiner's term for rows) that have a matching month in both datasets.

Step 2 in the above instructions is important: if you did not connect the operators, Join would not know what data is available, and you would not be able to select Month as the common attribute of the two datasets.
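
To see what an inner join on Month does outside of RapidMiner, here is a minimal Python sketch. It is not how RapidMiner implements Join. The sample values, the month format, and the Crime attribute name are invented for illustration; only Month, Adult Total, and Pop Total are names taken from the datasets in this tutorial.

```python
# Hypothetical sample rows; the real datasets have more attributes and months.
crime_rows = [
    {"Month": "2013-01", "Crime": "Robbery", "Adult Total": 3},
    {"Month": "2013-01", "Crime": "Burglary", "Adult Total": 5},
    {"Month": "2013-02", "Crime": "Robbery", "Adult Total": 2},
    {"Month": "2013-03", "Crime": "Robbery", "Adult Total": 4},  # no matching population row
]
population_rows = [
    {"Month": "2013-01", "Pop Total": 46000},
    {"Month": "2013-02", "Pop Total": 45990},
]


def inner_join(left, right, key):
    """Keep only pairs of rows whose key value appears in both tables."""
    # Index the right-hand rows by key so each lookup is fast.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for left_row in left:
        # Left rows without a match produce nothing (this is what "inner" means).
        for right_row in right_by_key.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


joined_rows = inner_join(crime_rows, population_rows, "Month")
for row in joined_rows:
    print(row)
```

In this sketch the 2013-03 crime row is dropped because no population row has that month, and the 2013-01 population value appears on both 2013-01 crime rows. That repetition is why the grouping step later in this module needs to collapse the population attributes as well.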

3/5

Group the data.

ACTIVITY

Search for the Aggregate operator in the Operators panel. Drag Aggregate into the Process panel.
Connect the output port of Join to the input port of Aggregate.
Click the Aggregate operator, then in the Parameters panel make the following changes: Select use default aggregation. New options appear. Set attribute filter type to subset. Set attributes to all 7 Adult criminal attributes (Adult Ethnicity Hispanic, Adult Ethnicity Non-Hispanic, Adult Race Asian Pacific, Adult Race Black, Adult Race Native American, Adult Race White, and Adult Total). Set default aggregation function to sum.
Still in the Parameters panel, set aggregation attributes to the following 7 Pop general population attributes: Pop Ethnicity Hispanic, Pop Ethnicity Non-Hispanic, Pop Race Asian, Pop Race Black, Pop Race Native American, Pop Race White, and Pop Total. You will need to click Add Entry 6 times to add 6 more attributes to the list, as there is only 1 by default. Set all aggregation functions to mode. Remember to click Apply when done.
Still in the Parameters panel, set group by attributes to Month.

EXPLANATION

If you look at the datasets separately (e.g. in Excel, or by importing them individually in RapidMiner), you will notice that Altoona Crime Rates has a number of examples (rows) for each month showing different types of crime, while Altoona Population Estimates has only one example (row) for each month showing the number of people living in Altoona that month. To compare overall crime figures with the overall general population, we group together all crimes in a given month, so that in the end we have only one example (row) for each month. This is what Step 5 does.

Now let us say we want to compare the criminal population and the general population by ethnicity, race, and total number:

Step 3 ensures the relevant examples (rows) from Altoona Crime Rates about ethnicity, race, and total number are collapsed (summed up) for a given month, e.g. instead of having Adult Ethnicity Hispanic individual number for Robbery in January 2013, individual number for Burglary in January 2013, etc., we only have the Adult Ethnicity Hispanic total sum for all crimes in January 2013.

Step 4 ensures the relevant values from Altoona Population Estimates are kept the way they are, because this dataset already has only one example (row) for each month. After the join, that single monthly value is repeated on every crime row for the month, so taking the mode (the most frequent value) simply returns it.

We could have done Step 3 the same way as Step 4 (i.e. without default aggregation), but using default aggregation saves some time entering the settings for Step 3. Unfortunately, it is not possible to use default aggregation for both Step 3 and Step 4: Step 3 uses the sum function, Step 4 uses the mode function, and default aggregation applies the same function to all the attributes it covers.
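
The grouping described above can be sketched in plain Python as well. This is only an illustration of the idea, not RapidMiner's Aggregate operator. The rows and values are hypothetical, and the aggregate helper is a name made up for this example; the sum(...) and mode(...) labels mirror the attribute names you will see in the Results view.

```python
from collections import defaultdict
from statistics import mode

# Hypothetical joined rows: the population value repeats on every crime row of a month.
joined_rows = [
    {"Month": "2013-01", "Adult Total": 3, "Pop Total": 46000},
    {"Month": "2013-01", "Adult Total": 5, "Pop Total": 46000},
    {"Month": "2013-02", "Adult Total": 2, "Pop Total": 45990},
    {"Month": "2013-02", "Adult Total": 6, "Pop Total": 45990},
]


def aggregate(rows, group_by, functions):
    """Collapse rows to one per group, applying one function per attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row)

    result = []
    for key, members in groups.items():
        out = {group_by: key}
        for attribute, (label, func) in functions.items():
            values = [member[attribute] for member in members]
            out[f"{label}({attribute})"] = func(values)
        result.append(out)
    return result


monthly = aggregate(
    joined_rows,
    group_by="Month",
    functions={
        "Adult Total": ("sum", sum),    # crime counts are added up per month
        "Pop Total": ("mode", mode),    # the repeated population value is kept as-is
    },
)
for row in monthly:
    print(row)
```

Each month ends up as a single row. Summing Pop Total instead of taking the mode would count the same population once per crime row, which is why the two groups of attributes need different functions.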

4/5

Store new combined dataset. Execute the process.

ACTIVITY

Search for the Store operator in the Operators panel. Drag Store into the Process panel.
Connect the output port of Aggregate to the input port of Store.
Click the Store operator, then in the Parameters panel under repository entry store this new dataset as Altoona Combined Data in your own Local Repository.
Connect the output port of Store to the result port on the right.
Click Run to execute the process.

EXPLANATION

We are using Store because we will need the newly combined dataset in the next Module.

When you Run the process and look at the Results view, you can adjust the width of the columns just as you would in a spreadsheet program: hover the mouse over the column border you want to change, and either double-click or drag it to the desired width.

5/5

Get new insight from the combined dataset.

Congratulations! You just successfully combined two datasets. Remember for future work, it is worth looking at the two datasets separately first. This helps us figure out what attribute (column) to use to match them together, and whether there are any structural differences that need to be taken into account when grouping them, e.g. having many examples (rows) per month in one dataset, vs. having only one example (row) per month in the other dataset.

CHALLENGE

Let’s compare Altoona’s crime rates with the general population estimates over time. In the Results view, go to the Charts tab, and under Chart style at the top of the tab, choose Series Multiple, which allows you to compare trends between multiple series. Under Plot Series choose two corresponding series, one from the crime rates and one from the population estimates, e.g. plot mode(Pop Ethnicity Hispanic) and sum(Adult Ethnicity Hispanic). Have the two series moved in similar directions over time, or not?
Repeat this for the other 6 corresponding pairs. The crime series always start with sum, while the population series always start with mode. As you repeat this for pairs that represent the majority of our data (e.g. white criminals & white people, or total adult criminals & total population), do you see a stronger connection between the two series, or a weaker one? In other words, as different population groups have gone down over time, have those groups’ crime numbers followed the downward trend or not? Have Altoona’s crime rates decreased, increased, or remained constant?
Delete the Store operator for a moment, and connect the output port of Aggregate to the result port. Change Aggregate so that instead of ethnicity and race, you are now only looking at people in age groups 60-64 and 65+ (dropping all other attributes from the process). For which of these 2 groups is it more visible that the crime series and population series seem to be moving in a similar direction?
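
If you want a rough number to go with what you see in the chart, one simple option is to count how often two series change in the same direction from one month to the next. The sketch below is only one possible check, not part of RapidMiner. The two series are invented values for illustration, not the Altoona data, and the function names are made up for this example.

```python
# Hypothetical monthly values, invented for illustration only.
crime_series = [8, 7, 9, 6, 5]
population_series = [46000, 45990, 45985, 45970, 45960]


def directions(series):
    """Return +1, -1, or 0 for each month-to-month change."""
    changes = []
    for previous, current in zip(series, series[1:]):
        diff = current - previous
        changes.append((diff > 0) - (diff < 0))
    return changes


def share_moving_together(a, b):
    """Fraction of month-to-month steps where both series change in the same direction."""
    pairs = list(zip(directions(a), directions(b)))
    same = sum(1 for x, y in pairs if x == y)
    return same / len(pairs)


agreement = share_moving_together(crime_series, population_series)
print(agreement)
```

A value close to 1 suggests the two series usually move together, while a value close to 0 suggests they usually move in opposite directions. It says nothing about how large the changes are, so use it alongside the chart rather than instead of it.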