1/5
Combining datasets for more insight.
Sometimes, we can get more insight by combining multiple data sources together. In this tutorial, we are going to join the Altoona crime dataset with an Altoona population dataset to compare crime statistics to
general population statistics in the area.
|
EXPLANATION |
|
|
|
|
Remember, RapidMiner transforms the data into operators (Retrieve Altoona Crime Rates and Retrieve Altoona Population Estimates), but doesn’t load the data until you execute (run) the process.
|
2/5
Join the data.
|
ACTIVITY |
|
|
|
|
|
|
EXPLANATION |
|
|
|
|
The Join operator will combine the two datasets into a single one, using the Month attribute (column in RapidMiner terminology) in both datasets. When you click on Join, the Parameters panel says “join type: inner”. This means that the joining will only keep those examples (rows in RapidMiner terminology) which have a matching month in both datasets.
Step 2 in the above instructions is important: if you did not connect the operators, Join would not know what data is available, and you would not be able to select Month as the common attribute of the two datasets.
|
3/5
Group the data.
|
ACTIVITY |
|
|
|
|
|
|
EXPLANATION |
|
|
|
|
If you look at the datasets separately (e.g. in Excel, or just importing them individually in RapidMiner), you can notice that Altoona Crime Rates has a number of examples (rows) for each month showing different types of crime, while Altoona Population Estimates has only one example (row) for each month showing number of people living in Altoona that month. To compare the overall crime population to the overall general population, we group together all crimes in a given month, so that in the end we have only one example (row) for each month in both datasets – this is what Step 5 does.
Now let us say we want to compare the criminal population, and the general population by ethnicity, race, and total number:
Step 3 ensures the relevant examples (rows) from Altoona Crime Rates about ethnicity, race, and total number are collapsed (summed up) for a given month, e.g. instead of having Adult Ethnicity Hispanic individual number for Robbery in January 2013, individual number for Burglary in January 2013, etc., we only have the Adult Ethnicity Hispanic total sum for all crimes in January 2013.
Step 4 ensures the relevant examples (rows) from Altoona Population Estimates are just kept the way they are (by taking the mode in each month), because this dataset already has only one example (row) for each month.
We could have done Step 3 the same way we do Step 4 (i.e. not use default aggregation), but this way (i.e. using default aggregation) we save some time entering commands for Step 3. Unfortunately, it is not possible to use default aggregation for both Step 3, and Step 4, because while Step 3 involves a sum function, Step 4 involves a mode function, and default aggregation requires the same function for all
|
4/5
Store new combined dataset. Execute the process.
|
EXPLANATION |
|
|
|
|
We are using Store because we will need the newly combined dataset in the next Module.
When you Run the process and look at the Results view, you can adjust the width of the columns same as you would in any other spreadsheet program – hover the mouse around the column border you want to change, and either double-click, or just drag and drop to the desired width.
|
5/5
Get new insight from the combined dataset.
Congratulations! You just successfully combined two datasets. Remember for future work, it is worth looking at the two datasets separately first. This helps us figure out what attribute (column) to use to match them together, and whether there are any structural differences that need to be taken into account when grouping them, e.g. having many examples (rows) per month in one dataset, vs. having only one example (row) per month in the other dataset.
|
CHALLENGE |
|
|
|
|
|
Next Page: Previous Page: