In this tutorial, we predict the profile of Altoona offenders most likely to commit a crime based on age, race, and ethnicity. In some programs (e.g. RapidMiner), it is possible to quickly build complex models like decision trees for this purpose. Tableau does not have a straightforward way to build a decision tree, so instead we build a set of simple linear regressions between the variable we are trying to predict, Adult Total, and each of the other relevant variables, to see which variables correlate most strongly with total adult crimes.
Step 1: Choose dependent & independent variables
We start by opening “Altoona Crime Rates.tde”. We use this dataset rather than the combined one because it has more individual rows of data, i.e. a bigger sample size for our regression.
- Go into Sheet1.
- Drag our dependent variable (i.e. the one we’re trying to predict), Adult Total, from the Data pane into the Rows shelf.
- Drag all our independent variables into the Columns shelf – this includes all adult variables pertaining to ethnicity, race, age (18 and above), and sex. The easiest way to drag them all at once is to hold down the CTRL key on the keyboard, and select them all, then drag them into the Columns shelf:
Step 2: Choose appropriate frequency of observations
The resulting view shows how Adult Total is related to each variable in the Columns shelf individually, broken down by Sex:
However, we only see one data point in each graph, because Tableau automatically aggregates the data at the highest level of aggregation.
- To choose the appropriate frequency of observations (in our case months), drag the Month dimension into the Marks card, under Detail, and change its frequency from discrete yearly to continuous monthly (choose the second “Month” option, which makes the “MONTH(Month)” bubble turn from blue to green):
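To see why adding Month to Detail changes the view from one point to many, here is a minimal Python sketch of the two aggregation levels. The rows and their values are made up for illustration, and the column layout is an assumption, not the actual structure of the Altoona extract.

```python
from collections import defaultdict

# Hypothetical rows: (month, adult_total, adult_race_white). Values are made up.
rows = [
    ("2015-01", 12, 9),
    ("2015-01", 8, 6),
    ("2015-02", 15, 11),
    ("2015-02", 10, 8),
    ("2015-03", 22, 17),
]

def aggregate_all(rows):
    """Sum every row into one (x, y) point, like a view with no detail field."""
    return (sum(r[2] for r in rows), sum(r[1] for r in rows))

def aggregate_by_month(rows):
    """Sum rows per month, giving one (x, y) point for each month."""
    totals = defaultdict(lambda: [0, 0])
    for month, adult_total, white in rows:
        totals[month][0] += white        # x: the independent variable
        totals[month][1] += adult_total  # y: the dependent variable
    return {month: tuple(xy) for month, xy in sorted(totals.items())}

single_point = aggregate_all(rows)
monthly_points = aggregate_by_month(rows)
print(single_point)    # one point for the whole dataset
print(monthly_points)  # one point per month
```

A regression needs several points per graph, which is why the monthly level is the one to use here.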
Step 3: Add a regression trend line
The resulting view now shows all graphs using monthly data, and the graphs are ready for regression:
1. Drag the Trend Line operator from the Analytics pane into the view, and drop it on the Linear Trend option for all columns (this happens when you drop it on the top box labeled “Linear”):
2. If your version of Tableau doesn’t automatically show confidence bands (the two additional lines above and below the trend line that mark its 95% confidence interval), add them manually: click the trend line in the first graph (the Marks card will say “SUM(Adult …” for the single variable whose trend line you clicked); then, holding down the CTRL key, click the trend line in the second graph (the Marks card will now say “All”, so that whatever you do next applies to all variables in the graphs):
Now, right-click on either of the two selected trend lines, and in the drop-down menu select “Edit Trend Lines…”:
A new window pops up – select “Show Confidence Bands”, and click OK.
That’s it! You should now see confidence bands on all your trend lines.
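For readers curious about what a trend line with a confidence band represents, the following Python sketch fits a least-squares line and computes the textbook 95% confidence interval for the fitted value at a chosen x. This is the standard formula from introductory statistics, not necessarily how Tableau computes its bands internally. The data values are made up, and `t_crit` is the two-sided 95% critical value of the t distribution for n - 2 degrees of freedom, supplied by the caller.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def confidence_band(xs, ys, x0, t_crit):
    """Lower and upper confidence limits for the fitted line at x0."""
    n = len(xs)
    slope, intercept = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    half_width = t_crit * s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)
    y_hat = intercept + slope * x0
    return y_hat - half_width, y_hat + half_width

# Made-up monthly counts: x is one predictor, y plays the role of Adult Total.
xs = [10, 14, 18, 22, 26]
ys = [21, 27, 38, 44, 55]
T_CRIT_DF3 = 3.182  # two-sided 95% t critical value for 3 degrees of freedom

for x0 in (10, 18, 26):
    low, high = confidence_band(xs, ys, x0, T_CRIT_DF3)
    print(x0, round(low, 2), round(high, 2))
```

The band is narrowest at the mean of x and widens toward the ends, which matches the curved shape of the bands in the view.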
Step 4: Interpret the results
The final resulting view shows neatly organized individual simple regression graphs between Adult Total and all other adult variables:
- Just browsing through all the graphs, tighter confidence bands mean a stronger correlation.
- By sex, most female variables seem to be clustering in the lower left corners, and most male variables in the upper right corners, meaning that men usually commit more crimes than women, across other variables (ethnicity, race, age).
- By ethnicity, there is a very strong correlation with Adult Ethnic Non-Hispanic, meaning that when the number of crimes committed by non-Hispanics goes up, the total number of crimes goes up in very similar proportion.
- By race, there is a similarly strong correlation with Adult Race White, meaning that white offenders contribute by far the most to the total number of crimes.
- By age, the strongest correlation seems to occur for Age 25-29, and Age 30-34, but only for men.
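The visual comparison above can also be expressed numerically. The sketch below, written in Python rather than Tableau, computes R² for each predictor against the dependent variable and sorts the predictors from strongest to weakest fit. All monthly values are made up for illustration, and the third field name is hypothetical; none of the numbers come from the Altoona data.

```python
def r_squared(xs, ys):
    """R-squared of a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # For one predictor, R-squared equals the squared correlation coefficient.
    return sxy ** 2 / (sxx * syy)

# Made-up monthly counts standing in for the dependent variable.
adult_total = [40, 55, 47, 62, 51, 70]

# Made-up monthly counts for three predictors (last name is hypothetical).
predictors = {
    "Adult Ethnic Non-Hispanic": [38, 52, 45, 59, 49, 66],
    "Adult Race White": [30, 44, 35, 50, 41, 55],
    "Adult Age 65 and Over": [2, 1, 3, 1, 2, 2],
}

# One simple regression per predictor, sorted from best to worst fit.
ranked = sorted(
    ((r_squared(xs, adult_total), name) for name, xs in predictors.items()),
    reverse=True,
)

for score, name in ranked:
    print(f"{name}: R^2 = {score:.3f}")
```

A high R² here corresponds to the tight confidence bands in the graphs, and a low one to wide bands.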
Review
This Module showed how to model a number of simple linear regressions for a given dependent variable, and how to try to get insights similar to those achieved in RapidMiner with the decision tree.
Compared to RapidMiner, this approach in Tableau gave us very little new information – we already knew almost everything from summary statistics in Module 1! The only real new information was the last one – that men aged 25-34 contribute significantly to the overall adult crime rate in Altoona.
We do not get the same depth of insights we would have gotten from a decision tree, like the correlations on level 2 and below. For example, a level 1 correlation is that adult crimes fall as crimes by non-Hispanics fall; a level 2 correlation would be that this fall in adult crimes is greater if there are also fewer crimes by people aged 65+, a result we got from the decision tree in RapidMiner, but which has shown up nowhere here in Tableau.
This Module shows a drawback of Tableau compared to RapidMiner – while Tableau makes it easier than RapidMiner to perform visual exploratory analysis of the data, it cannot give the same deep machine learning results that RapidMiner can.
Challenges:
Practice what you just learned by answering the following questions.
- What if instead of predicting adult crimes, you want to predict juvenile crimes, using Juvenile Total as the dependent variable? How would you change the whole model? What do you find to be the main determinants of juvenile crime in Altoona? Is the conclusion similar to adult crimes or not? Hint: do Module 5 on juvenile variables rather than the adult ones – some variables will remain (e.g. Sex), but many will change (e.g. all the age variables).