1/4
Normalize the dataset.
Sometimes a model (e.g. a decision tree) gives more dramatic results because of outliers: unusual cases in the data that deviate significantly from the majority of the dataset. We may want to remove the biggest outliers and see which conclusions still hold for the rest of the data. For instance, our Altoona crime data may contain a few examples where crime was significantly higher than usual. In this section we remove those examples and see which conclusions hold for Altoona’s crimes without them.
EXPLANATION
We normalize the data because the next step uses a distance-based operator to detect outliers. You should always normalize before using distance-based algorithms: normalization puts all attributes on the same scale, so that no single attribute dominates the distance just because its values are numerically larger. In the case of Z-transformation, each attribute is rescaled to have a mean of 0 and a standard deviation of 1.
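As a rough illustration of the normalization step outside the tool, the sketch below applies a Z-transformation to two attributes that live on very different scales. The attribute names and values are hypothetical, and the use of the population standard deviation is an assumption; the operator used in this Lab may compute it slightly differently.

```python
from statistics import mean, pstdev


def z_transform(column):
    """Rescale one numeric attribute to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant attribute carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical attributes on very different scales
population = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0]
crime_rate = [0.01, 0.02, 0.03, 0.04, 0.05]

population_z = z_transform(population)
crime_rate_z = z_transform(crime_rate)

print(population_z)
print(crime_rate_z)
```

After the transformation both attributes contain essentially the same values, so neither one outweighs the other when distances are computed.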
2/4
Detect the outliers.
ACTIVITY
EXPLANATION
The Detect Outliers (Distances) operator will identify the 10 examples which are farthest away from all others and mark them as outliers. It creates a new column named outlier, with true as the value for the 10 outliers and false for all other examples.
As the operator’s name says, we are performing distance-based outlier detection. It calculates the Euclidean distance between examples.
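The idea of distance-based outlier detection can be sketched in a few lines. This is only an illustration under stated assumptions, not the operator's actual implementation: each example is scored by the Euclidean distance to its k-th nearest neighbour, and the `n_outliers` examples with the largest scores are flagged. The function names, the parameter `k`, and the sample points are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two examples with numeric attributes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_outliers(examples, n_outliers, k):
    """Return one boolean per example: True for the n_outliers examples
    whose k-th nearest neighbour is farthest away."""
    scores = []
    for i, a in enumerate(examples):
        # Distances from example i to every other example, smallest first
        dists = sorted(
            euclidean(a, b) for j, b in enumerate(examples) if j != i
        )
        scores.append(dists[k - 1])
    # Rank example indices by score, largest distance first
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [i in flagged for i in range(len(examples))]


# Hypothetical, already normalized examples: four close together, one far away
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outlier = detect_outliers(points, n_outliers=1, k=2)
print(outlier)
```

The returned list plays the role of the new outlier column: one true or false value per example.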
3/4
Filter out the outliers.
EXPLANATION
Here, we filter out the outliers by keeping only those examples for which the newly created outlier attribute is false.
The process might run for some time (because the Detect Outliers operator computes the distance between every pair of examples, across all attributes), but it will switch to the Results view automatically when it is finished. You will notice that the result is a data set with 2316 examples: the 10 outliers have been removed.
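The filtering step itself is simple, as the following minimal sketch shows. The rows and attribute names are hypothetical; the only thing carried over from the Lab is the boolean outlier attribute added by the detection step.

```python
# Hypothetical examples; "outlier" mimics the column added by the detection step
examples = [
    {"id": 1, "crimes": 12, "outlier": False},
    {"id": 2, "crimes": 15, "outlier": False},
    {"id": 3, "crimes": 240, "outlier": True},
    {"id": 4, "crimes": 11, "outlier": False},
]


def remove_outliers(rows):
    """Keep only the rows whose outlier attribute is false."""
    return [row for row in rows if not row["outlier"]]


kept = remove_outliers(examples)
removed = len(examples) - len(kept)
print(f"kept {len(kept)} of {len(examples)} examples, removed {removed}")
```

Comparing the number of examples before and after the filter is a quick check that exactly the flagged examples were dropped.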
4/4
Practice your outlier detection, and see its results on the model.
Congratulations! You successfully removed the 10 biggest outliers from the dataset. See the challenge questions below to practice your outlier detection a bit more, and to see its effect on the model you used earlier in this Lab.
CHALLENGE