RapidMiner Module 8: Handling Missing Values

1/4

Check the metadata for attributes with missing values.

Perform data cleansing to achieve higher data quality.

ACTIVITY

1. Drag the newly stored Altoona Crime Rates by Sex dataset into the Process.
2. Click on the newly created Retrieve Altoona Crime Rates by Sex operator and hover the mouse over its output port. Wait for a small window to pop up and display some metadata about the dataset, then press F3 to pop out the window. Note that the metadata table says Sex F has 3 missing values.
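
For readers who also work outside RapidMiner, the same metadata check can be sketched in Python with pandas. This is only an illustration: the table below is a hypothetical stand-in with invented values, not the actual Altoona dataset, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; all values are invented for illustration.
crimes = pd.DataFrame({
    "Offense Code": ["A", "B", "C", "D", "E"],
    "Sex F": [12.0, None, None, 30.0, None],
    "Sex M": [24.0, 36.0, 18.0, 60.0, 48.0],
})

# Count the missing values in each attribute (column).
missing_per_attribute = crimes.isna().sum()
print(missing_per_attribute)
```

In this made-up table, only Sex F has missing values, which mirrors what the metadata window reports.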

EXPLANATION

Missing values can be a problem because they distort the computer’s data analysis. In our case, we know that missing values mean there were 0 months in the dataset for a given crime type committed by female offenders. But the computer does not know that. Hence, when calculating the average number of months for all crimes with female offenders, the computer excludes these crime types with missing values from the calculation. As a result, the average is incorrect. We can fix this by replacing missing values with zeros.
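
The effect on the average can be seen in a small sketch. It uses Python with pandas rather than RapidMiner, and the numbers are invented, so treat it as an illustration of the idea and not as the module's data. Like the behavior described above, pandas skips missing values when computing a mean.

```python
import pandas as pd

# Invented values for illustration; None marks a missing value.
sex_f = pd.Series([12.0, None, None, 30.0, None])

# The default mean skips missing values, so only two entries are averaged.
mean_skipping_missing = sex_f.mean()

# Treating missing values as zeros averages over all five entries.
mean_with_zeros = sex_f.fillna(0).mean()

print(mean_skipping_missing, mean_with_zeros)
```

Comparing the two printed values shows how much the treatment of missing values can change the result.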

2/4

Replace missing values.

ACTIVITY

1. Add the Replace Missing Values operator. Connect it.
2. Click on the Replace Missing Values operator, then in the Parameters panel set attribute filter to single, attribute to Sex F, and default to zero.
3. Click Run to execute the process.

EXPLANATION

Step 2 says “replace missing values for a single attribute called Sex F with zeros”.

In the Results view we see that the missing values for Sex F have indeed been replaced by zeros. To verify, sort the Sex F column in ascending order: the three formerly missing values now appear as zeros.
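
As a rough equivalent outside RapidMiner, the replacement for a single attribute can be sketched in Python with pandas. The table is a hypothetical stand-in with invented values, and `fillna` is the pandas counterpart assumed here, not something the operator itself uses.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; all values are invented.
crimes = pd.DataFrame({
    "Offense Code": ["A", "B", "C", "D", "E"],
    "Sex F": [12.0, None, None, 30.0, None],
    "Sex M": [24.0, 36.0, 18.0, 60.0, 48.0],
})

# Replace missing values in the single attribute Sex F with zero.
crimes["Sex F"] = crimes["Sex F"].fillna(0)

# Sort ascending on Sex F to check the result, as in the Results view.
print(crimes.sort_values("Sex F"))
```

Note that this pandas assignment leaves the column order unchanged.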

However, the Replace Missing Values operator has also changed the order of our columns: the ones affected by the operator (in this case, just the Sex F column) have been moved to the beginning of the table. We would like to restore the original order, in which the Offense Code column came first. We do that next with the Reorder Attributes operator.

3/4

Return attributes to their original order in the table.

ACTIVITY

1. Return to Design view, and disconnect the Replace Missing Values operator from the “res” port.
2. Add the Reorder Attributes operator. Connect it.
3. Click on the Reorder Attributes operator, then in the Parameters panel set sort mode to user specified, and attribute ordering to Offense Code, Sex F, Sex M (in that order).
4. Click Run to execute the process.

EXPLANATION

In step 3 above the user specifies that the order of attributes should be Offense Code, Sex F, Sex M.
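
A user-specified ordering can be sketched in Python with pandas as follows. The table is hypothetical and is built with Sex F first to mimic the reordered result; the values and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical table with Sex F first, mimicking the column order after replacement.
cleaned = pd.DataFrame({
    "Sex F": [12.0, 0.0, 0.0, 30.0, 0.0],
    "Offense Code": ["A", "B", "C", "D", "E"],
    "Sex M": [24.0, 36.0, 18.0, 60.0, 48.0],
})

# User-specified attribute ordering, in the desired sequence.
attribute_ordering = ["Offense Code", "Sex F", "Sex M"]

# Selecting the columns by this list returns them in that order.
reordered = cleaned[attribute_ordering]
print(reordered)
```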

4/4

Inspect the changes in data.

Congratulations! By handling the dataset’s missing values, you performed data cleansing, and thereby achieved higher data quality. Note: This approach of replacing missing values works when we know what those values should be. When we do not know them either, handling missing values can take the form of removing the rows (examples) or columns (attributes) that contain them.
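
The two removal options can be sketched in Python with pandas. As before, the table is a hypothetical stand-in with invented values, and `dropna` is the pandas call assumed for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the dataset; all values are invented.
crimes = pd.DataFrame({
    "Offense Code": ["A", "B", "C", "D", "E"],
    "Sex F": [12.0, None, None, 30.0, None],
    "Sex M": [24.0, 36.0, 18.0, 60.0, 48.0],
})

# Option 1: remove the rows (examples) that contain a missing value.
rows_removed = crimes.dropna(axis=0)

# Option 2: remove the columns (attributes) that contain a missing value.
columns_removed = crimes.dropna(axis=1)

print(rows_removed)
print(columns_removed)
```

Either option discards data, so the choice depends on whether the affected examples or the affected attribute matter more for the analysis.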

CHALLENGE

1. One of the reasons we handled missing values was that, with them present, the computer was calculating the wrong average number of months for all crimes with female offenders. That average was found to be around 30 (30.719) in Module 7. What is the actual average now, with no missing values? Was the original average underestimated or overestimated?
2. Update the process design so that missing values are replaced by the average rather than by zero. What would be the answer to question 1 in this case?