Artificial Intelligence | Lesson 3.4

Evaluating Data

Data quality matters regardless of how the data was acquired. The data might be mislabeled, might be missing feature values, or might not cover all the possible cases (bias), resulting in an inaccurate model.

The checklist presented next will help you ensure the quality of your dataset. It draws on what a reasonable data point should look like (your organization and your domain knowledge) and on the recurring errors in your area (your experience gathering data).

Confirm data format and ranges

You want to ensure that all responses are in the expected format (for example, text or numeric). Next, confirm that data values fall within a reasonable range. For example, on a scale that goes from 1 to 5, all responses should be within that range. As a domain expert, you also have a better sense of what a reasonable range for each feature is, so check that your dataset follows the trends you expect. Finally, search for and eliminate duplicate data (such as multiple forms completed in error by the same respondent or data accidentally entered twice).
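As a concrete sketch, here is what these checks might look like in pandas; the file name, column names, and the 1-to-5 scale below are illustrative assumptions, not part of any real dataset:

    import pandas as pd

    # Hypothetical survey data: a respondent ID and a 1-to-5 satisfaction score.
    df = pd.read_csv("survey_responses.csv")

    # Confirm the expected format: entries that cannot be parsed as
    # numbers become NaN here and can be reviewed separately.
    df["satisfaction"] = pd.to_numeric(df["satisfaction"], errors="coerce")

    # Flag values outside the expected 1-to-5 range for review.
    out_of_range = df[(df["satisfaction"] < 1) | (df["satisfaction"] > 5)]
    print(f"{len(out_of_range)} responses fall outside the 1-5 range")

    # Eliminate duplicate submissions from the same respondent.
    df = df.drop_duplicates(subset="respondent_id", keep="first")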

Check for data entry errors

Data entry errors happen regardless of how you design your process, so develop a plan to look for them systematically. For example, check for the error patterns you have observed before. Correct the errors you find, and if you discover a pattern of errors, check your entire dataset for that pattern and change your process so the pattern cannot recur in the data you plan to collect in the future.
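One way to do this systematically is to encode the error patterns you have seen before as rules and scan the whole dataset for them. A minimal sketch, where the placeholder strings stand in for whatever patterns your own experience has surfaced:

    import pandas as pd

    # Known error patterns from past rounds of data entry (hypothetical):
    # free-text placeholders typed into fields that should be numeric.
    KNOWN_BAD_VALUES = {"n/a", "na", "none", "-", "?"}

    raw = pd.read_csv("survey_responses.csv", dtype=str)

    for column in raw.columns:
        bad = raw[column].str.strip().str.lower().isin(KNOWN_BAD_VALUES)
        if bad.any():
            print(f"{column}: {bad.sum()} entries match a known error pattern")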

Check for other inconsistencies

Ensure that your dataset contains only data from participants you want to analyze. For example, if the topic of a survey is a particular set of services and a respondent did not receive those services, you should remove that respondent’s survey from the dataset. Sometimes there is inconsistency within a respondent’s data. For example, it would be inconsistent if a teacher indicated five years of teaching experience but seven years of teaching in her current school. If you discover such an inconsistency, return to the original forms to check for data entry errors. If the inconsistency is on the original form and there is no way to resolve it, you may have to eliminate all of the inconsistent data from that form. Also, whenever you find such inconsistencies, redesign your data collection process to prevent them (for example, add validation rules that reject such submissions if you are collecting data online).
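The teacher example can be expressed as a simple cross-field rule. A minimal sketch, with hypothetical file and column names:

    import pandas as pd

    df = pd.read_csv("teacher_survey.csv")

    # A respondent cannot have taught at their current school for longer
    # than they have taught in total.
    inconsistent = df[df["years_at_current_school"] > df["years_teaching_total"]]

    # Flag these rows for a manual check against the original forms.
    print(inconsistent[["respondent_id", "years_teaching_total",
                        "years_at_current_school"]])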

Investigate missing data

Check back against the original data to ensure that missing data is not due to a data entry error. Then review each case and discuss with the AI team whether the amount of missing data is going to make your model inaccurate. For example, if more than half of the responses are missing from a survey, you may need to remove that survey from the analysis dataset. With survey data, you may be able to follow up and get responses from participants who skipped several questions. However, in many data-gathering scenarios, follow-up is not an option.
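The half-missing rule above could be sketched as follows; the 50% threshold is an example cutoff you would agree on with your AI team, and the file name is hypothetical:

    import pandas as pd

    df = pd.read_csv("survey_responses.csv")

    # Fraction of unanswered questions per survey (row).
    missing_fraction = df.isna().mean(axis=1)

    # Flag surveys with more than half the responses missing; follow up
    # with these respondents if possible, otherwise drop the rows.
    needs_followup = df[missing_fraction > 0.5]
    df = df[missing_fraction <= 0.5]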

Create a data management plan

Your plan should include steps for tracking data, entering data into a database, checking for errors, storing data, and, eventually, how and when to dispose of the data.

Document all the details of processes and decisions

It is critical to document all rules, processes, and decisions throughout the data management process. Documentation lets you find the root cause of whatever problems you face and update your processes to prevent those problems from recurring.

Develop a data tracking system

Before you begin to receive data, develop a data tracking system that allows you to track all of the data you receive, identify respondents who may need additional follow-up, and document missing data. Your tracking system might stand alone or be part of a larger database that will eventually contain all data for analysis.

How much data do I need?

You often see people falling into the trap of thinking that having a lot of data is going to unlock amazing opportunities. But as you will see in the coming sections, data quality is often more important than data quantity.

The amount of data that you need to build an AI product strongly depends on the product itself. So many variables are at play that giving precise rules like “you’ll need data from 10,523 customers to build a 93% accurate churn prediction model” is just not possible in practice. What we can give you are guidelines that can help develop your intuition about the data requirements for the most common types of problems in the business world.

Let us first talk about a project that requires structured data, such as predicting home prices (section 2). You need to consider three factors:

  • Whether the target (the thing you want to predict) is a number (regression) or a choice (classification)
  • For classification problems, the number of classes you are interested in
  • The number of features that affect the target

Let us start with features. An effective way to think about data requirements is to picture your dataset on a single screen: you would want it to look very thin and tall, as shown in Figure 3.4. You want many more rows than columns, because rows are examples, while columns are the features the model learns from. Intuitively, the more information a model has to learn from (the more features), the more examples it needs to see in order to grasp how the features influence the target (which means more rows).
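As a quick illustration, this “thin and tall” intuition can be turned into a rough sanity check; the 10-to-1 ratio below is an arbitrary illustrative threshold, and the file name is hypothetical:

    import pandas as pd

    df = pd.read_csv("training_data.csv")  # hypothetical dataset
    n_rows, n_cols = df.shape

    # Heuristic only: you generally want far more examples than features.
    if n_rows < 10 * n_cols:
        print(f"Only {n_rows} rows for {n_cols} columns; the dataset may be "
              "too 'short and wide' to train a reliable model.")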

If you do not have enough examples and have too many features, some of the columns might even be useless or misleading! Take a home-price dataset like the one in section 2: adding a column with the zodiac sign of the seller is unlikely to improve the accuracy of price predictions. However, ML models cannot draw commonsense conclusions a priori; they need to figure them out from the data. As long as you have enough rows (examples), most families of models are indeed able to do so. However, if you have too little training data, the model will still try its best to estimate how the zodiac sign affects the price.

As an extreme example, imagine that the only $1 million home in the dataset was sold by a Gemini. Surely, this does not mean that price predictions for homes sold by Geminis should be higher than those for homes sold by Aries. Most models will be able to avoid this mistake if the dataset contains many million-dollar homes (a dataset with a lot of examples), because the sellers will have many different zodiac signs and the sign’s (lack of) effect can be estimated correctly. However, if you do not have many examples, the model could “think” that the zodiac sign is the driver of the house’s value.


Figure 3.4 | The dataset on the left, which has many examples and a few features (tall and thin), is a good dataset for ML. The one on the right, which has a lot of features and a few examples (short and wide), is not a good dataset for ML.
Attribution: Zero to AI, Figure 8.4. Nicolò Valigi and Gianluca Mauro. All rights reserved.

Let us now go into the specifics of classification and regression problems. In a classification problem, you are trying to predict which of two or more classes an example belongs to.

For example, the loan-eligibility algorithm of Square in section 2 assigned customers to one of two classes: eligible for a loan or not eligible.

Assuming you have a modest number of features (say 10), you should budget at least 1,000 examples for each class that you have in your problem. For example, in the loan model that has only two classes (able versus unable to pay back), you might want to plan for at least 2,000 examples. Intuitively, the more classes the model has to deal with, the more examples it will need to see in order to learn how to distinguish all the classes. Keep in mind that this is just an approximation, and your AI team will have a better understanding of what this number should be.
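As a sketch, you could check this rule of thumb directly against a labeled dataset. The file name and the repaid column are hypothetical, and the 1,000-per-class figure is just the approximation discussed above:

    import pandas as pd

    df = pd.read_csv("loan_history.csv")  # hypothetical labeled dataset

    # Rough rule of thumb: at least 1,000 examples per class,
    # assuming a modest number of features.
    class_counts = df["repaid"].value_counts()
    for label, count in class_counts.items():
        if count < 1000:
            print(f"Class {label!r} has only {count} examples; consider "
                  "collecting more data for it.")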

It is harder to give similar rules of thumb for regression models because they can model more-complex scenarios and phenomena. Many regression models are based on time-series data, a special kind of data that describes how measurements or numbers evolve over time. An example you may be more familiar with is financial data (for example, the price of stocks). You can see what time-series data looks like in Figure 3.5. In time-series data, the number of data points you collect is not as important as the time span over which you collect them.

Suppose you are running a donation-based charity and you want to estimate your budget for the next year. Assume you have data for every day from 2012 to 2019: seven years, or roughly 365 × 7 = 2,555 daily data points. Suppose your AI team has developed an accurate regression model, you are happy with the results, and you want to plan your budget accordingly.
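For illustration only, here is a minimal sketch of such a regression model, assuming a hypothetical donations.csv with one row per day; the model your AI team builds would likely be far more sophisticated:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical file: one row per day with "date" and "daily_donations".
    df = pd.read_csv("donations.csv", parse_dates=["date"])

    # Two simple features: a linear trend and day-of-year seasonality.
    df["day_index"] = (df["date"] - df["date"].min()).dt.days
    df["day_of_year"] = df["date"].dt.dayofyear

    model = LinearRegression().fit(df[["day_index", "day_of_year"]],
                                   df["daily_donations"])

    # Extrapolate the same features one year ahead to estimate the budget.
    future_dates = pd.date_range(df["date"].max() + pd.Timedelta(days=1),
                                 periods=365)
    future = pd.DataFrame({
        "day_index": (future_dates - df["date"].min()).days,
        "day_of_year": future_dates.dayofyear,
    })
    budget_estimate = model.predict(future).sum()
    print(f"Estimated donations next year: {budget_estimate:,.0f}")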

But can you use this model to predict donations in 2020? The model does not know that COVID caused a lockdown; because it was trained only on pre-lockdown data, it cannot reflect the new situation. A sudden event like this requires gathering new data and updating your model. Again, consult with your AI team on when and how to collect new data and refine the model in light of recent events.


Figure 3.5 | Time-series data looks like a stream of measurements taken over time (for example, temperatures). Values to the right of the plot are newer than those on the left.
Attribution: Zero to AI, Figure 8.5. Nicolò Valigi and Gianluca Mauro. All rights reserved.

Now that you have guidelines about how much data you should be thinking about collecting, keep in mind that not all data points are equally important when training models. Adding especially bad examples might even backfire and reduce the model’s overall accuracy. Therefore, understanding your data is an essential step in using AI.