Artificial Intelligence | Lesson 1.4
Data
We have discussed why data is essential to a high-performing AI system. But what exactly is data? Watch the video below to see what data is and how it affects ML and AI.
How is Data Prepared for Machine Learning? by AltexSoft
Now let us look at an example of a table of data, also called a dataset. Suppose your goal is to develop an AI model that estimates house prices in an area. You need a dataset like Table 2, which you might keep as an Excel sheet. To estimate house prices, you first decide what the price of a house depends on. You might start with the size of the house as the input and the price as the output; the AI model then learns the relation between input and output. In that case, your dataset has two columns: the input (the size of the house) and the output (the price). Size alone, however, cannot predict the price of a house. So you add another column, the number of bedrooms, and the dataset now has two inputs and one output.
Given that table of data, it is up to you to determine the inputs and outputs in line with your use case. As another use case, you may have a certain budget and want to decide what size of house you can afford. In that case, the input (A) is how much someone is willing to spend, and the output (B) is the size of the house in square feet. This is a completely different choice of input and output, one that tells you what size of house to look for given a certain budget.
| Row Number | Size of House | Number of Bedrooms | Number of Bathrooms | Price |
|---|---|---|---|---|
| 1 | 1500sqft | 3 | 1.5 | $150k |
| 2 | 2000sqft | 2 | 2 | $170k |
| … | … | … | … | … |
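The choice of inputs and outputs described above can be sketched in a few lines of Python. This is only an illustration: the two rows are copied from the table, while the column keys (`size`, `bedrooms`, `bathrooms`, `price`) and the helper functions `parse_size`, `parse_price`, and `split_inputs_outputs` are names made up for this example, not part of any library.

```python
# Rows copied from the table above; the "…" row is left out.
dataset = [
    {"size": "1500sqft", "bedrooms": 3, "bathrooms": 1.5, "price": "$150k"},
    {"size": "2000sqft", "bedrooms": 2, "bathrooms": 2, "price": "$170k"},
]


def parse_size(text):
    """Turn a string like '1500sqft' into the number 1500 (square feet)."""
    return int(text.replace("sqft", ""))


def parse_price(text):
    """Turn a string like '$150k' into the number 150000 (dollars)."""
    return int(text.strip("$k")) * 1000


def split_inputs_outputs(rows, input_cols, output_col):
    """Pick which columns act as inputs and which column is the output."""
    inputs = [[row[col] for col in input_cols] for row in rows]
    outputs = [row[output_col] for row in rows]
    return inputs, outputs


# Convert the text values into plain numbers first.
numeric_rows = [
    {
        "size": parse_size(row["size"]),
        "bedrooms": row["bedrooms"],
        "bathrooms": row["bathrooms"],
        "price": parse_price(row["price"]),
    }
    for row in dataset
]

# Use case 1: estimate the price from the size and the number of bedrooms.
price_inputs, price_outputs = split_inputs_outputs(
    numeric_rows, ["size", "bedrooms"], "price"
)

# Use case 2: given a budget, estimate the size of house you can afford.
size_inputs, size_outputs = split_inputs_outputs(numeric_rows, ["price"], "size")

print(price_inputs, price_outputs)
print(size_inputs, size_outputs)
```

The same table serves both use cases; only the roles assigned to the columns change.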
How to get Data?
We now know what data is and why it is important. But how do you acquire it? There are several methods. One way to get data is manual labeling. For example, in our house price predictor, the real estate agency naturally records the houses’ features and selling prices for legal reasons. The same data can be used for the AI model. If you are interested in learning how to obtain publicly available datasets, this topic is covered in detail in the POLICY short course, within the LEAP for PIT series.
Download from Websites
The most common method of acquiring data is downloading it from the internet. Many websites freely share datasets, ranging from financial data to world health data to images of animals. If the data you need for your AI application is available online, you can download it, keeping licensing and copyright in mind. That is a great starting point for developing your AI application.
Acquiring data suitable for your AI project is a challenging process. Whether you collect your own data or use publicly available sources, there are quite a few things to consider. We briefly address some of them here. To become familiar with what a dataset looks like and how to approach data exploration, take a look at the video below, which uses a real-life dataset as a reference.
How to Do Data Exploration by Misra Turp
Data: From the Information Technology (IT) Department to the AI Team and Back
Depending on the size of an organization, there may be an individual, team, or department supporting Information Technology (IT). This IT team is charged with establishing, monitoring and maintaining information technology systems and services. The IT team may comprise data analysts, network architects, computer systems engineers, and hardware experts, among others. These professionals help maintain the organization’s digital infrastructure, including technological integration and security. The IT department is responsible for providing the infrastructure to enable the flow of information and technological automation to drive efficiency and functionality in the organization’s business, and resolve any technical issues that may arise.
However, collecting data is not enough for a successful AI application. Data is sometimes easy to misuse. Below we describe two of the most common mistakes. In many organizations, you might hear, “Hey, give me three years to build up my IT team; we are collecting data. Then after three years, you will have this perfect dataset, and then you can use AI.” This is not a great strategy. It is more beneficial to start showing the data to an AI team as soon as you have collected some, because the AI team can often give your IT team feedback on what types of data to collect and what types of IT infrastructure to keep building.
For instance, imagine you are collecting data on manufacturing machines to predict breakdown times and lower your maintenance costs. The AI team may look at your factory data and say, “Hey, can we collect data from this big manufacturing machine, not just once every ten minutes, but instead once every minute? If so, we could do a much better job building a preventative maintenance system for the organization.” There is often this back-and-forth interplay between IT and AI teams. Remember, the earlier you get feedback from your AI team while collecting data, the sooner you can adjust, and the better and more robust your IT infrastructure for collecting data will be.
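To get a feel for what the AI team is asking for in this example, the short calculation below compares how many readings one machine produces per day at the two sampling intervals. It is a back-of-the-envelope sketch; the function name `readings_per_day` is made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60


def readings_per_day(interval_minutes):
    """Number of readings one machine produces per day at a given interval."""
    return MINUTES_PER_DAY // interval_minutes


every_ten_minutes = readings_per_day(10)
every_minute = readings_per_day(1)

print(every_ten_minutes)                  # 144
print(every_minute)                       # 1440
print(every_minute // every_ten_minutes)  # 10
```

Sampling every minute gives ten times as many readings per machine, so the IT infrastructure has to store and move ten times as much data. This is one reason the conversation between the two teams should happen early.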
Misuse of Data: Over-investing in Data
In many organizations, someone at the managerial level reads about the success of AI in other industries, or about how their competitors have used it, and then says, “Hey, we have so much data. Surely, an AI team can make it valuable.” Unfortunately, this is not always the case. Having more data is usually beneficial. However, just because you have gigabytes or terabytes of data does not mean an AI team will magically be able to generate value from it. In one extreme case, a company acquired a whole string of other companies in medicine, believing that their data would be very valuable. After a couple of years, the engineers have still not found a way to generate value from all this data. Be careful not to over-invest in acquiring data for its own sake unless you also involve the AI team to guide you in defining the AI goal and in establishing which data is most valuable and can be used to build an AI model.
Bad Data: Incorrect Labels, Missing Values
Finally, data can be messy. You may have heard the phrase “garbage in, garbage out”. It means that if you have bad data, the AI will learn inaccurate things. For example, let us say you have a dataset of house sizes, numbers of bedrooms, and prices. You might have incorrect values in your dataset: a house is probably not going to sell for $0.001 or for just one dollar. Or you might have missing values, meaning some of the features are not available for all the examples in the dataset. Your AI team or AI support partner will need to figure out how to clean up the data or deal with these incorrect labels and missing values. They might also help you prevent these problems in future data collection. They have more experience in this area and might know innovative ways to change how you collect data.
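A first pass at spotting incorrect labels and missing values might look like the sketch below. It rests on several assumptions: the first two rows come from the house table, the last two rows are invented to show each kind of problem, and the threshold `MIN_PLAUSIBLE_PRICE` and the helper `find_problems` are hypothetical choices for this example, not a standard method.

```python
# The first two rows come from the house table. The last two are made up
# to illustrate an implausible price and a missing value.
rows = [
    {"size_sqft": 1500, "bedrooms": 3, "price_usd": 150_000},
    {"size_sqft": 2000, "bedrooms": 2, "price_usd": 170_000},
    {"size_sqft": 1800, "bedrooms": 3, "price_usd": 1},           # hypothetical bad label
    {"size_sqft": 1200, "bedrooms": None, "price_usd": 120_000},  # hypothetical missing value
]

# Assumed cut-off: anything cheaper is treated as a data-entry error.
MIN_PLAUSIBLE_PRICE = 1_000


def find_problems(row):
    """Return a list of human-readable problems found in one row."""
    problems = []
    for column, value in row.items():
        if value is None:
            problems.append(f"missing {column}")
    price = row.get("price_usd")
    if price is not None and price < MIN_PLAUSIBLE_PRICE:
        problems.append("implausible price")
    return problems


# Map each problematic row number (starting at 1) to its list of problems.
flagged = {}
for row_number, row in enumerate(rows, start=1):
    problems = find_problems(row)
    if problems:
        flagged[row_number] = problems

# Keep only the rows that passed every check.
clean_rows = [row for row in rows if not find_problems(row)]

print(flagged)
print(len(clean_rows))
```

Dropping flagged rows is only one option. An AI team might instead correct the label at its source or fill in a missing value, depending on how the data was collected.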