Artificial Intelligence | Lesson 4.8

Comparing Workflows

Workflow of a Machine Learning Project

Machine learning algorithms can learn input-to-output, or A-to-B, mappings. The question now is how to build a machine learning project. In this part, you will learn the machine learning project workflow. As a running example, we use Square. Earlier, we talked about how they defined the problem.

Now, let us go through the key steps of this machine learning project. Suppose you want to build an AI or machine learning system that figures out which users are able to pay back a loan. After problem definition, the first step is to collect data: you gather the data your project needs. We talked about data collection extensively in Section 3. In the case of Square, they were already collecting data for transaction processing before they defined the AI project.

With the data collected, step two is to train the model. This means you use a machine learning algorithm to learn an input-to-output mapping. In the case of Square, the input is the information they have collected about the users, and the output is whether or not those users were able to repay a loan in the past. Whenever an AI team starts to train a model, meaning to learn the input-output mapping, the first attempt often does not work well. So the team almost always needs to try many times; in AI, we call this iterating. You iterate until, hopefully, the model is good enough, meaning it passes your acceptance threshold.

The third step is to deploy the model. That means you put your machine learning model into action, though only on a small group of test users.
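The train-and-iterate step can be sketched in a few lines of Python. This is only a toy illustration, not Square's actual method: the data, the single "sales volume" feature, the cutoff rule, and the 0.85 acceptance threshold are all assumptions made up for the example. Each attempt "trains" by choosing the best cutoff from a grid, and later attempts use a finer grid.

```python
ACCEPTANCE_THRESHOLD = 0.85  # assumed target accuracy; the lesson gives no number


def make_toy_rows():
    """Hypothetical data: (monthly_sales_volume, repaid_past_loan) pairs."""
    rows = []
    for volume in range(100):
        repaid = volume >= 40
        if volume % 20 == 7:  # flip a few labels to mimic noisy real-world data
            repaid = not repaid
        rows.append((volume, repaid))
    return rows


def accuracy(cutoff, rows):
    """Share of rows where the rule 'volume >= cutoff means repaid' is correct."""
    correct = sum((volume >= cutoff) == repaid for volume, repaid in rows)
    return correct / len(rows)


def train(rows, step):
    """'Training' here just picks the best cutoff from a grid with the given step."""
    candidates = range(0, 101, step)
    return max(candidates, key=lambda cutoff: accuracy(cutoff, rows))


def iterate_until_good_enough(train_rows, validation_rows, steps=(50, 25, 10, 5)):
    """Try coarse settings first, then finer ones, until the threshold is met."""
    for attempt, step in enumerate(steps, start=1):
        cutoff = train(train_rows, step)
        score = accuracy(cutoff, validation_rows)
        print(f"attempt {attempt}: cutoff={cutoff}, validation accuracy={score:.2f}")
        if score >= ACCEPTANCE_THRESHOLD:
            return cutoff, score, attempt
    return None  # no attempt passed; go back and collect more data


rows = make_toy_rows()
train_rows = rows[0::2]       # every other row is used for training
validation_rows = rows[1::2]  # the remaining rows are held out for checking
result = iterate_until_good_enough(train_rows, validation_rows)
```

On this toy data the first two attempts score 0.84 and fall short of the threshold, and the third attempt passes with 0.90. Real projects iterate over much richer choices (features, algorithms, more data), but the loop has the same shape.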

What happens in many AI products is that once you deploy the model, it starts receiving new data and may not work as well as you had initially hoped. For example, in the Square case, suppose a Business-to-Business (B2B) company with a limited number of transactions, each on the scale of millions of dollars, applies for a loan. But say you had trained your machine learning system on Business-to-Consumer (B2C) companies with a large number of small transactions. You may realize the system is not working well on B2B companies. When that happens, hopefully you can get data back from cases such as these B2B companies and use it to maintain and update the model, so that it becomes as accurate as you were hoping.
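One simple way to notice this kind of problem after deployment is to track accuracy separately for each type of customer. The sketch below assumes a hypothetical log of (segment, predicted, actual) records and an assumed 0.75 alert threshold; neither comes from the Square example itself.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, predicted_repay, actually_repaid)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for segment, predicted, actual in records:
        totals[segment] += 1
        correct[segment] += int(predicted == actual)
    return {segment: correct[segment] / totals[segment] for segment in totals}


def segments_needing_attention(records, threshold):
    """Segments whose accuracy is below the threshold, in alphabetical order."""
    scores = accuracy_by_segment(records)
    return sorted(seg for seg, score in scores.items() if score < threshold)


# Hypothetical post-deployment log, invented for illustration.
deployment_log = [
    ("B2C", True, True), ("B2C", False, False), ("B2C", True, True),
    ("B2C", True, False), ("B2C", False, False),
    ("B2B", False, True), ("B2B", False, True), ("B2B", True, True),
    ("B2B", False, False),
]

scores = accuracy_by_segment(deployment_log)
flagged = segments_needing_attention(deployment_log, threshold=0.75)
print(scores)
print(flagged)
```

Here the B2C segment scores 0.8 while B2B scores 0.5, so B2B is flagged as the segment where more data should be collected before updating the model.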

So, to summarize, the key steps of a machine learning project are:

(a) collecting data,

(b) training the model, and

(c) deploying the model.

Throughout these steps, there is often a lot of iteration: fine-tuning or adapting the model so that it works better, or getting data back after deployment and using it to improve the product.

Refer to Figure 4.6 caption.

Figure 4.6 | Machine Learning life cycle. It starts with collecting data and then training the model. During training, you might feel the need to go back and collect more data, or you may successfully train a model that passes your required threshold. Then you deploy the model. In this stage, you get real-time feedback that may require updating your model or going back to data collection and repeating the cycle.

Workflow of a Data Science Project

Data science projects have a different workflow from machine learning projects. Unlike a machine learning project, the output of a data science project is often a set of actionable insights, that is, insights that may cause you to do things differently.

As our running example, let us say you want to optimize your fundraising website. Say you run a charity website that collects money for improving education quality. There is a sequence of steps your donors usually follow. First, they visit your website and look at the projects you have to offer. Eventually, they go to a project page, add the project to their shopping cart, go to the shopping cart page, and finally check out. If you want to maximize the money collected, you must ensure that as many people as possible get through all these steps. How can you use data science to help with this problem?
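To make the sequence of steps concrete, you can think of it as a funnel and measure how many visitors survive from each step to the next. The sketch below uses assumed step names and made-up visitor counts purely for illustration.

```python
FUNNEL_STEPS = ["home", "project_page", "cart", "checkout"]  # assumed step names


def funnel_report(visitors_per_step):
    """Return (step, visitors, share_of_previous_step) for each step in order."""
    report = []
    previous = None
    for step in FUNNEL_STEPS:
        count = visitors_per_step[step]
        if previous is None:
            rate = 1.0  # the first step has nothing to compare against
        else:
            rate = count / previous if previous else 0.0
        report.append((step, count, rate))
        previous = count
    return report


def biggest_drop_off(report):
    """Step that keeps the smallest share of visitors from the step before it."""
    return min(report[1:], key=lambda row: row[2])[0]


# Made-up counts of visitors who reached each step.
counts = {"home": 1000, "project_page": 400, "cart": 100, "checkout": 80}
report = funnel_report(counts)
worst_step = biggest_drop_off(report)

for step, count, rate in report:
    print(f"{step:>12}: {count:5d} visitors ({rate:.0%} of previous step)")
print("Largest drop-off at:", worst_step)
```

With these numbers, only 25% of visitors who view a project page go on to the cart, which makes that step the first place to look for improvements.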

Let’s look at the key steps of a data science project. The first step is to collect data. On a website, you may have a data set that stores when different users go to different web pages. In this simple example, we assume you can figure out which country each user is coming from, for example by looking at their computer’s network address, called an IP address. In practice, you can usually get quite a bit more data about users than just what country they are from.

The second step is to analyze the data. Your data science team may have a lot of ideas about what is affecting your charity’s performance. For example, they may think that overseas donors are more likely to donate if the projects are broken into several sub-projects. If that is true, you might consider whether to put time and effort into breaking down the projects. Or your data science team may think there are blips in the data whenever there is a holiday. Maybe more people donate around the holidays because they feel more generous, or maybe fewer people donate because they are staying home rather than donating from their work computers. There may also be time-of-day blips: in countries that observe a siesta, an afternoon rest, there may be fewer donors online during that time, so your donations may go down. The team may then suggest that you spend fewer advertising dollars during the siesta period because fewer people will go online to donate at that time.
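A first pass at this kind of analysis can be as simple as counting visits per hour for each country. The sketch below assumes a hypothetical log in which each row holds a timestamp in the visitor's local time and a country code; the rows are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log rows: (local ISO timestamp, country code).
visit_log = [
    ("2024-03-04T10:15:00", "ES"),
    ("2024-03-04T10:40:00", "ES"),
    ("2024-03-04T11:05:00", "ES"),
    ("2024-03-04T11:30:00", "ES"),
    ("2024-03-04T15:30:00", "ES"),
    ("2024-03-04T18:10:00", "ES"),
    ("2024-03-04T18:20:00", "ES"),
    ("2024-03-04T18:55:00", "ES"),
    ("2024-03-04T15:05:00", "US"),
    ("2024-03-04T15:45:00", "US"),
]


def hourly_counts(log, country):
    """Count visits per hour of day for a single country."""
    counts = Counter()
    for timestamp, row_country in log:
        if row_country == country:
            counts[datetime.fromisoformat(timestamp).hour] += 1
    return dict(counts)


def quietest_observed_hour(counts):
    """Hour with the fewest visits, among hours that appear in the log."""
    return min(counts, key=counts.get)


es_counts = hourly_counts(visit_log, "ES")
es_quietest = quietest_observed_hour(es_counts)
print(es_counts)
print("Quietest observed hour in ES:", es_quietest)
```

In this made-up log, the 15:00 hour is the quietest for "ES". A dip like that is only a starting point: the team would still need enough data, over enough days, before treating it as a real pattern.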

A good data science team may have many ideas and try many of them to get good insights. They will distill these insights down to a smaller number of hypotheses.

These hypotheses could be accepted or rejected. They can suggest actions, such as breaking projects down into several sub-projects rather than offering each as a single project. When you take some of these suggested actions and deploy the changes, you start getting new data back as users behave differently. The data science team continues to collect and analyze the new data periodically to see if they can develop new ways to improve performance further.
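One common way to decide whether a hypothesis like "sub-projects increase donations" holds up is to compare donation rates before and after the change with a two-proportion z-test. The lesson does not prescribe a specific test, so treat this as one possible sketch; the visitor counts and the 0.05 significance level are assumptions made for the example.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two_sided_p_value) for the difference between two rates."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


SIGNIFICANCE_LEVEL = 0.05  # assumed cut-off, a common convention

# Made-up numbers: donations out of visitors, before and after the change.
z, p_value = two_proportion_z_test(
    successes_a=50, total_a=1000,   # single large projects
    successes_b=80, total_b=1000,   # projects split into sub-projects
)
change_looks_real = p_value < SIGNIFICANCE_LEVEL
print(f"z = {z:.2f}, p-value = {p_value:.4f}, keep the change: {change_looks_real}")
```

With these invented numbers, the donation rate rises from 5% to 8% and the p-value is well below 0.05, so the data would support keeping the sub-project layout. With fewer visitors or a smaller difference, the same test could just as easily lead the team to reject the hypothesis.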

The key steps of a data science project are to collect the data, analyze the data, suggest hypotheses and actions, and then continue to get data back and reanalyze it periodically.

Refer to Figure 4.7 caption.

Figure 4.7 | Data Science life cycle. It starts with data collection. Next, you analyze the data. You may realize you need more data, or you may be able to get good insights. Management will then use these insights to make the necessary changes. You may reanalyze the data periodically to see if you can find new insights.