Data Pre-Processing & Methodology

To generate the multi-bar charts shown in the Data Integration section, a function was written to extract every conference name from the 30 CSV files. The same conferences appear in the 10 defense, 10 rushing, and 10 receiving CSV files, so only one of those three groups needed to be looped over; defense was chosen arbitrarily. Each unique conference name was then appended to a data frame called conference_df. The full code is shown below in Figure 13.

Fig 13.
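
A minimal sketch of this conference-extraction step is shown below. The file pattern defense*.csv and the Team and Conference column names are assumptions for illustration, not the project's actual names.

```python
import glob

import pandas as pd

# Hypothetical sketch of the step in Figure 13: loop over the 10 defense
# CSV files and collect each team's conference once.
frames = []
for path in glob.glob("defense*.csv"):
    df = pd.read_csv(path)
    frames.append(df[["Team", "Conference"]])  # column names assumed

# Stack the files and keep one row per unique team/conference pair.
conference_df = (
    pd.concat(frames, ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
```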

At this point, a list of conferences and teams has been generated. Next, to bring in avg_rush and the other features, the pandas merge() and groupby() functions were used to perform a left join. A left join keeps every row of the first (left) data frame unchanged and attaches the matching columns from the second (right) data frame. This process was done three times to produce rushing, receiving, and defense data frames that each include a conference column. In other words, conference_df was left-joined to each of the three main data frames, and the results were then grouped by conference. This is shown below in Figure 13.1.

Fig 13.1.
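
A sketch of the join and grouping might look like the following. It assumes each statistics data frame has a Team column and that per-conference means are the values being charted; the same pattern repeats for the receiving and defense frames.

```python
# Hypothetical sketch of the merge/groupby step in Figure 13.1.
rushing_by_conf = (
    rushing_df.merge(conference_df, on="Team", how="left")  # left join keeps every rushing row
    .groupby("Conference")
    .mean(numeric_only=True)                                # aggregate team rows to conference level
)
```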

Next, the data is ready to be visualized as a multi-bar chart. The code in Figure 13.2 is only a few lines, but line 3 is critical: it sets the dimensions of the chart (if those numbers are not tuned correctly, the graph does not render legibly) and generates the multi-bar chart through Matplotlib from the data frame it is called on, i.e., the object that precedes the .plot() method. The resulting multi-bar charts are shown above in the Data Integration section.

Fig 13.2.
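
A sketch of that plotting call, assuming the per-conference data frame from the previous step, is shown below. The figsize argument plays the role of the dimension-setting values described above.

```python
import matplotlib.pyplot as plt

# Hypothetical sketch of the call in Figure 13.2: the data frame before
# .plot() supplies the grouped bars, and figsize sets the chart dimensions.
ax = rushing_by_conf.plot(kind="bar", figsize=(12, 6))
ax.set_xlabel("Conference")
plt.tight_layout()
plt.show()
```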

Once the exploratory data analysis (EDA) was complete, model building began. Three models were built: a null model, an XGBoost regression model, and a multi-variable linear regression model, chosen for their necessity, reliability, and flexibility. The null model is necessary to confirm that the more sophisticated models outperform a naive baseline. The code for developing the null model is shown in Figure 14. The idea behind the null model in this case is to predict that every team will score the average number of season TDs, which was 44 in 2018. A vector of "predictions" is therefore generated that contains only the value 44. The chosen metrics are then run to compare the null model predictions to the ground truth, and the errors are output. The errors are also shown in Figure 14.

Fig 14.
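
A minimal sketch of the null model follows. The master_df name and the TD column are assumptions; 44 is the 2018 average season TD total quoted above.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical sketch of the null model in Figure 14.
y_true = master_df["TD"].to_numpy()    # ground-truth season TD totals (column name assumed)
y_null = np.full(y_true.shape, 44.0)   # every "prediction" is the 2018 average of 44

rmse = np.sqrt(mean_squared_error(y_true, y_null))
mape = mean_absolute_percentage_error(y_true, y_null)
print(f"Null model  R2={r2_score(y_true, y_null):.3f}  RMSE={rmse:.2f}  MAPE={mape:.3f}")
```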

After the null model was established, the XGBoost regression model was constructed. The code details are shown in Figure 15.3. The xgboost Python library does most of the work, but properly formatted data needs to be passed into the model. From the corr() function, the features most strongly correlated with total TDs were identified as Yds_rec, Avg_rec, Yds_def, Rec, Yds_rush, Avg_def, and Gain, in that order. The master dataset was therefore pruned to include only those variables, each of which had a positive correlation of 0.2 or greater; this pruned master data frame would also be used later for the linear regression. Figure 15.1 shows the code used to prune the dataset, which uses the pandas double-square-bracket syntax. Next, the X and y datasets had to be built from the pruned master data frame and coerced into NumPy arrays for ingestion by xgboost. These steps are performed in Figure 15.2. The results of the XGBoost model are shown in the Results section.

Fig 15.1
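
A sketch of the pruning step, using the double-square-bracket syntax and the seven correlated features named above plus the TD target, might look like this (master_df is an assumed name):

```python
# Hypothetical sketch of the pruning step in Figure 15.1.
pruned_df = master_df[["Yds_rec", "Avg_rec", "Yds_def", "Rec",
                       "Yds_rush", "Avg_def", "Gain", "TD"]]
```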

Fig 15.2
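
A sketch of building X and y and coercing them to NumPy arrays follows. The test size and random_state are illustrative assumptions, not the project's actual values.

```python
from sklearn.model_selection import train_test_split

# Hypothetical sketch of the steps in Figure 15.2.
X = pruned_df.drop(columns=["TD"]).to_numpy()  # seven predictor columns as a NumPy array
y = pruned_df["TD"].to_numpy()                 # TD target as a NumPy array

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```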

Fig 15.3
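
A sketch of the XGBoost model itself is shown below; the hyperparameters are illustrative defaults, not the values actually used in the project.

```python
from xgboost import XGBRegressor

# Hypothetical sketch of the model fit in Figure 15.3.
xgb_model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
```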

After the XGBoost model, the multi-variable linear regression model was developed. First, the TD column, the target variable, was dropped from the dataset; the remaining columns were called X (the predictors), and the TD column itself was called y (the target). Then the sklearn train_test_split() and LinearRegression() functions were used. The train_test_split() call was the same as the one used for the XGBoost model. Figure 16.1 shows the code and workflow of the linear regression, and Figure 16.2 shows the code that runs the predictions. The actual error metrics are reported in the Results section.

Fig 16.1

Fig 16.2
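
A sketch of the linear regression fit and prediction, reusing the same train/test split built for the XGBoost model, might look like this:

```python
from sklearn.linear_model import LinearRegression

# Hypothetical sketch of Figures 16.1 and 16.2: fit the multi-variable
# linear regression, then generate the predictions scored in the Results section.
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
lin_preds = lin_model.predict(X_test)
```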

After the multi-variable linear regression model, the models were integrated with one another to improve performance. The XGBoost and linear regression predictions were combined by taking their minimum, maximum, and mean. In addition, Leave One Out Cross Validation (LOOCV) was used for the individual XGBoost and linear regression models as well as the combined models. In LOOCV, one row of data is set aside as the test case and a model is trained on all remaining rows; this training step is repeated for every row in the dataset. In this way, every row gets one turn being held out for testing, which gives a more robust estimate of how the model performs on new data. Figure 17.1 shows the code and steps for XGBoost LOOCV, and Figure 17.2 shows the code and steps for linear regression LOOCV.

Fig 17.1
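
One way to realize the XGBoost LOOCV, sketched under the assumption that scikit-learn's LeaveOneOut splitter and cross_val_predict stand in for the project's own loop, is shown below.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from xgboost import XGBRegressor

# Hypothetical sketch of XGBoost LOOCV (Figure 17.1): each row takes one
# turn as the single held-out test case while the model trains on the rest.
xgb_loo_preds = cross_val_predict(
    XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1),
    X, y, cv=LeaveOneOut(),
)
```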

Fig 17.2
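
The linear regression LOOCV follows the same pattern, and the two LOOCV prediction vectors can then be combined by the min, max, and mean methods described above. This is a sketch under the same assumptions as the previous block.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Hypothetical sketch of linear regression LOOCV (Figure 17.2) plus the
# min, max, and mean combinations of the two models' LOOCV predictions.
lin_loo_preds = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

combined_min = np.minimum(xgb_loo_preds, lin_loo_preds)
combined_max = np.maximum(xgb_loo_preds, lin_loo_preds)
combined_mean = (xgb_loo_preds + lin_loo_preds) / 2
```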

All models – null, XGBoost, multi-variable linear regression, and the integrated models – were scored and evaluated using R², Root Mean Squared Error (RMSE), and Mean Absolute Percent Error (MAPE). MAPE was particularly important because it handles outliers well, and the dataset contained many outliers, as shown in the Data Integration section. The final results and errors of the models are laid out in the Results section.
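
A small scoring helper applying these three metrics to any model's prediction vector could look like the sketch below; the actual scores are reported in the Results section.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

# Hypothetical helper that scores a prediction vector with R2, RMSE, and MAPE.
def score(y_true, y_pred, label):
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mape = mean_absolute_percentage_error(y_true, y_pred)
    print(f"{label}: R2={r2_score(y_true, y_pred):.3f}  RMSE={rmse:.2f}  MAPE={mape:.3f}")

score(y, xgb_loo_preds, "XGBoost LOOCV")
score(y, combined_mean, "Mean ensemble LOOCV")
```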