Results

Eight models were created in total (including the null model). To summarize all findings, four evaluation metrics were computed and stored for each model: R2 score, MSE, RMSE, and MAPE. This yielded 32 metric values in total, which were stored in a pandas DataFrame for comparison. This DataFrame can be viewed in Figure 18.
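The metric table can be sketched as follows. The helper below computes the four metrics with NumPy; the model names and prediction values are purely illustrative stand-ins for the real results:

```python
import numpy as np
import pandas as pd

def evaluate(y_true, y_pred):
    """Return the four metrics used throughout: R2, MSE, RMSE, and MAPE (%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return {"R2": r2, "MSE": mse, "RMSE": np.sqrt(mse), "MAPE": mape}

# Hypothetical season touchdown totals and predictions from two of the
# eight models (names and values are illustrative only).
y_true = [30, 45, 52, 38, 70]
metrics_df = pd.DataFrame({
    "Linear Regression": evaluate(y_true, [33, 44, 50, 40, 55]),
    "Min LR & XGBoost (LOOCV)": evaluate(y_true, [31, 46, 49, 39, 62]),
}).T  # one row per model, one column per metric
```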

Fig 18.

Based on R2 score, MSE, and RMSE, the standard linear regression model performed the best. Under normal circumstances, this would support the conclusion that a simple regression model was best for this application. However, MAPE is a more appropriate measure here because the data contain several outliers. Accounting for outliers was important when evaluating the models because there are few records overall: the train-test split left only 24 records in the test set, and although leave-one-out cross-validation allows every record to eventually serve as the "test" set, the master table itself contains only 116 records. A single outlier therefore greatly affects the overall evaluation metric score. The histogram in Figure 19 below shows that most teams fall within the range of 20-60 touchdowns per season, but some outlier teams recorded 70 or even 80 touchdowns in a season.
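The leave-one-out procedure described above can be sketched with scikit-learn's `LeaveOneOut` splitter. The 116-row feature matrix below is synthetic and stands in for the master table; each record takes a turn as the single-record test set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in for the 116-row master table (features are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(116, 4))
y = 40 + X @ np.array([5.0, 3.0, -2.0, 1.0]) + rng.normal(scale=4.0, size=116)

# LOOCV: fit 116 times, each time holding out exactly one record,
# so every record eventually appears in the "test" set.
loo_preds = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

mape = np.mean(np.abs((y - loo_preds) / y)) * 100
```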

Fig 19.

Hence, MAPE is used as the evaluation metric that best summarizes model performance for this scenario. Based on MAPE, Min LR & XGBoost with LOOCV performed the best of the eight models, with a MAPE of 21.3%. A MAPE under 20% is generally considered excellent and indicative of strong predictive performance; at 21.3%, the min-aggregation model combining linear regression and XGBoost with leave-one-out cross-validation fell just short of that threshold but still performed well overall.

Concerning the aggregate models, it was hypothesized that aggregating linear regression and XGBoost predictions would outperform either model alone. However, the max-aggregation model performed worse (26.03% MAPE) than the standard linear regression and XGBoost models with LOOCV (22.03% and 25.3%, respectively). Average and min aggregation both outperformed the XGBoost model alone, but min aggregation was the only aggregation method that improved on linear regression with LOOCV (21.3% vs. 22.03%). All percentages mentioned above are included in Figure 18.
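The three aggregation schemes reduce to element-wise minimum, maximum, and mean over the two base models' predictions. A minimal sketch, with entirely hypothetical prediction vectors:

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Illustrative LOOCV predictions from the two base models (values hypothetical).
y_true   = np.array([30.0, 45.0, 52.0, 38.0, 70.0])
pred_lr  = np.array([33.0, 42.0, 55.0, 40.0, 58.0])
pred_xgb = np.array([36.0, 47.0, 49.0, 44.0, 61.0])

# Each aggregation scheme is an element-wise reduction over the two predictions.
agg = {
    "min": np.minimum(pred_lr, pred_xgb),
    "max": np.maximum(pred_lr, pred_xgb),
    "avg": (pred_lr + pred_xgb) / 2,
}
scores = {name: mape(y_true, p) for name, p in agg.items()}
```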

Feature importance was also calculated for both the linear regression and XGBoost models. The two approaches assigned noticeably different importances to the same features. Figure 20 lists the feature importance rankings from the XGBoost models, and Figure 21 shows the corresponding rankings for the linear regression models.
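A sketch of how the two kinds of importances can be extracted. Here `GradientBoostingRegressor` stands in for XGBoost to keep the example self-contained, the data are synthetic, and coefficient magnitudes are used as the linear-regression importance proxy (an assumption, since the source does not state its exact method):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost
from sklearn.linear_model import LinearRegression

features = ["Rec", "Yds_rec", "Avg_rec"]
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(116, 3)), columns=features)
y = 40 + 6 * X["Rec"] + 3 * X["Yds_rec"] + rng.normal(scale=3.0, size=116)

# Gradient-boosted trees expose gain-based importances that sum to 1;
# for linear regression, coefficient magnitudes serve as an importance proxy.
tree_imp = pd.Series(
    GradientBoostingRegressor(random_state=0).fit(X, y).feature_importances_,
    index=features,
)
lr_imp = pd.Series(np.abs(LinearRegression().fit(X, y).coef_), index=features)
```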

Fig 20.

Fig 21.

As shown in Figure 20, Yds_rec and Avg_rec carried more importance in the XGBoost models. The XGBoost importances were also more evenly spread than those of the linear regression models shown in Figure 21. Rec, which had a feature importance of only around 0.11 in the XGBoost models, had by far the largest importance (1.4) in the linear regression models. The linear regression importances thus spanned a much wider range of values.

Beyond comparing all eight models by MAPE and examining feature importance, model performance also varied by the conference in which each NCAA team competes. Using the Min LR & XGBoost with LOOCV model, MAPE was calculated for each conference. The result is shown in Figure 22.
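The per-conference breakdown amounts to a pandas groupby over absolute percentage errors. A minimal sketch with hypothetical per-team values and conference labels:

```python
import pandas as pd

# Hypothetical per-team results; conference labels and values are illustrative.
df = pd.DataFrame({
    "conference": ["PAC12", "PAC12", "SEC", "SEC", "Independent", "Independent"],
    "actual":    [45.0, 52.0, 38.0, 61.0, 30.0, 24.0],
    "predicted": [42.0, 49.0, 41.0, 55.0, 39.0, 31.0],
})

# Mean absolute percentage error within each conference.
per_conf_mape = (
    df.assign(ape=lambda d: (d["actual"] - d["predicted"]).abs() / d["actual"] * 100)
      .groupby("conference")["ape"]
      .mean()
)
```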

Fig 22.

The resulting MAPE scores in Figure 22 reveal that some conferences were predicted with higher accuracy than others. For example, the Min LR & XGBoost with LOOCV model achieved a MAPE of around 14% for the PAC12 conference, while the Independent conference had a MAPE of over 30%. Depending on the conference, the Min LR & XGBoost with leave-one-out cross-validation model therefore performed roughly twice as well as it did for others.

Finally, to visualize the performance of the model with the lowest MAPE score, Figure 23 shows a scatter plot with predicted values on the x-axis and baseline values on the y-axis. A 45-degree line marks where the y-axis values equal the x-axis values, so points farther from this line represent less accurate predictions.
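A plot like Figure 23 can be produced along these lines; the predicted and baseline values here are simulated, not the project's actual results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
baseline = rng.uniform(20, 80, size=116)                # synthetic actual totals
predicted = baseline + rng.normal(scale=6.0, size=116)  # synthetic predictions

fig, ax = plt.subplots()
ax.scatter(predicted, baseline, alpha=0.6)
lims = [10, 95]
ax.plot(lims, lims, color="red")  # 45-degree y = x reference line
ax.set_xlabel("Predicted touchdowns")
ax.set_ylabel("Baseline (actual) touchdowns")
fig.savefig("pred_vs_actual.png")
```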

Fig 23.

To compare the distributions of predicted and baseline values, overlaid histograms were created to show how well the two distributions match. Figure 24 shows that the predicted values follow a more nearly normal distribution than the baseline values. This is also why MAPE was used in the earlier analysis: the baseline touchdown counts include outliers at the high end of the range. Hence, even though the Min LR & XGBoost with LOOCV model achieved the highest accuracy overall, it did not predict the outliers in the dataset well.
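The overlaid histograms can be produced as follows; both samples are simulated so that the baseline has high-end outliers while the predictions stay near the center of the range:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Baseline touchdown totals with a handful of high-end outliers (synthetic).
baseline = np.concatenate([rng.normal(40, 8, size=110), [70, 74, 76, 78, 80, 82]])
# Predictions cluster near the center of the range (synthetic).
predicted = rng.normal(42, 8, size=116)

fig, ax = plt.subplots()
bins = np.linspace(0, 100, 26)  # shared bins so the two histograms line up
counts_base, _, _ = ax.hist(baseline, bins=bins, alpha=0.5, label="Baseline")
counts_pred, _, _ = ax.hist(predicted, bins=bins, alpha=0.5, label="Predicted")
ax.set_xlabel("Touchdowns per season")
ax.legend()
fig.savefig("distributions.png")
```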

Fig 24.