Conclusions – Using Team Statistics from 2018 NCAA Football Data to Predict Number of Touchdowns: A DS 320 Final Project

The results show that the models, including the aggregated ones, predicted more accurately than the null model. This held for all of the XGBoost and Linear Regression models and for their variations and combinations. Accuracy scores generally improved regardless of which prediction method was used. Team statistics can therefore be used to predict the number of touchdowns a team will score in a season more accurately than a null model.

Accuracy was measured with the mean absolute percentage error (MAPE), which was judged the most suitable measure given the size of the data set and the presence of outliers. By this measure, the minimum aggregate of Linear Regression and XGBoost with Leave One Out Cross Validation had the best score of all the models. Most of the combined models beat the standard Linear Regression and XGBoost methods. However, the maximum aggregate of Linear Regression and XGBoost performed worse than both the standard XGBoost model and Linear Regression with Leave One Out Cross Validation. The average aggregate model also performed worse than Linear Regression with Leave One Out Cross Validation.
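The comparison against a null model can be illustrated with a minimal sketch. The touchdown totals and predictions below are hypothetical, and the null model is assumed here to predict the mean total for every team; the paper's own null model may be defined differently.

```python
from statistics import mean

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero, since each error is divided by it.
    """
    return 100.0 * mean(abs(a - p) / abs(a) for a, p in zip(actual, predicted))

# Hypothetical season touchdown totals and model predictions (illustration only).
actual = [40, 50, 60, 50]
model_pred = [44, 45, 60, 55]

# Assumed null model: predict the mean of the observed totals for every team.
null_pred = [mean(actual)] * len(actual)

model_mape = mape(actual, model_pred)
null_mape = mape(actual, null_pred)

print(f"model MAPE: {model_mape:.2f}%")
print(f"null MAPE:  {null_mape:.2f}%")
```

A lower MAPE means a better prediction, so a model beats the null model when its MAPE is the smaller of the two.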

MAPE scores also varied from conference to conference, with the Pac-12 having the best accuracy and the independent teams having the worst. This is plausible: independent teams play a wide variety of opponents rather than the largely fixed set that conference teams face, so differences in schedules and in the strengths and weaknesses of opposing teams make it harder to train an accurate predictor for them.
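A per-conference breakdown of this kind can be computed by grouping the percentage errors by conference before averaging. The sketch below uses made-up rows and a hypothetical helper name, mape_by_group; it is not the code used in this project.

```python
from collections import defaultdict

def mape_by_group(rows):
    """Return {group: MAPE in percent} for rows of (group, actual, predicted).

    Assumes every actual value is nonzero.
    """
    errors = defaultdict(list)
    for group, actual, predicted in rows:
        errors[group].append(abs(actual - predicted) / abs(actual))
    return {group: 100.0 * sum(errs) / len(errs) for group, errs in errors.items()}

# Hypothetical (conference, actual touchdowns, predicted touchdowns) rows.
rows = [
    ("Conference A", 50, 45),
    ("Conference A", 40, 44),
    ("Independent", 50, 35),
    ("Independent", 20, 25),
]

by_conf = mape_by_group(rows)
for conference, score in sorted(by_conf.items(), key=lambda item: item[1]):
    print(f"{conference}: {score:.1f}%")
```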

In general, the goal of training accurate predictors based on both integrated data and integrated models was accomplished. The data were integrated by team into a single table with one row per team. The models were built with Linear Regression and XGBoost, validated with Leave One Out Cross Validation, and scored with MAPE. Integrated models built with min, max, and average functions showed that combining models can improve performance.
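The two ideas in this paragraph, Leave One Out Cross Validation and min/max/average aggregation, can be sketched together. This is a simplified illustration under stated assumptions: it uses a single-feature least-squares line in place of the full Linear Regression model, a fixed list of numbers as a stand-in for the second model's predictions, and hypothetical data and function names throughout.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def loocv_predictions(xs, ys):
    """Predict each row from a line fitted to all of the other rows."""
    preds = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(slope * xs[i] + intercept)
    return preds

def aggregate(preds_a, preds_b):
    """Element-wise min, max, and average of two models' predictions."""
    pairs = list(zip(preds_a, preds_b))
    return {
        "min": [min(pair) for pair in pairs],
        "max": [max(pair) for pair in pairs],
        "avg": [sum(pair) / 2 for pair in pairs],
    }

# Hypothetical data: one team statistic per team and that team's touchdowns.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

linear_preds = loocv_predictions(xs, ys)
other_preds = [4.0, 4.0, 8.0, 8.0]  # stand-in for a second model's predictions

combined = aggregate(linear_preds, other_preds)
```

Each of the three aggregated prediction lists can then be scored with MAPE in the same way as an individual model.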

In the future, it would be useful to evaluate several other models. Linear Regression and XGBoost are relatively simple models, and adding more sophisticated models may improve performance. Aggregating such models with those used in this paper may improve performance further. In addition, rushing, receiving, and defense data were the only statistics utilized; adding other statistics and integrating more data may improve performance as well. Overall, there are many research opportunities to explore, both in integrating data at the table level and in aggregating models.

References

[1] Lalit Kumar Teli, Nilay Zaveri, Pramila Shinde, “Prediction of Football Match Score and Decision Making Process”, International Journal IJRITCC, Vol. 06, Issue 2, pp. 162-165.

[2] Carson K. Leung, Kyle W. Joseph, “Sports Data Mining: Predicting Results for the College Football Games”, Procedia Computer Science, Vol. 35, 2014, pp. 710-719.