Project Objective

The objective of this project was to predict total season touchdowns for college football teams. The data came from the 2018 college football season, the most recent season available, and consisted of the compiled box scores for every team that played Division 1 football that year. The data was organized into tables covering different aspects of the game for each conference that participates in Division 1 football: ACC, American, BIG 10, BIG 12, MAC, Mountain West, PAC 12, SEC, and Sunbelt. There were also tables for teams that play football without an assigned conference. These teams are labeled “Independent” in the dataset. The data was easily accessible because it is collected for every team in every game throughout the season to record how each team performed.

The data was broken down into tables for rushing, receiving, defense, and other aspects of the game. This project focused on the rushing, receiving, and defense tables, since these are most likely the best predictors of touchdowns for a given team. A touchdown is scored when the team on offense gets the ball into the other team’s end zone. The prediction target is the total count of touchdowns a team scores over the season. Football consists mainly of two parts, offense and defense. Offense consists of rushing and receiving, and its main objective is to make forward progress down the field to score touchdowns. Rushing plays are those in which the football is run up the field, usually by a running back or the quarterback. Receiving plays are those in which the quarterback throws the ball to a receiver in order to gain yardage. Both result in yardage gained down the field, so the season statistics for rushing and receiving are continuous variables. Defense is stopping the other team from making forward progress, and its statistics include interceptions, sacks, and yards allowed. An interception occurs when the team on defense catches a ball thrown by the team on offense, and a sack is a tackle of the quarterback behind the spot where the play started. Yards allowed is the number of yards the opposing team gains, by rushing or receiving, while the specified team is on defense.

The main idea behind this project is that many factors contribute to winning a football game. Data is collected for every NCAA college football game played, which creates opportunities for analysis and machine learning. It helps to be outstanding in one area of the game, but a team needs good rushing, receiving, and defensive numbers in order to win games and score touchdowns. Integrating the data from these categories should therefore make it possible to predict the number of touchdowns a team scores with higher accuracy than using any one category of statistics alone. XGBoost and multivariable linear regression are used for this task and are discussed in later sections. Aggregate models, which combine multiple models, are also used to further improve performance. Finally, leave-one-out cross-validation is used because the input data has few records.
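As a rough illustration of how leave-one-out cross-validation and an averaged aggregate model fit together, the sketch below holds out one team at a time, trains on the rest, and averages the predictions of several simple models. It is not the project's implementation: single-feature least-squares fits stand in for XGBoost and multivariable linear regression to keep the example dependency-free, the function names are hypothetical, and the numbers are invented toy values rather than real 2018 statistics.

```python
from statistics import mean


def fit_simple_ols(xs, ys):
    """Fit y = a + b*x by least squares and return a predictor function."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # Degenerate feature (no variation): fall back to the mean target.
        return lambda x: y_bar
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    return lambda x: a + b * x


def fit_ensemble(rows, targets):
    """Aggregate model: one single-feature fit per column, predictions averaged."""
    n_features = len(rows[0])
    models = [
        fit_simple_ols([row[j] for row in rows], targets)
        for j in range(n_features)
    ]
    return lambda row: mean(m(row[j]) for j, m in enumerate(models))


def loocv_mae(rows, targets, fit):
    """Leave-one-out CV: each record is the test set exactly once."""
    errors = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_targets = targets[:i] + targets[i + 1:]
        model = fit(train_rows, train_targets)
        errors.append(abs(model(rows[i]) - targets[i]))
    return mean(errors)


# Invented toy data (NOT real statistics):
# each row is (rushing yards, receiving yards) for one hypothetical team.
toy_rows = [
    (1800.0, 2900.0),
    (2400.0, 3100.0),
    (1500.0, 2200.0),
    (2100.0, 3600.0),
    (2700.0, 2500.0),
]
toy_touchdowns = [38.0, 47.0, 29.0, 50.0, 44.0]

print("LOOCV mean absolute error:", loocv_mae(toy_rows, toy_touchdowns, fit_ensemble))
```

With n teams, this trains n models, each on n - 1 teams, so every record contributes to both training and evaluation. That is why the technique suits a dataset with few records.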

Related work was reviewed to see what had been done in the past. Most football predictors predict the outcome of a game, including the paper “Sports Data Mining: Predicting Results for the College Football Games” by Carson K. Leung and Kyle W. Joseph. That paper focuses on a predictive algorithm that uses historical data, including each team’s strengths and weaknesses, to predict the winners of games. The paper also tries to compensate for the lack of current data that comes with college teams. Another paper with a similar project is “Prediction of Football Match Score and Decision Making Process” by Lalit Kumar Teli, Nilay Zaveri, and Pramila Shinde. This paper does try to predict the scores of football matches, but on closer reading the scores in question are soccer goals. It has a similar objective of taking data from matches and outputting predictions. However, it also uses all aspects of the game, such as score, winner, starting players, and more. The authors also predicted from multiple data sources, but they did so to compare the accuracy obtained from each source. Hence, predicting touchdowns for college football is something that sets this project apart from past work dealing with integrated models. It is also important to note, based on the first paper, that college football is hard to predict because each player’s career is short and team rosters change frequently. Anyone who wants to build a predictor that can forecast touchdowns at any time with little adjustment would therefore have to account for these same issues.