Data Integration

Data was integrated from sports-statistics.com. As mentioned earlier, many aspects of the game had data recorded, but models were only trained on rushing, receiving, and defense data. All ten NCAA football conferences were utilized and downloaded, with receiving, rushing, and defensive statistics each being separate csv files for each conference. Three aspects of football multiplied by ten conferences equals 30 total data sources. How the three types of statistics were related and eventually combined is highlighted in Figure 1. For integrating tables with different statistics, summing TD was all that was needed. All the raw data tables in a file viewer can be seen in Figure 1.1.

Fig 1.

Fig 1.1.

The aforementioned 30 sources were integrated together using the pandas Python library in a Jupyter Notebook. All receiving, rushing, and defensive statistics were combined into three csv files. For example, all rushing csv files were joined for all ten conferences using the data frame functionality of pandas. Some of the data was not passed in as integers initially, so the raw data had to be cleaned in order to change the data to the correct data types. Figure 2 shows the process mentioned above. For the rushing csv files, “Yds” and “Gain” were the two columns that were originally loaded in as objects in the data frame, which were converted to integer types. This can be seen in the comments of the code in Figure 2. The process highlighted was done for receiving and defensive datasets to account for all 30 data sources.

Fig 2.

Next, Figure 3 shows some exploratory data analysis (EDA) with all of the combined rushing data. 

Fig 3.

The important parts of the EDA are positive or negative relationships between TD (Touchdowns) and all other attributes, such as Yds_rush. The scatterplot on the bottom row with Yds_rush as the bottom axis indicates that as Yds_rush increases (the team rushes more yards throughout the season) the number of TDs a team scores also increases. Similar EDA revealed the same relationships among the combined receiving and defensive data.

After integrating the ten csv files for rushing, the ten for defense, and the ten for receiving, the three larger datasets were combined into one master data frame, from which models could be built on. A concat function from pandas was used to combine the three data frames. The code for this step can be seen in Figure 4.

Fig 4.

Figure 5 shows another SeaBorn PairGrid plotting all the variables against each other for the master data frame. This large matrix is hard to see, but is helpful for seeing if positive or negative relationships are present in the data. Some datasets do not have clear relationships, and it’s much better to know if these relationships are present sooner rather than later. For more fine-tuned exploratory data analysis, a graph of just part of the master dataset is often more useful, as shown in figure 3 for rushing data.

Fig 5.

If the large matrix is too much to look at, another EDA option is to use the pandas corr() function, which produces a number between -1 and 1 to illustrate the correlation between every attribute and every other attribute. Then, the data frame can be searched for numbers close to 1 or -1. This can all be viewed in Figure 6.

Fig 6.

In addition to searching for correlation, pandas can also be used to observe the summary statistics of the master data frame. The describe() function gives us information like total rows, and the info() function gives us information on the columns and data types of the data frame. The describe() function is shown in Figure 7, while the info() function is shown in Figure 9.

Fig 7.

As shown in Figure 7, the “count” row says that there are 116 total rows of data. If any column was missing data, then the count for that column would be less than 116. In this case, no data in the master data frame was missing. The describe function also shows some handy statistical information, like the average total TDs in a season is around 44 touchdowns.

Speaking of average total TDs, the number of TDs made were visualized via a histogram. In Figure 8, some outliers appear on the upper range of touchdowns. This would be important later and is discussed in the Results section.

Fig 8.

Figure 9 shows the info() function used from the pandas library. This function counts the number of columns in a given data frame and also shows the data types of each column. Knowing the data types is helpful because when data is first read in from all the csv files, not all the data types are the same. Some columns had objects and some had numbers. Part of the data integration step was cleaning and wrangling the data to make sure all of the columns had the exact same data type. The info function is a way to double-check that step. The other two aspects of the info function are the index, which gives the first and last row in the data frame, and the size of the data frame. The size of the data frame, 16.3 KB in this case, can be especially important for running models and sharing data with others.

Fig 9.

As another part of the EDA, some multi-bar charts were developed to show the spread of certain attributes with total season TDs. These multi-bar charts were organized geographically, so correlations can be seen by conference. It can then be seen which conferences, if any, had stronger attributes or more total season TDs for the whole conference. To group the data by conference, code was written to group all attributes by conference. The exact code and steps followed are in the next section on data preprocessing, but the resulting multi-bar charts for rushing, receiving, and defense are shown in figures 10-12 below, respectively. 

Fig 10.

Fig 11.

Fig 12.

After the datasets were combined, models were generated. The specifics of the model-generation will be covered in the next section, Data Pre-Processing & Methodology. When generating models, average, max, and min methods were utilized to integrate models together.