What Stats Matter, and What Stats Don’t in College Football

A Statistical Analysis using Python, Tableau, Kaggle and the official NCAA dataset to investigate the most important predictive statistic to winning in NCAA Football.

If you’ve been around American football for any amount of time, you have no doubt heard several popular mantras that are repeated mindlessly and accepted as truth by many. Just a few examples:

  • “Defense wins Championships”
  • “The team that wins the turnover battle wins the game”
  • “Special Teams is the key to winning”

And I’m sure there are many others. In this study, we analyze NCAA football data from the 2019 season, using Python, Seaborn, Matplotlib, Kaggle and Tableau, to find the single statistic that matters most to winning games.

“You Play to Win the Game!”

The goal of any college football program is quite simple. Regardless of what coaches will say to boosters or fans, their job is to win as many games as possible. Or, put in analytics terms:

“The goal of a college football team is to maximize their winning percentage.”

So, what statistic has the highest correlation to winning percentage, and perhaps just as importantly, which statistics have little to no bearing on winning percentage?

Methodology

If you would like to skip to the analysis itself, or just see the results, jump down to the “Results” section.

For this study, we leveraged data from several sources. We used a dataset found on Kaggle that draws on the official NCAA stats warehouse. This is a comprehensive list of all team statistics from the 2019 NCAA season, which culminated with LSU rolling to a National Championship.

This analysis will be extended to cover the decade of the 2010s in a later post. For that, we will use data from Sports-Reference.com.

Extracting and cleaning the data revealed that the dataset did not have all the fields that we wanted for analysis. We used Python to create the missing fields:

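# For our analysis, we had to separate Team, Conference, Wins and Losses, then calculate Win Percentage, Average Rank ((Off Rank + Def Rank) / 2), and a key statistic called "yard diff per play". We'll return to this in a bit.

import pandas as pd

# df is the 2019 team-stats dataframe loaded from the Kaggle dataset
# Split the "Win-Loss" record on the dash into numeric wins and losses
df[['wins','losses']]=df['Win-Loss'].str.split("-",expand=True)
df['wins']=pd.to_numeric(df['wins'])
df['losses']=pd.to_numeric(df['losses'])
# Split "Team (Conference)" at the first "(" into a team name and the rest
df1=pd.DataFrame(df.Team.str.split("(",n=1).tolist(), columns = ['Tm','Conf'])
df.insert(1,'Tm',df1['Tm'].str.strip())
# Everything before the first ")" is the conference
df2=pd.DataFrame(df1.Conf.str.split(")",n=1).tolist(), columns = ['Conference','x'])
df.insert(2,'Conference',df2['Conference'])
df.drop('Team',axis=1,inplace=True)
df['winpct']=df['wins'] / df['Games']
# Names with an extra parenthesis parse as conference "FL" or "OH"; map them to the real conference
df['Conference1']=df["Conference"].replace({"FL":"ACC","OH":"MAC"})
df.drop('Conference',axis=1,inplace=True)
df.insert(2,'Conference',df['Conference1'])
df['Avg Rank']=(df['Off Rank']+df['Def Rank'])/2
df['Yard Diff Per Play']=df['Off Yards/Play'] - df['Yards/Play Allowed']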

We ended up with a dataframe that gives us a breakdown by conference, some basic calculations and some new statistics:

Dataframe containing values for analysis

Pearson Coefficient

The Pearson Coefficient, or r-value, measures the linear relationship between two variables. Without getting into too much detail here, the Pearson r-value is a number between -1 and 1. The closer the value is to either 1 or -1, the stronger the correlation between the two variables. If the r-value is close to 0, there is little or no linear relationship.

When judging strength, we take the absolute value of r. An r-value of -1 would be a perfect negative relationship, meaning Y falls in a perfectly straight line as X rises. Perfection isn’t found in real-life datasets, but an absolute r-value above .7 is generally considered a strong correlation, and that is the cutoff we use in this study.
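For readers who want to see the calculation itself, here is a minimal sketch of Pearson's r in plain Python. The function name `pearson_r` and the toy numbers are ours for illustration only; they are not taken from the NCAA dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson r: co-variation of x and y, scaled by the spread of each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up): y drops by exactly 2 every time x rises by 1
x = [1, 2, 3, 4, 5]
y = [10, 8, 6, 4, 2]

r = pearson_r(x, y)
print(round(r, 3))       # -1.0, a perfect negative relationship
print(round(abs(r), 3))  # 1.0, the strength once direction is ignored
```

Note that the points only need to fall on a straight line for r to reach -1 or 1; the slope of that line does not matter.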

For our case, a negative correlation could be found between average points allowed by a defense and wins. The fewer points given up, the more games won. That makes sense. On the flip side, there is a positive correlation between points scored and games won. This also makes sense.

The question isn’t whether or not the correlation is positive or negative, but HOW STRONG the correlation is. For this study, we will use the Pearson Coefficient to find the team statistics that MOST HIGHLY CORRELATE TO WINNING PERCENTAGE.

To do this, we will leverage Tableau for visualizations and Python’s Seaborn library for initial analysis.

During initial analysis, we ran several jointplots to calculate the r-value of team ranks vs wins, and we found a much higher value for defense than offense:

Defensive team rank correlates more strongly with winning percentage than offensive team rank (.65 vs. .55)

After running several jointplots for initial data analysis, we ran a correlation function on the data to calculate the r-value for each statistical category.
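As a rough illustration of what such a correlation pass can look like in pandas, here is a minimal sketch. It assumes a dataframe with a `winpct` column like the one created earlier; the toy values and the three stat columns are made up for the example, and `corrwith` is just one of several ways to get the same numbers.

```python
import pandas as pd

# Toy stand-in for the cleaned dataframe (values are made up, not NCAA data)
toy = pd.DataFrame({
    "winpct": [0.90, 0.75, 0.60, 0.40, 0.25],
    "Points Per Game": [42.0, 35.5, 30.1, 24.3, 19.8],
    "Avg Points Per Game Allowed": [17.2, 21.0, 26.4, 29.9, 34.5],
    "Penalty Yards Per Game": [55.0, 48.0, 61.0, 50.0, 58.0],
})

# Pearson r of every stat column against winning percentage
r_values = toy.drop(columns="winpct").corrwith(toy["winpct"])

# Order by strength (absolute value), strongest first, keeping the sign
order = r_values.abs().sort_values(ascending=False).index
ranked = r_values.reindex(order)
print(ranked)
```

Sorting on the absolute value keeps strong negative correlations, such as points allowed, near the top instead of burying them at the bottom of the list.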

Using Tableau, we visualized the data and found some surprising trends.

Results

In discussing the results, we will show three surprising trends.

The Age of Offense

Admittedly, I’m a defensive guy. I cut my teeth coaching Linebackers, and have always approached the game from a defensive perspective…so this section surprised me more than it should surprise most casual fans.

In analyzing the results for the most significant statistic, we eliminated relative statistics such as ‘ranks’, for example the Offensive Rank or Defensive Rank of a team. A rank combines several statistics, then produces an ordered list of teams. For our analysis, we will only look at actual game statistics, in the same manner that sabermetricians look at baseball statistics.

We said earlier that a Pearson Coefficient above .7 is a strong statistical correlation. We eliminated all categories that had an r-value below .7, and we were left with 9 statistics that had a heavy correlation to winning percentage. These were (in order):

  • Total Points
  • Touchdowns
  • Points Per Game
  • Avg Points Per Game Allowed
  • Redzone Scores
  • Offensive Td’s
  • Offensive Yards
  • First Downs
  • Redzone Attempts

Of these, Avg Points Per Game Allowed is the lone defensive category, with a -.763 correlation. All the other categories above are offensive statistics.
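The cutoff itself is a simple filter on the absolute r-value. Here is a minimal sketch in plain Python that uses only the three r-values quoted in this post, with signs as reported in the text. The dictionary is an illustrative subset, not the full result set.

```python
# r-values quoted in this post (an illustrative subset, not the full results)
r_values = {
    "Avg Points Per Game Allowed": -0.763,
    "Turnovers Lost": 0.33,
    "Net Punt Return Yards": 0.39,
}

THRESHOLD = 0.7  # the "strong correlation" cutoff used in this study

# Keep a statistic when the magnitude of r clears the cutoff, whatever its sign
strong = {name: r for name, r in r_values.items() if abs(r) > THRESHOLD}
print(strong)
```

Filtering on `abs(r)` is what lets a defensive stat with a negative r-value survive the cut alongside the offensive ones.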

The top category is Total Points, which makes real-world sense. If you saw two teams, where one scored a significantly higher point total during the season, you would likely infer that the higher scoring team won more games.

What is surprising about these results is the fact that there is only ONE defensive category that seems like it has predictive power. Total Points seems like the strongest indicator of a high winning percentage.

If this doesn’t surprise you, hang tight, there’s more coming.

What Does NOT Matter

Some of the most surprising results came from statistics with a very low correlation to winning percentage. Here we will look at statistics from the 2019 season that fly in the face of some long-held football philosophies.

Is It Important to Win the Turnover Battle?

You hear it from commentators, former players, bloggers and Twitter accounts, but does winning the turnover battle have a strong correlation to winning?

In short? No.

In our analysis, turnovers seem to have little correlation to winning games. You would think that a team that protected the ball would win more games than others, but Turnovers Lost only had an r-value of .33, a weak correlation that falls well short of our .7 cutoff. Surprising?

Perhaps, until you consider that one of the four teams in the CFB playoff last year was a whopping NEGATIVE 8 in season-long turnover margin.

Oklahoma was -8 on the turnover margin, yet managed to make it to the CFB playoff

Here’s another way to look at it. Of the 9 Conference champions in FBS football, 3 of them (33%!) were either ZERO or NEGATIVE in their season-long turnover margin.

33% of Conference Champions were either zero or negative in the season long turnover margin.

There were 10 teams that won 9 games or more with a negative turnover margin.

As a coach, I would never tell my players to be loose with the ball, but the point here is that the turnover margin does NOT have a statistical correlation to winning games.
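Counting the teams that won big despite a negative margin is a simple filter. Here is a minimal sketch in pandas; the rows are made up and `Turnover Margin` is an assumed column name, so treat it as an illustration rather than the exact query we ran.

```python
import pandas as pd

# Made-up rows for illustration; "Turnover Margin" is an assumed column name
toy = pd.DataFrame({
    "Tm": ["Team A", "Team B", "Team C", "Team D"],
    "wins": [12, 9, 10, 4],
    "Turnover Margin": [-8, 5, -1, -6],
})

# Teams that won 9 or more games despite a negative season-long turnover margin
mask = (toy["wins"] >= 9) & (toy["Turnover Margin"] < 0)
winners_in_the_red = toy[mask]
print(winners_in_the_red["Tm"].tolist())
```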

Penalties

A mantra drummed into young coaches’ heads everywhere is that “good teams don’t commit penalties!” We were firm believers in this mantra too…until we looked at the data.

There was almost no statistical correlation between penalty yards per game and wins. See for yourself.

Do you see a correlation here? We don’t either. LSU committed more penalties than roughly two thirds of the field and still ran away with the title.

Special Teams wins games! (Or does it?)

One of the most surprising findings in the study was the fact that there were no special teams categories that had any strong correlation to winning. The strongest correlation was “Net Punt Return Yards”, but that only had an r-value of .39, well short of our .7 cutoff.

Here are the teams color coded by Punt return rank, and plotted by Kick return rank and wins. If you can see a pattern, then you are smarter than us.

No significant correlation between wins and the return game. So much for the ‘excel in all three phases’ mantra?
Apparently, they don’t even practice the return game in Colorado Springs…yet they won 11 games!

Discussion

The statistics were gathered directly from the NCAA and analyzed using Jupyter Notebooks hosted on Kaggle, with Python, Seaborn and Matplotlib, and Tableau for the visualizations.

Admittedly, the conclusions drawn here come from a small sample set of one season, and we will continue to study the correlations of various statistics to winning using historical data in the future.

In this study, there were several surprising findings.

First, of the 9 statistics highly correlated to winning percentage, 8 of them were offensive categories. This surprised us (former defensive coaches) a bit, and flew in the face of the long-held mantra of ‘Defense Wins Championships’.

Second, we were very surprised to find that turnovers had no real correlation to winning. We knew coming into this study that Oklahoma was a high turnover team, yet made the CFB playoff, but we assumed that was an outlier in a strong pattern. That turned out not to be the case.

Third, many coaches and fans will repeat the idea of ‘all three phases’, and how important special teams performance is to winning. Special teams stats from 2019 show that there were many teams who performed poorly in special teams, yet had double-digit wins. There was no significant correlation between performance in special teams and winning.

One item that will be covered in our next post is one SINGLE STAT that correlates higher than any that we have seen to winning games in college football. Stay tuned!
