The first recorded argument about the superiority of football teams probably occurred ten minutes after the discovery of pigskin. Before the current college playoff system was created, these discussions were largely perfunctory. Now there is more at stake than ever: admittance to the playoffs is by invitation only, and the bowl selection committee calls the shots. Its deliberations are essentially extensions of these arguments. Did the committee get it right? Which schools were left out? We can apply machine learning to remove the bias and address these questions.
Consider the case of Michigan State and Alabama. The teams are set to play in the Cotton Bowl on December 31, and the winner will play for the national championship, so there is a tremendous amount of discussion about which team is better.
The teams are in different conferences and did not play each other during the regular season. Each had one loss and they shared no common opponents. The figure below shows highlights from their 2015 seasons (the direction of the arrow signifies a win if it points to the team, a loss otherwise):
The case for Alabama: good wins (beating another ranked team) against LSU and Florida, but a loss to Ole Miss. The case for Michigan State: good wins against Ohio St., Michigan and Iowa, but a loss against Nebraska.
So far, the teams seem about evenly matched. If you go one level deeper, you see that Alabama beat Wisconsin, and Wisconsin beat Nebraska (who beat Michigan State), but lost to Iowa (who lost to Michigan State). This looks like important information but doesn’t seem to give either team the edge. Further, as you examine more and more links among teams, it becomes difficult for the human brain to process the information.
What is needed is a method that can simultaneously examine the entire schedule (of which the diagram is a very small piece) and assign ranks based on each team’s entire win/loss record.
To evaluate the quality of the committee’s ranks, we applied a famous machine-learning algorithm: PageRank, the method used in the early days of Google to rank Internet search results.
Google’s search ranking has long since moved beyond the original PageRank, but there is no shortage of online documentation on the algorithm. The Wikipedia page has the basic details, and there are numerous applications to business and science scenarios. Many data processing frameworks, such as Apache Spark, include implementations of PageRank.
The essence of this algorithm is as follows:
The interpretation of the ranks is that the “good” teams are those that beat other “good” teams. Losses against “good” teams (and wins against bad teams) don’t significantly affect the ranks. In this manner, the method naturally learns the quality of the conferences (SEC, MAC, etc.).
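The idea that "good" teams are those that beat other "good" teams can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis (which was done in Octave). It makes several assumptions: each game is treated as a link from the loser to the winner, the damping factor is the conventional 0.85, and the names `pagerank` and `games` are hypothetical. The input contains only the handful of results mentioned above, so the output will not match ranks computed from the full schedule.

```python
def pagerank(games, damping=0.85, tol=1e-10, max_iter=1000):
    """Rank teams by power iteration over (winner, loser) pairs.

    Assumption: every loss is a link from the loser to the winner,
    so a team inherits rank from the teams it has beaten.
    """
    teams = sorted({team for game in games for team in game})
    n = len(teams)
    index = {team: i for i, team in enumerate(teams)}

    # out_links[j] holds the teams that beat team j.
    out_links = [[] for _ in range(n)]
    for winner, loser in games:
        out_links[index[loser]].append(index[winner])

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new_rank = [(1.0 - damping) / n] * n
        dangling = 0.0  # rank held by teams with no losses
        for j, links in enumerate(out_links):
            if links:
                share = damping * rank[j] / len(links)
                for i in links:
                    new_rank[i] += share
            else:
                dangling += rank[j]
        # Undefeated teams have no out-links; spread their rank evenly.
        spread = damping * dangling / n
        new_rank = [r + spread for r in new_rank]

        change = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if change < tol:
            break
    return dict(zip(teams, rank))


# (winner, loser) pairs taken from the 2015 highlights discussed above.
games = [
    ("Alabama", "LSU"),
    ("Alabama", "Florida"),
    ("Alabama", "Wisconsin"),
    ("Ole Miss", "Alabama"),
    ("Michigan State", "Ohio St."),
    ("Michigan State", "Michigan"),
    ("Michigan State", "Iowa"),
    ("Nebraska", "Michigan State"),
    ("Wisconsin", "Nebraska"),
    ("Iowa", "Wisconsin"),
]

ranks = pagerank(games)
for team, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team:15s} {score:.4f}")
```

Because rank flows from loser to winner, a win over a highly ranked team transfers more rank than a win over a weak one. On a fragment this small, though, the ordering mostly reflects which games happen to be included.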
All of the data and code required to reproduce this analysis are located in this repository.
The scores for this season were downloaded from a sports website. A script was written to transform the data into a matrix comprising the win-loss signals. The PageRanks were created in Octave with the Power Method and are displayed in the table below:
As the only major undefeated team, Clemson is a consensus #1 choice. At #2 and #3, Michigan State and Alabama are reversed from the committee’s rankings; this would not have affected their matchup in the semifinal game. However, the method places Stanford in the #4 slot instead of Oklahoma. This should be upsetting to the Cardinal’s fan base, since only the top 4 teams make the playoff and are eligible to play for the national championship. Looks like a huge miss by the committee!
There were some major differences between the two rankings:
The ranks can be applied to the bowl games to find potential upsets. Here are a few highlights in which the committee and PageRank have the teams ordered differently:
The algorithm produces ranks from which predictions (such as those above) can be generated and evaluated. Correctly identifying upsets will build the case for applying this method to sporting events. One thing is for sure: any remaining uncertainty about which team is better should be resolved by January 11th.