Selection Sunday’s Not-So-Secret Ingredients: 5 Big Questions on Analytics and the NCAA Tournament
Though it will be an odd sight, the NCAA Men’s Basketball Tournament will move ahead with games played without fans in attendance as a precaution against the spread of COVID-19.
With March Madness living on, so too does speculation about the identities of the 68 teams headed to postseason play. The selection process has become a sophisticated, data-driven affair, with experts in “bracketology” frequently able to predict the eventual field perfectly. That field is composed of a mix of automatic berths earned through conference championships and at-large bids awarded by the NCAA’s tournament selection committee.
In an effort to establish an objective measure of tournament worthiness beyond wins and losses, the NCAA has openly embraced advanced analytics in recent years and publishes a “team sheet” for each team featuring a variety of metrics and proprietary rankings.
Still, the move toward objective measurement has not quieted critics of the process, especially those whose favorite team may end up outside of the field of 68. Darden Professor Robert Carraway, an avid basketball fan who teaches in Darden’s Quantitative Analysis area, offered his thoughts on selection criteria and the power and limits of advanced analytics.
Like many aspects of sports, the selection process for the NCAA Men’s Basketball Tournament has been revolutionized by big data and advanced metrics. For traditionalists who prefer the “eye test,” what is the case for allowing advanced metrics to play a bigger role in selection?
The eye test is mostly a mirage: We think we see something when it’s not really there. The biases our eyes are subject to have been well documented. Compounding this is our tendency to over-remember successes and conveniently forget failures in our past predictions. So we think we are better than we are.
The only positive thing about the eye test is the extent to which we pick up on things that are not currently measurable. How an individual defensive player reacts to an offensive player’s first step, for instance, reveals a lot to the experienced eye about how sound a defender she or he is. That might enable one to see beyond a string of “bad luck” outcomes — although advanced metrics do attempt to measure luck, albeit poorly — and conclude that sound individual defense is likely to lead to better outcomes going forward.
In 2019, the NCAA introduced the NCAA Evaluation Tool (NET) ranking as a new ranking metric central to the selection process. What’s your advice to selection committee members — or business managers — who might rely heavily on one metric that seems to provide a clear, objective value ranking?
There has been much recent examination in the academic forecasting literature of the advantage of combining multiple “expert” points of view. Almost any combination of experts outperforms one individual expert’s opinion across a series of games. This is because the combination tends to dampen extreme points of view, arriving at a forecast that is less extreme but more accurate in the long run. Of course, we like exciting forecasts, such as predicting the big upset or ranking lesser-known teams more highly, and we are more inclined to recall situations where bold predictions came true than those where they didn’t.
All of this is to say that the NCAA should use a handful of different ranking metrics to arrive at a composite prediction of future performance rather than over-rely on a single metric, such as NET. One thing the committee should consider is the correlation among the various rankings. Oddly enough, research has shown that combining two relatively extreme, uncorrelated forecasters is better than a single forecaster who outperforms both individually.
It is better to have the combination of two hypothetically poorer forecasters — BPI and KenPom, say — than a single one, such as NET, even if NET is better than BPI and KenPom individually, which, of course, has not been established.
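The intuition can be seen in a small simulation. The sketch below is purely illustrative: the “true” margin and the noise levels are invented and do not describe NET, BPI, KenPom, Sagarin or any real ranking system. It simply shows that averaging two noisier but independent forecasters can beat a single, individually better one.

```python
import random

# Purely illustrative simulation: averaging two noisier but independent
# forecasters can beat one individually better forecaster. The "true" margin
# and noise levels are invented and do not describe any real ranking system.

random.seed(0)
N = 100_000          # number of simulated games
TRUE_MARGIN = 5.0    # hypothetical true expected margin of victory

sq_err_single = 0.0  # accumulated squared error of the single forecaster
sq_err_combo = 0.0   # accumulated squared error of the averaged pair

for _ in range(N):
    single = TRUE_MARGIN + random.gauss(0, 6)   # one forecaster, noise sd = 6
    a = TRUE_MARGIN + random.gauss(0, 8)        # two independent forecasters,
    b = TRUE_MARGIN + random.gauss(0, 8)        # each noisier (sd = 8)
    combo = (a + b) / 2                         # the average dampens the extremes

    sq_err_single += (single - TRUE_MARGIN) ** 2
    sq_err_combo += (combo - TRUE_MARGIN) ** 2

print("RMSE, single forecaster:", round((sq_err_single / N) ** 0.5, 2))  # about 6.0
print("RMSE, averaged pair:    ", round((sq_err_combo / N) ** 0.5, 2))   # about 5.7
```

With independent errors, averaging two forecasters divides the noise by roughly the square root of two, which is why checking the correlation among rankings matters: highly correlated errors erode that advantage.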
Syracuse Coach Jim Boeheim famously ranted about the use of advanced metrics to rank the quality of defense played by individual players, claiming that no statistician could possibly understand who was responsible for giving up a basket in a defense where five players work together as a unit. Did Boeheim have a point? What are the risks of flawed input data creating unreliable analytics?
It is certainly true that because basketball is a team sport, it can be difficult to assign blame or give credit to specific individuals. This is not unlike blaming a cornerback in football who gets beaten deep when in fact a safety was supposed to rotate over. However, a coach generally knows exactly who is to blame. In this case, not only did Boeheim call out the wrong statistician, but he also failed to understand what the metric in question was actually tracking. There are lessons here for coaches, statisticians and fans.
Coaches and fans: Know what is being measured and what it is actually telling you. Statisticians: In your quest to find systematic patterns, make sure the things you are tracking have real meaning and are not simply artifacts of arbitrary data manipulation. Of note, companies and third parties that sponsor prediction contests actually care less about who had the single best forecast and more about how the various top individuals or teams developed their forecasts.
The UVA men’s basketball team found itself in a much different position this season than last, living life on the NCAA Tournament bubble for much of the season before a late-season surge. There can be an incredibly narrow margin among a large group of teams all vying for a final spot. When there’s a close call, how should managers balance data, gut feel and experience?
The difference in metrics for teams on the bubble is probably minute, generally within whatever margin of error exists in the individual and collective metrics. Tweak one input in one metric, and the ranking could easily change. For this reason, I’m not a fan of “automating” the choice among bubble teams with a metric. In instances where the metrics show essentially identical resumes, I’d suggest a few alternative possibilities: Choose randomly, so that all teams have an equal chance of being chosen, or weight teams’ chances based on some metric, similar to the NBA draft lottery.
Or, factor in previous experience on the bubble. For example, if North Carolina State failed to come off the bubble last year, tilt the odds in their favor that they come off the bubble this year.
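A weighted lottery of the sort described could be as simple as the minimal sketch below; the team names and weights are hypothetical, with a prior-year near-miss earning a modest bump in the odds.

```python
import random

# Hypothetical weighted "bubble lottery" for a final at-large spot. Team names
# and weights are invented; a near-miss the prior year earns a higher weight,
# as suggested above.
bubble_odds = {
    "Team A": 1.0,   # first season on the bubble
    "Team B": 1.0,
    "Team C": 1.5,   # left on the bubble last year, so the odds tilt its way
}

teams = list(bubble_odds)
weights = [bubble_odds[t] for t in teams]

# random.choices performs weighted sampling; k=1 draws a single winner
last_spot = random.choices(teams, weights=weights, k=1)[0]
print("Final at-large bid goes to:", last_spot)
```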
The NCAA Selection Committee members will have a “team sheet” on Selection Sunday with a multitude of data points, perhaps none more important than four data-driven team rankings (NET, ESPN’s BPI, KenPom and Sagarin). Of the four rankings, which do you think is best and why?
Each of the four ratings is based on a mathematical formula combining different statistics, weighted differently. There have been a few attempts to compare the actual performance of the various ratings against one another — for example, one site reports that KenPom is slightly better than Sagarin at predicting margin of victory. To answer this question, I would want much more data on the actual performance of the various methodologies. Unfortunately, almost all of them are tweaked each year, and NET is brand new, so when comparing across years, one is generally comparing not the specific algorithms but rather the developers themselves, as they modify their algorithms over time.
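As a rough illustration of what such a comparison might involve, the sketch below scores two hypothetical rating systems by the average error of their predicted margins; the games and the systems are invented, not drawn from any real data.

```python
# Sketch of one way to compare rating systems: mean absolute error of each
# system's predicted margin of victory against the actual result. The game
# data and system names are entirely made up for illustration.
games = [
    # (actual margin, System A's predicted margin, System B's predicted margin)
    (7, 5.5, 9.0),
    (-3, 1.0, -4.5),
    (12, 10.0, 8.0),
    (2, -1.5, 3.0),
]

def mean_abs_error(pairs):
    """Average absolute difference between actual and predicted margins."""
    return sum(abs(actual - predicted) for actual, predicted in pairs) / len(pairs)

system_a = mean_abs_error([(actual, a) for actual, a, _ in games])
system_b = mean_abs_error([(actual, b) for actual, _, b in games])
print(f"System A MAE: {system_a:.2f}  System B MAE: {system_b:.2f}")
```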
Here are Virginia’s rankings as of 10 March on each metric:
NET – 42
BPI – 37
KenPom – 44
Sagarin – 21
Clearly, Sagarin is the best ranking methodology!
More seriously, my previous observations suggest that combining the two extreme rankings — 44 and 21, for an average of 32.5 — might be an interesting overall ranking methodology. Clearly, these two ranking systems measure teams very differently, and a composite ranking would combine two very different points of view.
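As a quick sketch of that composite idea, using the 10 March rankings listed above:

```python
# Virginia's 10 March rankings quoted above, and the composite built from the
# two most extreme of them (the best and the worst ranking).
rankings = {"NET": 42, "BPI": 37, "KenPom": 44, "Sagarin": 21}

extremes = (min(rankings.values()), max(rankings.values()))  # (21, 44)
composite = sum(extremes) / 2                                # 32.5
print(f"Extreme rankings {extremes} -> composite {composite}")
```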