Welcome to the 2017 edition of my NCAA tournament guide. Every year Kaggle hosts a data science competition to see who can build the best model for predicting the tournament. The goal of this competition is to produce a win probability for every team in the tournament against every other team. For the matchups that occur, each submission is scored based on how close the probability was to the actual outcome. For example, if I say that Northwestern has a 90% chance of beating Kentucky and they win, I only suffer a penalty related to the 10% difference between 100% and 90%. However, in the highly unlikely scenario that Northwestern loses this game, I would suffer a much greater penalty related to the 90% difference between 90% and 0%.
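To make the penalty concrete, here is a minimal sketch of the scoring for a single game, assuming the competition metric is the log loss I refer to later in this guide. The function name and the clipping constant `eps` are my own choices for illustration.

```python
import math

def log_loss(predicted, outcome, eps=1e-15):
    """Penalty for one game.

    predicted: stated probability that team A wins.
    outcome:   1 if team A actually won, 0 otherwise.
    """
    # Clip so that a prediction of exactly 0 or 1 never hits log(0).
    p = min(max(predicted, eps), 1 - eps)
    return -(outcome * math.log(p) + (1 - outcome) * math.log(1 - p))

win_penalty = log_loss(0.90, 1)   # the 90% pick wins
loss_penalty = log_loss(0.90, 0)  # the 90% pick loses
print(round(win_penalty, 4), round(loss_penalty, 4))
```

Under this metric the penalty for the miss is more than twenty times the penalty for the hit, which is why confident wrong picks are so costly.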
The analysis I have done is intended to be used for that competition. However, this website is intended to be used to fill out a bracket. Win probabilities cannot fill out a bracket on their own, so I have created tools that will assist you in applying my research to that task. I have added a few features this year, most notably the bracket. This now allows you to view the whole bracket and see all the win probabilities as different teams advance.
The power rankings are simply my ranking of every team in college basketball as measured by their opponent-adjusted point differential. Only the top 25 are shown by default, but the rest can be toggled by clicking to show more/less. Additionally, each of the tournament teams contains a link to that team's stats page.
The stats page highlights a selection of the opponent adjusted stats that I generated for this analysis. All of the columns are sortable. Only the tournament teams are given in this section and all of their team names link to that team's stats page.
The projections page allows the user to select any two teams in the tournament via drop down menus and compare them side by side. The probability that the first selected team beats the second is given as well as a statistical comparison of the two teams.
The bracket tool gives the entire bracket (be sure to scroll or swipe around as it is very large). Each node in the graph, represented by a team name with seed as well as a team icon, is clickable. Clicking the team will imply a win in that round and advance the team to the next round. When there is a valid matchup present, the win probability for each team is presented next to the team name. There is a toggle in the upper left corner that allows the user to switch from win probabilities to point spreads.
When a team is clicked that invalidates downstream matchups, that whole stream is cleared from the bracket. Refreshing the page will always reset the bracket to its original state. As the tournament progresses I will update the bracket to reflect the actual results, and this will become the default state. However, you will still be able to manipulate the entire bracket.
| Seed A | Seed B | Expected Win Probability |
Before we get to the math which I'm sure you will all stick around for and read diligently, here are a few notes on how to interpret the various numbers you will see here.
The probabilities are the most straightforward and the most important numbers, and they can be found on the projections and bracket pages. Each one is simply the probability that team A will beat team B. There are two important things to note here. The first is that a team with a 90% win probability is not guaranteed to win and will, in fact, lose about 1 in every 10 matchups with that opponent. The second concerns bracket strategy. If you were simply trying to give yourself the best chance to get every game right, you would pick the team with the higher win probability in every game. However, my model is not going to get every game right and neither will anyone else. Your true goal is to get more games right than anyone else in your bracket group. To accomplish this you are going to need to pick some upsets, so you should look for matchups where the lower-seeded team has a higher probability than you would expect given its seed. These are the upsets where you will gain an advantage over the rest of your pool. For example, using the chart below and everyone's favorite matchup, a 12 seed should beat a 5 seed 29% of the time, so if you see a 12 seed with a 40% win probability then that is a good pick to gain an advantage in your pool even though that team is not favored (the table to the left gives all the first round seed pairings, and you can generate more seed-based expected probabilities for the better seed using the formula 50 + 3 × seed difference; for a 5 seed against a 12 seed that is 50 + 3 × 7 = 71%, which leaves 29% for the 12 seed). My last word about this is that you should also take into account the size of your pool; bigger groups require more upset picks and smaller groups fewer.
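The seed-based rule of thumb is easy to apply yourself. Here is a small sketch using the 50 + 3 × seed difference formula; the function names and the idea of an "edge" in percentage points are just illustrative labels, not anything the site computes for you.

```python
def expected_favorite_win_pct(seed_a, seed_b):
    """Seed-based expected win percentage for the better seed."""
    return 50 + 3 * abs(seed_a - seed_b)

def upset_edge(underdog_seed, favorite_seed, model_underdog_pct):
    """Percentage points by which the model exceeds the seed-based
    expectation for the underdog. Positive values flag candidate upsets."""
    expected_underdog = 100 - expected_favorite_win_pct(underdog_seed, favorite_seed)
    return model_underdog_pct - expected_underdog

# All first round pairings: 1 vs 16, 2 vs 15, ..., 8 vs 9.
first_round = [(seed, 17 - seed) for seed in range(1, 9)]
for better, worse in first_round:
    print(better, worse, expected_favorite_win_pct(better, worse))

# The example from the text: a 12 seed that the model gives 40%.
edge = upset_edge(12, 5, 40)
```

The larger the edge, the more attractive the upset pick, with the caveat about pool size still applying.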
The other numbers you will find on the site are my version of some of the basic box score statistics for each team. These are calculated using a recursive algorithm that accounts for opponent strength in preventing the statistic in question. The numbers themselves have no absolute meaning, but are scaled such that the highest score over the last 15 years registers as a 100 and the lowest is a 0. PtDiff, or opponent adjusted point differential, is the most important of these metrics and is, therefore, what I use for producing my power rankings.
Opponent Adjusted Statistics
Before you can start training a predictive model there is a massive amount of data and feature engineering that must be performed. The raw data provided by the Kaggle competition is in the form of box score statistics for every game, both regular season and tournament. However, in order to make a prediction for a game, we need data points for each team in that game that describe those teams' abilities. This is where the opponent adjustment algorithm comes into play.
The simplest way we could generate these descriptive statistics for each team is to simply look at their average output across the season. For example, we would describe a team's offense by taking its average points scored across the season. There is a very clear problem with this approach. Applying it to the 2016 season the top 5 offensive teams in the country would be Oakland, The Citadel, Marshall, North Florida, and Omaha. None of these teams even made the tournament. Applying the opponent adjustment algorithm reveals that North Carolina was, in fact, the best offense in the country. The Tar Heels made it all the way to the final.
The algorithm itself is relatively simple but computationally expensive. The idea is that for every stat we want to adjust, we give each team a relative score, meaning it has no units or direct interpretability. Every team starts with a score of 0, which reflects our lack of knowledge about the system before the optimization begins. The next step is to generate a score for every team and every game. This score is a reflection of how well that team did with respect to the given stat in the given game. For point differential, which is my metric for overall team quality, the score is calculated using the Pythagorean expectation for the game.
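For readers who have not seen it, the Pythagorean expectation turns points for and points against into a number between 0 and 1. Here is a minimal sketch of the standard form; the exponent of 10 is an assumed placeholder, since I do not state the value I used, and any centering or rescaling applied before the adjustment step is not shown.

```python
def pythagorean_expectation(points_for, points_against, exponent=10.0):
    """Standard Pythagorean expectation for a single game.

    exponent is an assumption for illustration, not the tuned value.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)

tie = pythagorean_expectation(70, 70)
win = pythagorean_expectation(80, 70)
loss = pythagorean_expectation(70, 80)
```

A tie maps to 0.5, and the winner's and loser's values for the same game always sum to 1.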
The rest of the statistics are absolute, meaning that, unlike point differential, there is no "against" portion of the formula. An example is a team's ability to produce blocks, ignoring the blocks they allow. In this case I produce a score using the p-value produced from the normal distribution.
After game scores are calculated for every game, a team's overall stat score is updated to be the average of the sum of the game score and the opponent's score across all the games in the season. In the case of the absolute stats, the opponent score is their score for preventing that stat, not producing it. Then the whole process is repeated until the scores converge. These scores typically end up distributed between -1 and 1. The interpretation is that a team with a score of 0 would be expected to tie an average team in the case of point differential, or produce an average amount of a given statistic against a team that is average in preventing it. As the score diverges from 0 there is not as simple an interpretation, but it can be taken to mean absolute ability.
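The update loop described above can be sketched as follows. This is a simplified illustration rather than my production code: it handles a single stat, assumes the game scores are already centered so that 0 means an even game, and uses a made-up three-team round robin as input. The names `adjust_for_opponents`, `tol`, and `max_iter` are hypothetical.

```python
from collections import defaultdict

def adjust_for_opponents(game_scores, tol=1e-9, max_iter=1000):
    """Iteratively compute opponent-adjusted scores.

    game_scores: list of (team, opponent, score), one row per team per game.
    Each pass sets a team's score to the average of
    (game score + opponent's current score) over its games.
    """
    teams = {t for t, _, _ in game_scores} | {o for _, o, _ in game_scores}
    ratings = {t: 0.0 for t in teams}  # start with no knowledge
    for _ in range(max_iter):
        totals = defaultdict(float)
        counts = defaultdict(int)
        for team, opp, score in game_scores:
            totals[team] += score + ratings[opp]
            counts[team] += 1
        updated = {t: totals[t] / counts[t] if counts[t] else 0.0 for t in teams}
        change = max(abs(updated[t] - ratings[t]) for t in teams)
        ratings = updated
        if change < tol:  # scores have converged
            break
    return ratings

# Hypothetical round robin: A beats B, B beats C, A beats C by more.
example_games = [
    ("A", "B", 0.2), ("B", "A", -0.2),
    ("B", "C", 0.2), ("C", "B", -0.2),
    ("A", "C", 0.4), ("C", "A", -0.4),
]
ratings = adjust_for_opponents(example_games)
```

For an absolute stat, the opponent term would be the opponent's score for preventing that stat, which means tracking two scores per team instead of one.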
For simplicity, I have scaled these values to be bounded from 0 to 100. These bounds are simply defined as the minimum and maximum scores observed over the last 15 years. Not only do these scores pass the eye test of elevating power conference teams, discounting teams from smaller conferences, and matching up with experts' opinions, but they also excel when used as inputs to predictive models.
With the data transformed into usable features, I created an input file for modeling that included every matchup two times so that each row in the data set could have each stat for both teams in both orders. Otherwise the data would be biased by whatever system I used to pick which team filled which slots in the data. Next, I needed a framework for testing models in a way that would avoid overfitting but would still produce the most meaningful results. I settled on a tournament cross-validation approach. In this system, I separated out all the tournament games for a given year and then used all the regular season and tournament games except those to train the model. By repeating this for all the available years, 2003-2016, I get a cross-validation log loss that spans many iterations of model fitting and has around 1000 total games to test on. Additionally, I can see how models perform in different situations. For example, some tournaments have many upsets and some have very few. Different types of models fare better in these scenarios, and it is nice to know how a particular algorithm fares across a wide variety of tournaments.
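Here is a rough sketch of both ideas: mirroring every matchup so each game appears in both team orders, and holding out one year's tournament games at a time. The row layout, field names, and team labels are hypothetical stand-ins for the real feature file.

```python
def mirror_matchup(row):
    """Same game with the two team slots swapped and the outcome flipped."""
    return dict(row, a=row["b"], b=row["a"], a_won=1 - row["a_won"])

def tournament_folds(rows):
    """Yield (season, train, test) where test is one season's tournament
    games and train is every other game, regular season or tournament."""
    seasons = sorted({r["season"] for r in rows if r["is_tourney"]})
    for season in seasons:
        test = [r for r in rows if r["is_tourney"] and r["season"] == season]
        train = [r for r in rows if not (r["is_tourney"] and r["season"] == season)]
        yield season, train, test

# Made-up games; in practice "a" and "b" would carry each team's adjusted stats.
games = [
    {"season": 2015, "is_tourney": False, "a": "Team W", "b": "Team X", "a_won": 1},
    {"season": 2015, "is_tourney": True, "a": "Team W", "b": "Team Y", "a_won": 0},
    {"season": 2016, "is_tourney": False, "a": "Team X", "b": "Team Y", "a_won": 1},
    {"season": 2016, "is_tourney": True, "a": "Team X", "b": "Team Z", "a_won": 1},
]
rows = games + [mirror_matchup(g) for g in games]
folds = list(tournament_folds(rows))
```

Each fold's test set would then be scored with log loss, and the per-year scores compared across candidate models.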
With all the aforementioned work complete, only the fun part remained: testing different algorithms. This year it was a goal of mine to use neural networks. However, as has been my experience across a wide spectrum of data science projects, logistic regression is really tough to beat. Finally, I found a winning algorithm that employed neural nets, and it was all thanks to the concept of variance reduction. While logistic regression is deterministic and will give the same results every time, I was having trouble getting consistent results with the neural net due to the stochastic nature of its optimization. While trying many combinations of its hyper-parameters, particularly the shape of the hidden layers, I would see one set beat out logistic regression only to lose those results when I tried again later. This is when it occurred to me that I should try bagging smaller networks. Bagging is the process of fitting many estimators, each on a bootstrap sample of the data and a subset of the features, and averaging their results together. This turned into a robust solution that was my winner... until regression won again.
The last few years I have done this competition I have always wanted to train my model in a way that takes margin of victory into account. However, because the competition requires probabilities as outputs, I have always done this by using the Pythagorean expectation as my dependent variable. This approach worked reasonably well but would sometimes produce results outside the [0, 1] range and would always lose out to binary classification techniques. This year, for the first time, it occurred to me that I could use a regression-based approach to predict point spreads rather than the Pythagorean expectation. I could then use the same logic as prediction intervals to generate the probability that the predicted spread was actually greater than 0 (i.e., winning the game). This simple regression approach immediately outperformed my excessively complex bagged neural nets.
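The conversion from a predicted spread to a win probability can be sketched like this. It is a simplification that assumes the regression errors are normally distributed with a fixed standard deviation; a full prediction interval would also account for uncertainty in the fitted coefficients. The 11-point standard deviation is a placeholder, not a value from my model.

```python
import math

def win_probability(predicted_spread, residual_std):
    """P(actual margin > 0) when the margin is modeled as
    Normal(predicted_spread, residual_std)."""
    z = predicted_spread / residual_std
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

RESIDUAL_STD = 11.0  # assumed placeholder for the spread of regression errors

pick_em = win_probability(0.0, RESIDUAL_STD)
favorite = win_probability(5.0, RESIDUAL_STD)
underdog = win_probability(-5.0, RESIDUAL_STD)
```

Because the normal distribution is symmetric, the two sides of any matchup always sum to 1, and the output can never leave the [0, 1] range.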
I added some finishing touches that only returned modest gains in model performance, but every bit counts and the gains were real. I performed a rigorous round of stepwise selection on the linear regression that gave me the best set of variables that do not produce extensive overfitting. Finally, part of my motivation for trying neural nets this year is that I have had the hypothesis that teams with certain combinations of attributes probably match up particularly well or poorly against teams with other attributes. I thought the hidden layers of the neural network might identify these. While that may have happened when I was using classification, I could never get improvement with regression. Additionally, with over 30 attributes for each team, I could not feasibly test all ~1000 possible interactions between the variables. Even if I could, it was possible the meaningful interactions involved 3 or 4 or 5 variables. I realized the problem here was dimensionality, so I started thinking about how I could use my favorite dimension reduction technique, Principal Components Analysis. Using PCA, I could reduce the feature set to a handful of core components and interact the components with each other. Now I could see how certain team archetypes interacted with each other. As stated before, the gains were modest but real, and I was finally capturing style of play.
That concludes the core of the analysis I completed, but in reality I did much more than what is described here. If you have any questions about concepts I explained here or are curious about something else I might have tried, please feel free to reach out and ask. Thanks for reading!