March “Machine Learning” Madness Predictions

David Nichols

March Madness is by far my favorite time of the sports calendar year. It’s when 68 of the top college basketball teams in the country compete for the national championship in a roughly three-week, single-elimination tournament. For many of us, this is the opportunity to fill out a few brackets and put our basketball IQ to the test. However, advances in machine learning and generative AI have added another layer, and building a “better” bracket has never been easier.

Harnessing the Power of Generative AI and Machine Learning

From a “winning” bracket perspective, a “perfect” bracket has never been picked and likely never will be. The odds are roughly 1 in 147 quintillion if every game is treated as a coin flip. Instead, winning brackets are often the ones that avoid being “busted,” meaning they identify the top 4-6 teams that will go deep into the tournament. It isn’t necessarily about winning the war, but the battles. With this knowledge and perspective, I set out to build a basic machine-learning model to help identify the top teams in the 2024 tournament. The model I’m using doesn’t designate a winner outright; instead, it displays a confidence percentage that a given team will beat another, taking into consideration large amounts of previously “disregarded” data such as neutral-site shooting percentages, passing percentages, previous tournament success, and more.
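For anyone curious where a number like 147 quintillion comes from, here is a quick sketch of the arithmetic. It assumes every game is an independent 50/50 pick and counts all 67 games a 68-team single-elimination field plays (including the First Four); the function name is just for illustration.

```python
# Assumption: each game is an independent coin flip.
GAMES_IN_BRACKET = 67  # 68 teams, single elimination: 67 games in total


def perfect_bracket_odds(games: int = GAMES_IN_BRACKET, p_correct: float = 0.5) -> float:
    """Return N in '1 in N' for picking every game correctly."""
    return 1.0 / (p_correct ** games)


# Exact count of possible brackets when each game has two outcomes.
coin_flip_outcomes = 2 ** GAMES_IN_BRACKET
print(f"1 in {coin_flip_outcomes:,}")
```

Anyone who knows a little basketball picks better than a coin flip, so passing a higher `p_correct` shrinks the number considerably, but it stays astronomically large.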

 

Background on the NCAA Basketball Tournament

The first thing we need is the seedings and regional brackets. “Selection Sunday” is held the week before the tournament kicks off; that is when the committee gets together, ranks all the teams, and assigns them to a region. The regions exist simply for venues and have nothing to do with what conference a team plays in or where it is located. There are 4 regions with 16 teams in each. The teams are seeded within their region, #1 through #16, and play in that order: #1 vs. #16, #2 vs. #15, and so on. As the seeds get closer and the tournament moves to later rounds, the games get more and more difficult to predict. The 1st-round #5 vs. #12 game is traditionally among the hardest to predict and is known for its upsets.
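The seeding pattern described above is easy to express in a few lines. This is only an illustrative sketch (the function name is made up for this post): each seed is paired with the one that makes the two seed numbers add up to 17.

```python
def first_round_matchups(seeds_in_region: int = 16) -> list[tuple[int, int]]:
    """Pair the best seed with the worst: 1 vs. 16, 2 vs. 15, and so on."""
    half = seeds_in_region // 2
    return [(seed, seeds_in_region + 1 - seed) for seed in range(1, half + 1)]


matchups = first_round_matchups()
for high_seed, low_seed in matchups:
    print(f"#{high_seed} vs. #{low_seed}")
```

The closer the two numbers in a pair, the more evenly matched the teams are supposed to be, which is why the pairs at the bottom of this list are the hard ones to call.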

 

For this exercise, we’ll be using an NCAA Men’s Basketball data set from Kaggle.com. This data set is large and includes all team and game data from 2003 onward. That’s 20+ years of usable data, including wins/losses, shooting percentages, road/home/neutral-site win percentages, 3-point shooting percentages, free throw shooting, and more. Using a template, we’re able to add all of our data to the spreadsheet, insert the rankings by region, and then apply the formula to each region on a round-by-round basis to show which team has the higher win probability. Now that we have a little background on the tournament and the process we’re using, let’s go through the bracket by region and see which winners our model has predicted for each.
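To give a feel for the round-by-round idea, here is a minimal Python sketch. It is not the spreadsheet formula used for this post: the logistic function, the weights, the stat names, and the four made-up teams are all assumptions chosen for illustration. A real model would fit its weights to the historical game data.

```python
import math

# Hypothetical weights, hard-coded for illustration only.
WEIGHTS = {"win_pct": 4.0, "fg_pct": 6.0, "three_pt_pct": 3.0, "ft_pct": 1.5}


def win_probability(team_a: dict, team_b: dict) -> float:
    """Logistic win probability for team_a based on weighted stat differences."""
    score = sum(w * (team_a[stat] - team_b[stat]) for stat, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-score))


def play_region(teams: list[dict]) -> dict:
    """Advance the more likely winner each round until one team remains.

    `teams` must be listed in bracket order (adjacent teams meet each round)
    and its length must be a power of two.
    """
    while len(teams) > 1:
        teams = [a if win_probability(a, b) >= 0.5 else b
                 for a, b in zip(teams[0::2], teams[1::2])]
    return teams[0]


# Made-up stat lines for a four-team mini bracket.
mini_bracket = [
    {"name": "Team A", "win_pct": 0.85, "fg_pct": 0.49, "three_pt_pct": 0.36, "ft_pct": 0.74},
    {"name": "Team D", "win_pct": 0.55, "fg_pct": 0.43, "three_pt_pct": 0.32, "ft_pct": 0.70},
    {"name": "Team B", "win_pct": 0.75, "fg_pct": 0.46, "three_pt_pct": 0.35, "ft_pct": 0.72},
    {"name": "Team C", "win_pct": 0.70, "fg_pct": 0.47, "three_pt_pct": 0.33, "ft_pct": 0.71},
]

champion = play_region(mini_bracket)
print(champion["name"])
```

Always advancing the favorite is the simplest choice; it will never predict an upset unless the stats themselves favor the lower seed, which is exactly the kind of pick a data-driven model can surface.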

East Region – Winner: UConn

The East region may be the “easiest” to pick simply because of the teams. With defending national champion and Big East Tournament champion UConn in the region, there really aren’t any other teams that can match its size and intensity, and the model backs this claim, predicting UConn to easily win each of its matchups. The model also surprisingly picked 1st-round upset wins for Yale and South Dakota State, two teams that lack efficient scoring but have good experience and strong coaching staffs, factors that aren’t usually calculated in a “traditional” prediction model.

South Region – Winner: Kentucky

The South region is a little tighter analytically and has a lot more games that the model had difficulty picking. In particular, the top 4 teams in this region, Houston, Marquette, Kentucky, and Wisconsin, are all very close, with small factors separating their win and loss percentages. The model eventually picks Kentucky to upset both Marquette and Houston on the way to the top of the region. Other callouts include NC State upsetting Texas Tech, likely due to its end-of-year schedule and the momentum from its recent ACC Tournament championship.

Midwest Region – Winner: Creighton

Creighton is the surprise winner of the Midwest. This is another tight region that the model predicts will see a lot of close games. If you enjoy hard-fought, close basketball games, this might be the region to tune into for the 1st round. With the fewest upsets predicted, fans might have to wait until the 2nd round of the Midwest region to see Gonzaga topple Kansas. Other notable games include the regional championship, which the model predicts will pit Creighton against Purdue, with Creighton emerging victorious. This is an interesting pick given that Purdue has been among the top ten teams in the country for the majority of the year but is coming off a disappointing loss to Wisconsin in the Big Ten Tournament semifinals.

West Region – Winner: North Carolina

The West is another region where, based on the model, there don’t look to be a lot of surprises. Our model predicts the higher seed winning each 1st-round game and only one upset in the 2nd round, with St. Mary’s over Alabama. The model picked clear winners across the board until the regional final, which it predicts will be North Carolina against Arizona, two teams that match up athletically and analytically. The model gave a slight edge to North Carolina, possibly based on past tournament experience and success.

In Conclusion

In all, machine learning and AI are amazing tools that can really help one understand the teams and the data behind them. Being able to get a full picture of a team from more than 20 years of data is powerful and can uncover insights and small advantages that are traditionally overlooked. From a technical perspective, it took me longer to set up the Excel workbook and the proper brackets than it did to run the analytics.

As mentioned, Kaggle.com has several similar data sets if you want to run your own model. The last part to wrap this all up is to give you my model’s final prediction. Based on all the data, UConn is predicted to have the best chance both to make it to the championship game and to win it, edging out Kentucky and Creighton by a few percentage points.
