Thursday, March 29, 2012

Predicting Three Point Shooting on a Player Level

As I said in the introductory post, I've been looking primarily at three point shooting. Ken Pomeroy ran some first half/second half correlations on his blog using conference play from the 2010-11 season, and found very little correlation in either offensive or defensive three point percentage between the two halves. This led him to wonder whether the three point line was actually a lottery. Notably, he did find a substantial correlation between first and second half attempts, so at the very least, teams could control how often they played the lottery, if not the results.

This piqued my interest, and I figured there had to be more to the story. After all, you'd be crazy to think that good shooting players have merely gotten lucky. So I decided to take a somewhat deeper look using Bayesian statistics. (I wrote an article for the 2+2 magazine, which may or may not appear in the next few days, that discusses the methods behind this, so I won't go over them again.) Anyways, using MCMC, I was able to produce the following prior beta distributions for three point shooting percentage, broken down by class year:

[Graph: prior beta distributions of three point percentage by class year]

Makes sense: freshmen are terrible, sophomores take a big jump forward, and then there are smaller steps forward every year after that, diminishing returns and all that.
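I'll spare you the MCMC details here, but if you just want the flavor of how class-year priors can be fit, here's a minimal sketch that uses a plain empirical-Bayes maximum-likelihood fit instead of MCMC; the function name and data plumbing are made up for illustration, not my actual code.

```python
# Hypothetical sketch: an empirical-Bayes stand-in for the MCMC fit. For one
# class year, it finds the Beta(alpha, beta) prior that maximizes the
# beta-binomial likelihood of that class's (makes, attempts) pairs. The
# binomial coefficient is constant in alpha/beta, so it is dropped.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def fit_class_prior(makes, attempts):
    makes = np.asarray(makes, dtype=float)
    attempts = np.asarray(attempts, dtype=float)

    def neg_log_lik(params):
        a, b = np.exp(params)  # optimize on the log scale to keep a, b positive
        return -np.sum(betaln(makes + a, attempts - makes + b) - betaln(a, b))

    result = minimize(neg_log_lik, x0=np.log([5.0, 10.0]), method="Nelder-Mead")
    return tuple(np.exp(result.x))

# e.g. one prior per class year, from that class's player-season shot totals:
# priors["FR"] = fit_class_prior(freshman_makes, freshman_attempts)
```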

Priors are nice and all, but they'd better reflect reality. So I took a look at all players in my database (2008-09 through March 11, 2012) and compared my model's expectations to what actually happened. To do this, I split each player year in half. Each year was treated individually, so we didn't add up a player's freshman stats and the first half of his sophomore year to create the posterior for his sophomore year, just the first half stats. That left 12,858 player seasons, counting only players who took at least one three in each half of the year. To get an expected second half shooting percentage, I took the prior distribution for that player's class year, used Bayesian updating on the number of shots made and missed during the first half of the year to get a posterior beta distribution, and took its mean as the expectation. Then I sorted the player years by expected second half three point percentage, put the players into 10 bins, and looked at how many shots each bin took and made during the second half of the year. That resulted in the following graphs:

[Graphs: second-half three point attempts and three point percentage by expected-percentage decile, all classes combined]

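(For the curious, the updating-and-binning step described above looks roughly like the sketch below. The column names and pandas plumbing are hypothetical, not my actual script; the key piece is the beta posterior mean computed from first-half makes and misses.)

```python
# Hypothetical sketch of the updating-and-binning step (column names made up).
import pandas as pd

def expected_pct(prior_a, prior_b, fh_makes, fh_att):
    """Posterior mean of a Beta(prior_a, prior_b) prior updated with
    first-half makes and misses."""
    a = prior_a + fh_makes
    b = prior_b + (fh_att - fh_makes)
    return a / (a + b)

def bin_by_expectation(df, priors):
    """df: one row per player-season with class year and first/second-half
    three point stats. priors: dict mapping class year -> (alpha, beta)."""
    df = df.copy()
    df["exp_pct"] = [
        expected_pct(*priors[row.class_year], row.fh_makes, row.fh_att)
        for row in df.itertuples()
    ]
    # 10 equal-sized bins by expected second-half percentage
    df["bin"] = pd.qcut(df["exp_pct"].rank(method="first"), 10, labels=False)
    grouped = df.groupby("bin")
    return pd.DataFrame({
        "second_half_att": grouped["sh_att"].sum(),
        "second_half_pct": grouped["sh_makes"].sum() / grouped["sh_att"].sum(),
    })
```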
Those graphs look pretty good! The best shooters take by far the most shots, the worst shooters take the fewest, and the general trend for three point percentage tracks reasonably closely. Case closed, right?

Unfortunately, no. Looking at each class individually paints a decidedly different picture:

[Graphs: second-half three point attempts and three point percentage by expected-percentage decile, broken down by class year]

Our sample sizes are smaller, so we should see a little more randomness, but each bin still has several thousand shots in it, so probably not that much. Even more worrying is the number of shots taken by each group. When you lump everyone together, the best players take the most shots and the worst players take the fewest. When you look at each class individually, though, the best players still take the most shots, but it's the average players, not the worst ones, who take the fewest. Thankfully, the players my model expects to be the worst shoot significantly better than it predicts in all five class years, so I can't go all Dave Berri and call coaches dumb just yet.

So what's going on? I haven't run the numbers, but I have a theory I'm pretty confident of. Basically, breaking players down by class year isn't nearly enough. There are subpopulations within those classes with radically different distributions of talent, and when you lump them all together, those differences get lost. Take the worst shooters. To get posterior distributions, I used Bayesian updating on my prior beta distributions to create a new beta distribution with parameters α + makes and β + misses (attempts minus makes). If a player took very few shots in the first half of the season, his posterior would be very close to the prior, and therefore we'd expect him to fall somewhere around the mean. For a player to fall into the bottom 10%, merely missing a high percentage of his shots isn't enough: he also needs to have taken a lot of shots to push his posterior significantly away from the prior. But players don't take shots at random. Coaches let their best shooters take a lot of shots and limit the shots of their worst ones. So when a player falls into the bottom 10%, it's more likely that he's a decent shooter who shot below expectation than that he's a bad shooter who took a lot of shots.
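To make the shrinkage concrete, here's a toy example with a made-up Beta(35, 65) prior (a 35% prior mean; the real class-year priors are different):

```python
# Toy illustration of shrinkage toward the prior, using a made-up Beta(35, 65).
def posterior_mean(alpha, beta, makes, attempts):
    return (alpha + makes) / (alpha + beta + attempts)

prior_a, prior_b = 35.0, 65.0

# Two players both shooting 20% in the first half:
low_volume = posterior_mean(prior_a, prior_b, makes=1, attempts=5)     # ~34.3%
high_volume = posterior_mean(prior_a, prior_b, makes=12, attempts=60)  # ~29.4%

# Only the high-volume shooter moves far enough from the 35% prior mean to
# land near the bottom of the expected-percentage rankings.
```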

Note too that the same problem occurs for the best shooters: they outperform the model's expectations in all five class years as well. I suspect that if I create prior distributions based on some combination of class year, 3pFGA/FGA, and usage, the expectations for those players would move higher as well. That will be my next post.
