Saturday, March 31, 2012

More on Three Point Shooting

In my last post, I introduced prior beta distributions for each class year. However, that model had little predictive power within each class year. So, I theorized that the problem was in lumping all the players in each class together as though they were all part of one underlying talent distribution.

To test this, I decided to break up each class year into subpopulations based on their 3PFGA/FGA, referred to from this point forward as 3PA%. First, I needed a prior distribution for 3PA% itself: players with low shot totals tend to gravitate toward the extremes and need to be regressed toward the average 3PA%. I won't bore you with those graphs, but depending on the class year, the α and β were around 1 and 2. That's about what I expected: a mean of .33 and a fairly flat distribution over (0, 1).
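
For concreteness, here's what that ballpark prior looks like in SciPy (a minimal sketch; the exact parameters vary a bit by class year):

```python
from scipy.stats import beta

# Ballpark prior for 3PA%: alpha ~ 1, beta ~ 2, per the rough values above.
prior = beta(1.0, 2.0)

print(prior.mean())                # 0.333..., the average 3PA%
print(prior.pdf([0.1, 0.5, 0.9]))  # density falls off gently: [1.8, 1.0, 0.2]
```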

Once I had that, I created posteriors for each player by adding their 3PAs and 2PAs to the prior α and β. Then I split each class year into subpopulations based on those posterior estimates of mean 3PA%, ran a Gibbs sampler on each subpopulation, found the α and β for each one, and graphed the mean, α/(α+β).
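
Here's a quick sketch of that per-player update, assuming the ballpark prior above (the function name is just for illustration):

```python
# Conjugate update for 3PA%: threes attempted count toward alpha,
# twos attempted count toward beta.
def posterior_3pa_rate(fg3a, fg2a, alpha_prior=1.0, beta_prior=2.0):
    alpha_post = alpha_prior + fg3a
    beta_post = beta_prior + fg2a
    return alpha_post / (alpha_post + beta_post)  # posterior mean 3PA%

print(posterior_3pa_rate(fg3a=2, fg2a=5))     # low volume: 0.30, near the prior's 1/3
print(posterior_3pa_rate(fg3a=150, fg2a=50))  # specialist: ~0.74, near his observed 0.75
```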

That resulted in the following:

[Graph: mean 3P%, α/(α+β), of each 3PA% subpopulation vs. 3PA%, with a fitted logarithmic curve]

As you can see, I fitted a logarithmic curve, and then used that curve to find new beta parameters for each individual player based on his 3PA%. That way, three point snipers who take the majority of their shots from deep get treated as part of one population, while guys who barely take any threes at all are treated as part of another.
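
In code, the mapping looks something like this (a sketch: the coefficients are made up, and fixing the prior strength κ = α + β is one simple choice for the second degree of freedom, since the curve only pins down the prior mean):

```python
import math

# Hypothetical coefficients for the fitted curve: prior mean 3P% as a
# function of 3PA%. Illustrative only, not the actual fit.
A, B = 0.045, 0.40

def prior_from_3pa_rate(pa3_rate, kappa=60.0):
    # kappa = alpha + beta acts as the prior "sample size"; fixing it is
    # an assumption made for this sketch.
    mean_3p = A * math.log(pa3_rate) + B
    return mean_3p * kappa, (1.0 - mean_3p) * kappa  # (alpha, beta)

print(prior_from_3pa_rate(0.60))  # sniper: higher prior mean 3P%
print(prior_from_3pa_rate(0.05))  # rarely shoots threes: lower prior mean
```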

I then reexamined my plots from last time, using the new curves as priors:

[Graphs: expected vs. actual second-half 3P% and attempts by bin, per class year, using the new priors]

That looks worlds better than before. There's significantly less difference between the expected and actual three point shooting, and the number of shots taken marches in lockstep with how good we expect the player to be. In addition, there's much more granularity for separating good shooters from bad: the distribution of talent is much wider, especially for upperclassmen.

It's still not perfect. The extremes need some work, especially on the low end, where all of the bad shooters were better than projected, often by more than 2 SD. Perhaps this is due to overfitting the curve: while a logarithmic fit looks good, I'm not sure it's actually the right one. For one thing, the limit of log(x) as x->0 is -∞, which obviously makes no sense whatsoever for three point percentage, which can only fall in the range (0,1). While no player has a posterior 3PA% that results in a negative prior mean 3P%, it does serve as a warning that a logarithmic fit isn't actually how talent is distributed in real life. I played around with a few other fits, and while some resulted in a more accurate projection for the bottom quartile, the top quartile, which takes many more shots, was projected less accurately. Perhaps a piecewise fit would be best; that's something to look at in the future, and a sketch of one bounded alternative is below.
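
To illustrate the kind of alternative I mean, here's a quick comparison on made-up data: the usual log fit next to a logistic curve, which is bounded and so can never project an impossible prior mean:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder data: subpopulation mean 3PA% (x) and the Gibbs-estimated
# mean 3P% of that subpopulation (y). Made up to show the mechanics.
x = np.array([0.05, 0.15, 0.30, 0.45, 0.60, 0.75])
y = np.array([0.24, 0.29, 0.32, 0.34, 0.355, 0.365])

# Logarithmic fit: y = a*ln(x) + b. Simple, but unbounded as x -> 0.
a, b = np.polyfit(np.log(x), y, 1)

# A bounded alternative: a logistic curve stays inside (lo, hi), so the
# projected prior mean can't run off to -infinity at low 3PA%.
def logistic(x, lo, hi, k, x0):
    return lo + (hi - lo) / (1.0 + np.exp(-k * (x - x0)))

params, _ = curve_fit(logistic, x, y, p0=[0.2, 0.38, 8.0, 0.2])
print(a, b, params)
```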

Thursday, March 29, 2012

Predicting Three Point Shooting on a Player Level

As I said in the introductory post, I've been looking primarily at three point shooting. Ken Pomeroy, on his blog, ran some first half/second half correlations from the 2010-11 season and found that there was very little correlation in either offensive or defensive three point percentage across first half/second half splits in conference play. This led him to wonder whether the three point line was actually a lottery. Notably, he did find a substantial correlation between first and second half attempts, so at the very least, teams could control how often they played the lottery, if not the results.

This piqued my interest, and I figured there had to be more to the story. After all, you'd be crazy to think that good shooting players have merely gotten lucky. So, I decided to take a somewhat deeper look at this using Bayesian statistics. (I wrote an article for the 2+2 magazine, which may or may not appear in the next few days, discussing the methods behind it, so I won't go over them again.) Anyways, using MCMC, I was able to produce the following prior beta distributions for three point shooting percentage, broken down by class year:

[Graph: prior beta distributions for 3P% by class year]

Makes sense: freshmen are terrible, sophomores take a big jump forward, and then there are smaller steps forward every year after that. Diminishing returns and all that.
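
For the curious, here's a rough sketch of what that estimation can look like: a random-walk Metropolis sampler over the beta-binomial marginal likelihood of each class's make/attempt data. The names, starting point, and step size here are illustrative:

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)

def log_marginal(log_ab, makes, attempts):
    # Beta-binomial marginal likelihood (constant binomial coefficients
    # dropped): each player's true 3P% is drawn from Beta(alpha, beta).
    alpha, beta = np.exp(log_ab)
    return np.sum(betaln(alpha + makes, beta + attempts - makes)
                  - betaln(alpha, beta))

def fit_prior(makes, attempts, n_iter=20000, step=0.05):
    # Random-walk Metropolis over (log alpha, log beta).
    x = np.log([10.0, 20.0])  # crude starting point
    cur = log_marginal(x, makes, attempts)
    samples = []
    for _ in range(n_iter):
        prop = x + rng.normal(0.0, step, size=2)
        new = log_marginal(prop, makes, attempts)
        if np.log(rng.random()) < new - cur:
            x, cur = prop, new
        samples.append(np.exp(x))
    return np.array(samples[n_iter // 2:])  # discard burn-in

# makes/attempts would be arrays with one entry per player, one pair of
# arrays per class year, pulled from the database.
```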

Priors are nice and all, but they had better reflect reality. So, I took a look at all players in my database (08-09 through March 11, 2012) and compared my model's expectations against the results in reality. To do this, I took each player year and split it in half. Each year was treated individually: we didn't add up a player's freshman stats and the first half of his sophomore year to create the posterior for his sophomore year, just the first half stats. That left 12,858 player seasons, looking only at players who took at least one three in each half of the year. To get an expected second half shooting percentage, I took the prior distribution for that player's class year, applied Bayesian updating based on the shots made and missed during the first half of the year to create a new posterior beta distribution, and then calculated an expected mean from that. Then I sorted the player years by expected second half three point percentage, put the players into 10 bins, and looked at how many shots they took and made during the second half of the year.
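
The binning step, sketched on toy data (all column names and prior values here are stand-ins):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the database: one row per player season with first-half
# (fh) and second-half (sh) splits. Column names are my own.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "class_year": rng.integers(1, 6, n),  # 1 = freshman ... 5 = fifth year
    "fh_3pm": rng.integers(0, 30, n),
    "fh_3pa": rng.integers(30, 80, n),
    "sh_3pm": rng.integers(0, 30, n),
    "sh_3pa": rng.integers(30, 80, n),
})
# Hypothetical (alpha, beta) priors per class year from the MCMC step.
priors = {c: (30.0 + c, 70.0 - c) for c in range(1, 6)}

a = df["class_year"].map(lambda c: priors[c][0])
b = df["class_year"].map(lambda c: priors[c][1])
# Conjugate update: expected second-half 3P% given first-half makes/misses.
df["expected"] = (a + df["fh_3pm"]) / (a + b + df["fh_3pa"])

df["bin"] = pd.qcut(df["expected"], 10, labels=False)
summary = df.groupby("bin").agg(shots=("sh_3pa", "sum"),
                                makes=("sh_3pm", "sum"))
summary["actual_3p_pct"] = summary["makes"] / summary["shots"]
print(summary)
```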

That resulted in the following graphs:

[Graphs: second-half 3PA and 3P% by expected-3P% decile, all players]

Those graphs look pretty good! The best shooters take by far the most shots, the worst shooters take the fewest, and the general trend for three point percentage tracks reasonably closely. Case closed, right?

Unfortunately, no. Looking at each class individually paints a decidedly different picture:

[Graphs: second-half 3PA and 3P% by expected-3P% decile, one pair of charts per class year]

Our sample sizes are smaller, so we should see a little more randomness, but each bin still has several thousand shots in it, so probably not that much. Even more worrying is the number of shots taken by each group. When you lump everyone together, the best players take the most shots and the worst players take the fewest. When you look at each class individually, though, the best players still take the most shots, but it's the average players who take the fewest, not the worst. Thankfully, the players my model expects to be the worst shoot significantly better than projected in all five class years, so I can't go all Dave Berri and call coaches dumb just yet.

So what's going on? I haven't run the numbers, but I have a theory I'm pretty confident in. Basically, breaking players down by class year isn't nearly enough. There are subpopulations within those classes that have radically different distributions of talent, and when you lump them all together, those differences get lost. Take the worst shooters. To get posterior distributions, I used Bayesian updating on my prior beta distributions to create a new beta distribution with parameters α+makes and β+shots-makes. If a player took very few shots in the first half of the season, his posterior would be very close to the prior, and therefore we'd expect him to fall somewhere around the mean. In order for a player to fall into the bottom 10%, merely missing a high percentage of shots isn't enough: he would have needed to take a lot of shots too in order to make his posterior significantly different from the prior. But players don't take shots at random. Coaches let their best shooters take a lot of shots and limit the shots from their worst players. So, when a player falls into the bottom 10%, it's more likely that he's a decent shooter who is shooting below expectation than a bad shooter who has taken a lot of shots.
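
A quick worked example of that shrinkage, using an illustrative Beta(35, 65) prior (mean .350), not the actual class-year parameters:

```python
# Posterior mean after updating: (alpha + makes) / (alpha + beta + attempts).
alpha, beta = 35.0, 65.0  # illustrative prior with mean .350

def post_mean(makes, attempts):
    return (alpha + makes) / (alpha + beta + attempts)

print(post_mean(1, 10))    # 1-for-10: ~.327, barely budged off the prior
print(post_mean(10, 100))  # 10-for-100: ~.225, only high volume gets this low
```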

Note too that the same problem occurs for the best shooters: they overperform the model's expectations in all five class years as well. I suspect that if I create prior distributions based on some combination of class year, 3PFGA/FGA, and usage, their projections would get a bump higher too. That will be my next post.

Welcome

Welcome to my blog. I began looking into some college basketball stats a little while ago and decided to publish some of my stuff here. The eventual goal is to create a full-fledged ranking system based on component statistics using Bayesian updating, but that's seemingly a ways away; right now I'm still scratching the surface of three point shooting. Guys like Ken Pomeroy already have a really good handle on ranking teams based on their offensive and defensive efficiency. I'm hoping to extend that a little further to look at things such as individual matchups and whether teams have gotten lucky or unlucky based on their component stats.