Post by eric on Nov 1, 2016 15:34:07 GMT -6
The State of Stats
There are so many stats these days that I thought it would be helpful to do a brief overview. I won't go into the explicit formula or statistical derivation of any of them; to my mind the more important thing is whether they work.
.
But How Do They Work?
All composite stats have the same goal: how much does Player X help or hurt their team's prospects of winning? They assign various and usually interrelated weights to their inputs and spit out an answer in Wins. (Some metrics say Wins and mean Wins Above Replacement, but it's the same principle.) To accomplish this there are two basic types of inputs: box score and plus minus.
The advantage to stats that only use box score as input is that they have a much larger sample size to work with, going back at least to 1978 when the NBA first started measuring turnovers. It is also usually possible to extend such a stat to at least 1974 (offensive and defensive rebounds, blocks and steals) and even as far as 1952 (minutes played), though of course these necessarily become more like estimates and less like measurements.
The disadvantage is that the box score does not measure everything: floor spacing, hockey assists, boxing out, screen setting, will to win, etc. all have no entry in your local newspaper, so many stats instead rely on plus minus. The disadvantage there is that such records are only available back to about 2001, depending on whom you ask, and cannot possibly be extended any earlier.
One solution that has gained a lot of currency recently is to use both sets of inputs. We shall see if that solution is effective.
.
Alright, So Which are Which?
Anything that has "plus minus" in its name will use plus minus, and anything that doesn't won't. Unfortunately, none of the ones that use plus minus have a publicly available formula. This is a red flag for lots of reasons: there's no way to apply the formula outside the publicly available results, there's no way to consider the formula a priori, if the publicly available database goes down or is terminated there's no way to reproduce it, the methodology could be misstated by whoever is publishing the results, they could be making honest mistakes in the calculations, or they could just be making it up as they go. These concerns are best illustrated with a quick gander at Jeremias Engelmann's Real Plus Minus (RPM) as published by ESPN.
As stated earlier, plus minus goes back to about 2001, so it is theoretically possible to calculate RPM back about that far. Some stats use a season or three of past data to help smooth out year to year variations, but even with that in mind it's very odd that RPM starts only in the 2014 season. The pages of past years mysteriously keep getting updated well into the next season, although as near as I can tell the values themselves aren't actually changed. There's no linked explanation of how the stat works or if/how/why it's better than any other; to find that you have to search the ESPN site, and good luck with that. Most critically, they have, without any announcement or comment whatsoever, fundamentally changed how the formula works.
.
Wait a Minute Smart Guy, I Thought You Said the Formula Wasn't Public?
That's true.
So How Do You Know It's Changed?
By an ancient and obscure mathematical technique known as "addition". There are 30 teams in the NBA and they all (usually) play 82 games, each of which must end in exactly one win and exactly one loss; with two teams per game that's 30 x 82 / 2 = 1230 wins to go around every year. If a stat outputs Wins, it should therefore add up to about 1230. If a stat instead outputs Wins Above Replacement, it should add up to a number smaller than 1230, with the precise amount depending on how that stat defines replacement level. Conceptually a team of replacement level players should be near the worst in most any NBA season, but it would nevertheless generate more than literally zero wins. I'll go into this in much greater depth later, but for now let's just look at the league-wide totals for the past three years of RPM, Justin Kubatko's Win Shares as published by basketball-reference.com, and Daniel Myers' Box Plus Minus as published by the same. The latter two are explicitly described as being flat wins and wins over replacement player respectively, and you'll see what I'm talking about.
year   Real PM   WS     Box PM
2014   868       1256   813
2015   870       1256   810
2016   1172      1255   811
WS and Box PM are right around the same spot every year, and Box PM is built around a team of replacement level players generating about 14 wins. The first two years of Real PM look a lot like Box PM except they have a different replacement level of about 12 wins. That's not a major concern. What is a major concern is that the NBA abruptly jumps 300 wins in 2016 according to Real PM. The change is so drastic, and its occurring without comment so bizarre, that I even counted up how many minutes played were in each sample to make sure the website itself wasn't somehow horribly broken, but in all years there were about 590,000 MP, which is about right. Note that this doesn't mean the website isn't somehow horribly broken! Just that it's not horribly broken in a way that would, for example, list every Atlantic Division player twice or include the playoffs or something else like that.
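For anyone who wants to redo the addition, here is a rough Python sketch of the checks described above. The totals are copied straight from the table. The helper name implied_replacement_wins and the idea that replacement level is simply (1230 minus the stat's total) divided by 30 teams are my own assumptions about how the 14 and 12 figures fall out, not anything published by ESPN or basketball-reference.com.

```python
TEAMS = 30
GAMES_PER_TEAM = 82

# Two teams share each game and exactly one of them wins it,
# so league-wide wins (and games) = team-games / 2.
LEAGUE_WINS = TEAMS * GAMES_PER_TEAM // 2

# League-wide totals copied from the table above.
totals = {
    2014: {"Real PM": 868, "WS": 1256, "Box PM": 813},
    2015: {"Real PM": 870, "WS": 1256, "Box PM": 810},
    2016: {"Real PM": 1172, "WS": 1255, "Box PM": 811},
}


def implied_replacement_wins(stat_total):
    """Assumed reading: wins per team NOT credited by a wins-above-replacement stat."""
    return (LEAGUE_WINS - stat_total) / TEAMS


for year, row in totals.items():
    for stat in ("Real PM", "Box PM"):  # WS is flat wins, so it is skipped here
        level = implied_replacement_wins(row[stat])
        print(f"{year} {stat}: {row[stat]} wins, implied replacement team ~{level:.1f} wins")

# Year over year change in the Real PM total.
rpm_jump = totals[2016]["Real PM"] - totals[2015]["Real PM"]
print("Real PM jump, 2015 to 2016:", rpm_jump)

# Minutes sanity check: every game has 2 teams x 5 players x 48 minutes,
# before counting any overtime.
REGULATION_MINUTES = LEAGUE_WINS * 2 * 5 * 48
print("Regulation minutes in a season:", REGULATION_MINUTES)
```

Under that assumption, 2016 Real PM implies a replacement level team worth only a couple of wins, which is another way of seeing that something changed.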
.
How Can Mirrors Be Real if Our Stats Aren't Real?
It's a real conundrum. I will say though that I've been looking at these composite stats for a long time and I've never seen anything even close to this. I've been in contact with the stat's creator and he describes the year over year change as "very slight", so either he's not privy to what ESPN's doing with his methodology or we have wildly divergent definitions of the word slight. In any event, over the coming days I'll be going more in depth on various stats. First of all we'll look at the interestingly distinct assumptions they make, then we'll look at the results. Getting the right number of wins in an NBA season is a pretty low bar, so let's see who gets the (most) correct number of wins for each team.