Saturday, September 28, 2013

WARSCOR Revisited

I like tweaking.  I've tweaked WARSCOR a bit since its inception.  And I've never really been satisfied.  Because while I've done good work with it, and I think it has (in a lot of ways) improved, there are two things I've been really unhappy with.  The first is that each revision of WARSCOR has made it more complex than the previous iteration.  As if it weren't bad enough that I'm using a system (Wins Above Replacement) that's controversial, I'm also further complicating it by using averages that aren't just the standard mean, which is confusing.  And then I'm using Wins Above Average which, although a more intuitive measure than WAR, is still confusing and confounding.  The second thing is that the system is a little arbitrary.  I mean, there are the two categories of "career" and "peak."  And each category has three subcategories:  total value, vanishing value (first number times x, second number times x-1, third number times x-2, etc.), and a vanishing value that starts at a higher number, so that the differences between season n and season n+1 are closer together.  Career, obviously, is not arbitrary; but for peak, I chose 10 years.  I claimed this wasn't arbitrary, because it's the minimum requirement for the Hall of Fame.  That's just stupid.  It's arbitrary.  And what's worse, the numbers I chose for the vanishing coefficients (starting at 30 and 45 for career, and 10 and 15 for peak, respectively) are arbitrary, as well.  And then I compound the whole thing by doing it again with WAA - and taking yet another odd average!  The main point of this whole exercise was to develop a formula in which I had confidence.  And while I like the results being put out by the current iteration of WARSCOR, the arbitrariness of the whole system is something I find really irksome.  So it was, more or less, back to the drawing board.  And I finally came to a solution I found palatable.

Let's start with the stuff I got right.  Number one, boiling a whole career down to one number that can serve as a quick reference point; that's good - but WAR already does that on its own.  So, number two, weighting peak and career differently, and giving them both input into that one number - that's really good.  That, we have to keep.  Number three, I liked that I sorted a player's career from his best season down to his worst.  Number four, the idea of the vanishing coefficient is salvageable, and definitely does something for balancing peak and career values.  Number five, remembering that any one-number system is, by nature, the start of the conversation and not the end - that's the most important lesson of all.  We're definitely sticking with that one.  But everything else is fair game.

First of all, I'm scrapping WAA from the formula altogether.  There's no reason to include it.  I understand why a lot of people (people for whom I have a great deal of respect, as well!) want to use it to derive peak value (Adam Darowski at the Hall of Stats and Tom Tango over at his blog are both proponents of this line of thinking).  And many of those thinkers (including both of the aforementioned ones) believe in only counting WAA for positive seasons.  That's not really my boat, because I think you have to account for value if it happened, for good or for ill.  And while I respect these other people, I can't help but think that if there's value in being between replacement level and average, we must account for that.  So I think WAR will work just fine.  Most of all, there's the frustration many people (like me) feel with WAA:  a perfectly average pitcher gets 0 WAA for 200 average innings, while a September call-up who throws 4 decent innings could easily have something like 0.2 WAA.  That's just wrong, and I want the Tommy Johns and Jamie Moyers and Craig Counsells of this world to get credit where credit is due.  It's hard to be an MLB player, particularly one who's above replacement level!

Second of all, I never liked that there were six categories getting averaged.  That kinda defeats the purpose of the whole "coming up with a system" idea.  I mean, you want to simplify things, not make them more convoluted than ever.  My thinking at the time went something like the idea of crowdsourcing:  get a bunch of different but similar measures together, and ask them to spit out a number; that number will be better for having had many inputs.  There's some amount of wisdom in that, but not enough to inspire the level of confidence for which I was hoping.  For one thing, six isn't nearly enough inputs.  I should have had 60 if I wanted any confidence.  But 60 is too many, so that wouldn't work.  So we're scrapping the idea of six categories, too.  We're going down to one category.  It must include peak and career, but it's just one category.

Finally, we come to the vanishing coefficients.  Ah, the vanishing coefficients.  Actually, they were a good idea.  It's a thing that Bill James does a lot in his work.  I remember an article in the New Bill James Historical Baseball Abstract about the great pitching rotations of all time, in which he scored them by giving one point for each Win Share by the top starter, two for each WS by the #2 guy, three for each by the #3, etc.  The idea was that the more balanced the rotation, the higher the score - basically, so you don't end up saying the 1985 Mets were the greatest rotation of their generation because Doc Gooden was so good that, no matter who else was in the rotation, they'd come out on top.  Anyway, the vanishing coefficients were designed to mimic that.  But I had to pick arbitrary starting points.  And I didn't like that.  Not only did I not like it because it was arbitrary, though.  For example, one of them for the whole career started like this:  45(x1)+44(x2)+43(x3)....  This is terrible, because 44/45 is not the same as 43/44 - so the relationships between the terms aren't consistent.  So that would have to be fixed.
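If you want to see that inconsistency in black and white, here's a quick Python sketch (just my own illustration of the problem, not part of any version of the formula) that prints the ratio between each pair of neighboring linear coefficients:

```python
# Linearly descending weights, like the old career coefficients: 45, 44, 43, ..., 1
linear_weights = list(range(45, 0, -1))

# Ratio of each weight to the one before it - it keeps drifting as you go down.
for a, b in zip(linear_weights, linear_weights[1:]):
    print(f"{b}/{a} = {b / a:.3f}")
# Near the top the discount is tiny (44/45 = 0.978), but by the bottom it's
# enormous (1/2 = 0.500), so late seasons get chopped far more aggressively,
# relative to their neighbors, than early ones do.
```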

And it is!  We're keeping the vanishing coefficient idea, because that's how we can be sure to include "peak" in the one number that I'm coming up with.  The idea is actually really simple:  to keep the coefficients in a consistent relationship, I'm going to make each coefficient a power of a constant less than one, so that the weights start at 1 and each one is the same fixed fraction of the one before it.  Like this:  x1*0.8^0 + x2*0.8^1 + x3*0.8^2...; which essentially translates to 1(x1)+.8(x2)+.64(x3)..., which I like a lot better.  The exponent still goes up by one for each term, but the coefficients themselves decay geometrically - and that's exactly what keeps the ratio between neighboring terms constant.  The only thing left to do was to choose the constant.
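Here's the same kind of quick sketch (again, just an illustration I'm adding, not anything that existed in the earlier versions) for the new-style coefficients, showing that the ratio between neighboring weights never drifts:

```python
# Geometric weights: constant**n for n = 0, 1, 2, ...
constant = 0.8  # the example constant from above; I settle on a different one below
geometric_weights = [constant ** n for n in range(17)]

# The ratio of each weight to the one before it is always the constant itself.
for a, b in zip(geometric_weights, geometric_weights[1:]):
    print(f"{b / a:.3f}")  # 0.800 every time
```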

For the constant, I chose 0.9 - and there are at least two reasons for this.  Number one, "9" is the great number of baseball numerology:  nine innings, nine fielders, nine in the batting order, a forfeit is recorded as 9-0, etc., etc.  It's a wonderful thing.  But more importantly, I tried some other constants first.  I tried .8, and it came out WAY overvaluing peak.  It basically said that if Mike Trout puts up 7 WAR next year, he'd have a better HOF case than Yogi Berra.  That's a little messed up, in my opinion, great as Trout has been and as much as catchers always pose problems for this kind of system.  So I tried numbers closer to 1, and 0.99 and 0.95 both kept spitting out rankings that were much, much too close to a straight ordering by career WAR.  Actually, using 0.9, the resultant order is pretty close to the order that WARSCOR 2.0 put out.  That wasn't the goal, per se, but it was nice to see.  So here's what we do for WARSCOR 3.0:

Take the WAR accumulated by a player in each season of his career.  Sort from greatest to least.  Number the seasons, starting with 0.  We will call that number n.  The numerical value for the WAR of each season we will call x, with x1 representing the best season, x2 the second best (as I have done throughout this post).  Each season's WAR gets multiplied by 0.9^n, so the formula is simply x1*0.9^0 + x2*0.9^1 + x3*0.9^2 . . .
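In code, that's about a one-liner.  Here's a little Python version of the formula above (the function name and the way it's packaged are just mine; a real implementation would also need the actual season-by-season WAR data from somewhere like baseball-reference):

```python
def warscor(season_wars, constant=0.9):
    """WARSCOR 3.0: sort a player's single-season WAR totals from best to worst
    and weight the nth-best season (counting from 0) by constant**n."""
    ranked = sorted(season_wars, reverse=True)
    return sum(war * constant ** n for n, war in enumerate(ranked))
```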

Here's a sample player.  Johnny Bench played 17 seasons in the Major Leagues, accumulating 75.2 WAR (baseball-reference version).  His seasonal WAR totals, in order from greatest to least, with the first one numbered "0," were as follows:

0.  8.6
1.  7.8
2.  7.5
3.  6.6
4.  6.1
5.  5.6
6.  5.0
7.  5.0
8.  4.7
9.  4.6
10.  4.5
11.  4.1
12.  3.3
13.  1.1
14.  1.1
15.  0.0
16.  -0.5

For Bench's career, that means we do:

8.6*0.9^0 + 7.8*0.9^1 + 7.5*0.9^2 + 6.6*0.9^3 + 6.1*0.9^4 + 5.6*0.9^5 + 5.0*0.9^6 + 5.0*0.9^7 + 4.7*0.9^8 + 4.6*0.9^9 + 4.5*0.9^10 + 4.1*0.9^11 + 3.3*0.9^12 + 1.1*0.9^13 + 1.1*0.9^14 + 0.0*0.9^15 + (-0.5)*0.9^16 = 46.9
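And if you'd rather not punch all of that into a calculator, feeding those same seventeen seasons into the little warscor sketch from above spits out the same answer:

```python
bench_war = [8.6, 7.8, 7.5, 6.6, 6.1, 5.6, 5.0, 5.0, 4.7, 4.6, 4.5,
             4.1, 3.3, 1.1, 1.1, 0.0, -0.5]  # Bench's seasons, best to worst

print(round(warscor(bench_war), 1))  # 46.9
```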

Compare that to Bench's teammate, Pete Rose, who played 24 seasons, accumulating 79.4 WAR - just more than Bench.  I'll spare you the list, but suffice it to say that Rose's total comes out to 46.8 - just a hair below Bench, instead of above him.  Bench's stronger peak outweighs Rose's longer hangaround value.  And if you think Rose's really negative years are dinging him here (his four worst seasons were .4, .9, 1.1, and 2.1 - a total of 4.5 - Wins Below Replacement), those seasons cost him only about .4 WARSCOR points in total - not even close to the 4.5 he's dinged by just using standard WAR - and yet, they're still accounted for.  And yes, those negative seasons are enough to make up the difference between Rose and Bench.  So while they do impact how these two rank, it's still an interesting exercise, don't you think?
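And just to show my work on that -.4 (a back-of-the-envelope sketch using only the four season values given above):  in a 24-season career, the four worst seasons land in spots 20 through 23 of the sorted list, so they get weights of 0.9^20 through 0.9^23.

```python
# Rose's four worst seasons, ordered from least-bad to worst, occupy the last
# four spots (n = 20 through 23) of his 24-season sorted list.
worst_seasons = [-0.4, -0.9, -1.1, -2.1]
cost = sum(war * 0.9 ** n for n, war in zip(range(20, 24), worst_seasons))
print(round(cost, 1))  # -0.4
```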

It's simple, it's elegant, it does exactly the job that WARSCOR was intended to do.  I now firmly believe that WARSCOR is every bit as sensible as any other HOF measure out there:  JAWS, CAWS, the Hall of Stats - any of them.  I'll take WARSCOR 3.0 as my pick.