19 February 2015

Forecast Evaluation

Boy, blow one historic blizzard forecast and people get all cranky*.  Except, as H. Michael Mogil discusses, it was an almost perfect forecast.  For the specifics of that storm and its forecast, I refer you to Mogil's article.

I'm going to take up the narrower topic of forecast evaluation.  (Disclosure: I do work for NOAA/NWS, but, as always, this blog presents my thoughts alone.  Not least here, because I agree more with Mogil than with the head of the NWS, Louis Uccellini, about this forecast.)  One school of forecast (or model) evaluation looks at computing large scale statistics.  The most famous one for global atmospheric models is the 5 day, 500 millibar (halfway up the atmosphere), wave number 1-20 (large scale patterns), anomaly correlation.  When people refer to the ECMWF model (or 'Euro') being better than the NWS's model (GFS), this is usually the number that is being compared.  But I don't live halfway up the atmosphere, nor do most of you.  We're somewhere near the bottom of the atmosphere.  And there is much more of interest than just average temperature through a layer of the atmosphere.  So there are many other scores (dozens of them) -- see http://www.emc.ncep.noaa.gov/gmb/STATS_vsdb/ for some examples and discussion of what the scores mean.

Most of those scores, though, don't get to my personal -- weather forecast consumer -- interest.  Namely, I'm trying to make a decision of some kind.  NYC, which heard a forecast of 24" (60 cm) but got 9" (22 cm), presumably made decisions that they wouldn't have if they'd heard the perfect forecast that hindsight now provides.  It's here, I think, that we get to the meat of forecast evaluation.  Had this same error been made over the ocean, rather than over the most populous city in the US, with the rest being as it happened, the NWS would be getting praised for their great forecast.  The important part was not the difference between reality and forecast, but the number of people who made the wrong (in hindsight) decisions.

So let's explore evaluating forecasts by way of our decisions.  I don't make decisions for major metropolitan areas, and not about street plowing and so forth, so I will leave that aside.  One realm of weather-affected decisions is in my running.  Let's ignore summer decisions (I'd as soon avoid thinking about what summers are like here) and go with the path as temperatures drop.  Normal gear, in pleasant weather conditions, is t-shirt and shorts.  Once it cools below 60 F (16 C), I pull on a pair of gloves for my run.

Temperature is my forecast variable of interest.  Wear gloves, or not, is my decision.  If the forecast is right, we're not yet done.  For my purpose here, 'right' is just being on the correct side of 60 F.  But forecasting isn't perfect, and doesn't have to be in order to be useful.  We need to consider the rest of the 'payoff matrix'.  If I wear gloves, and shouldn't (forecast to be below 60, but it is actually above), well, that's a minor nuisance.  Give it -1 point.  It isn't entirely negligible.  If it were, I'd just carry gloves all the time.  The other side is if I don't wear gloves, but should have.  This leads to a very unpleasant run, much more unpleasant than the nuisance of carrying gloves I didn't need.  Give that, say, -5 points.  And, to round out the matrix, +1 point for a correct forecast.

Payoff Matrix        | Should have worn gloves | Should not have worn gloves
Did wear gloves      |           +1            |             -1
Did not wear gloves  |           -5            |             +1

I have two things to consider in evaluating the forecasts.  First, forecast versus observation (the verification matrix, which gives the fraction of days falling in each forecast/observation combination).  Second, how much I care (my payoff matrix).  To score the forecast in terms of my decision, we multiply each cell of the verification matrix by the payoff for that same outcome, and add up all those numbers.  Note that the two tables are laid out differently, so 'same outcome' means matching by meaning, not by position: a forecast below 60 F means I did wear gloves, and an observation below 60 F means I should have.  If the forecasts were perfect, the score/value would be +1.  If they were always wrong, and always saying it was warmer than 60 but turning out always to be colder, the score would be -5.  Reality will be somewhere in between.  For a good enough forecast, the value/score is positive.  As long as the score is positive, it makes more sense to listen to the forecast (it is useful) than to ignore it.

Verification Matrix  | Forecast > 60 F | Forecast < 60 F
Observed > 60 F      |      0.5        |      0.05
Observed < 60 F      |      0.05       |      0.4

(I've just made up the numbers here. For a real application, you have to check out your local forecast source.)

Given this verification matrix, the forecast score is 1*0.5 + (-1)*0.05 + (-5)*0.05 + 1*0.4, or 0.6. Not too bad versus a perfect score of 1. But there is room to improve. To make me happier, the best thing for the forecasters to do is to make fewer mistakes predicting weather to be above 60 when it turns out to be colder. Given my preferences, they can be wrong 5 times as often in predicting cold as in predicting warm, for the same cost to me. So, if they have to hedge the forecast, they should hedge cold.
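
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that calculation. The dictionary layout and the names (PAYOFF, VERIFICATION, decision, forecast_score) are just my illustration, not anything the NWS uses, and the frequencies are the same made-up ones as above. 'cold' means below 60 F, 'warm' means above.

```python
# Payoff for each (my decision, what I should have done) pair.
# 'cold' = observed below 60 F, 'warm' = observed above 60 F.
PAYOFF = {
    ("gloves", "cold"): +1,     # wore gloves, needed them
    ("gloves", "warm"): -1,     # wore gloves, minor nuisance
    ("no_gloves", "cold"): -5,  # no gloves, very unpleasant run
    ("no_gloves", "warm"): +1,  # no gloves, didn't need them
}

# Fraction of days in each (forecast, observed) combination.
# Made-up numbers; substitute your local forecast's record.
VERIFICATION = {
    ("warm", "warm"): 0.50,
    ("cold", "warm"): 0.05,
    ("warm", "cold"): 0.05,
    ("cold", "cold"): 0.40,
}


def decision(forecast):
    """My rule: wear gloves whenever the forecast is below 60 F."""
    return "gloves" if forecast == "cold" else "no_gloves"


def forecast_score(verification, payoff):
    """Sum of frequency times payoff over every forecast/observed pair."""
    return sum(
        freq * payoff[(decision(forecast), observed)]
        for (forecast, observed), freq in verification.items()
    )


score = forecast_score(VERIFICATION, PAYOFF)
print(round(score, 2))
```

Changing the -5 to whatever you think a cold-handed run is worth shows quickly how much the 'usefulness' of the same forecast depends on who is using it.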

There's a different side to evaluating forecasts in terms of decisions.  Namely, the forecast can have a perfect score even if it is 'always wrong'.  The forecast is 'wrong' if the observed temperature is different (at all) from the forecast.  But as long as I always make the correct decision, I don't care.  If the forecast is 68 F (20 C), and observed is 63 F (17 C), I'm still happy -- I correctly did not wear gloves.  If it was wrong as much in the other direction, I still don't care.  For 73 F, I still don't wear gloves.  On the other hand, if the forecast is 61 F, and observed is 59 F, I care a lot -- that's a -5 point forecast.  If it's forecast 61 F and observed is 63 F, perfect forecast -- for my decision.  Two things here -- my real scoring system is not symmetric (a +2 degree error isn't always as bad as a -2 degree error), and it isn't homogeneous (how much I care about the +2 error also depends on what the forecast was).
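
The same idea can be sketched in Python for individual forecasts. The function names here (wears_gloves, decision_payoff) are hypothetical, and treating exactly 60 F as 'no gloves' is my assumption, since the rule above only says 'below 60'.

```python
THRESHOLD_F = 60.0  # gloves go on below this temperature


def wears_gloves(temp_f):
    """True if this temperature calls for gloves (strictly below 60 F)."""
    return temp_f < THRESHOLD_F


def decision_payoff(forecast_f, observed_f):
    """Score one forecast by the decision it led to, not by its error."""
    wore = wears_gloves(forecast_f)      # what the forecast told me to do
    needed = wears_gloves(observed_f)    # what I should have done
    if wore == needed:
        return 1                         # correct decision, any size of error
    return -1 if wore else -5            # needless gloves vs. cold hands


# (forecast, observed) pairs from the examples in the text, plus 59/61.
examples = [(68, 63), (68, 73), (61, 59), (61, 63), (59, 61)]
for forecast_f, observed_f in examples:
    error = observed_f - forecast_f
    print(forecast_f, observed_f, error, decision_payoff(forecast_f, observed_f))
```

A 5 degree error at a forecast of 68 F costs me nothing, while a 2 degree error at 61 F costs 5 points in one direction and nothing in the other. That is the asymmetry and the inhomogeneity in one small table.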

Most of the scores at the top are symmetric and homogeneous.  They kind of have to be, because the NWS doesn't know everybody's decision process and payoff matrix.  They just try to drive down the errors everywhere (with some extra attention to the US) and hope that makes everybody happier. 

There's a lot more to evaluating forecasts.  This is just a start.  I'll be interested in other people's weather-affected decisions and how you evaluate forecast quality.

*Ok, I've been slow to post, and by now we've had several 'historic' storms.  The one I'm referring to is the January 26-28 storm that was indeed historic for Boston, but merely relatively heavy in NYC.


Everett F Sargent said...


The people I once knew (some I still know but it's been a while) who do a similar thing with energy based water wave models (Fleet Numerics real time and nowcast) and the USACE (hindcasts) pretty much have the same problems in terms of spatial-temporal drift (mostly spatial AFAIK). I'd like to think they get the long term statistics close to reality.

Haven't been a runner in like two decades. Wind, temperature, humidity and time of day were most important, tried to get the clothing right, but you can always take something off until you're almost naked, can't do the converse, so it was always a good idea to be a bit conservative (more clothing) versus the opposite.

Anonymous said...

Ah, but a truly good forecast should not only give you a best estimate, but a range, and allow you to apply your personal decision factors. Eg.,

"61 +- 2 degrees": best estimate is above 60, but given you hate being cold, you'd probably wear gloves. It isn't the forecast that should hedge, it is the user.

(also I'd say that the key comparison is not "is it positive" but "can I beat the best naive strategy". And in this case, the best naive strategy is, "always wear gloves", with a payoff of 0.45*+1 + 0.55*-1, for a score of -0.1. At less than that score, the forecast is not useful)


Robert Grumbine said...

@Everett: somehow I got about 10 copies of your comment.

@mmm: Well, you're right. But also more elaborate than I wanted to start.