11 March 2010

How can annual average temperatures be so precise?

The comments on what should be reproducible raise the question in this post's title -- since a given thermometer reading is only good to within, say, 0.8 degrees, how can we claim to know the annual average temperature for the globe to 0.01 degrees?

One thing to remember is that the 0.8 is not the size of the error on every single observation.  Some will be extremely close to correct, and some will be 0.4 off -- large, but not the full 0.8.  The 0.8 is the range within which we expect 95% of the observations to fall.  Still, with errors that large, how can we get a global average that's within 0.01?

I hit on an experimental way to demonstrate this, without requiring you all to set up thousands of meteorological stations around the world and then collect observations for a year.  Namely, get yourself 8 coins.  If you prefer making computers do things, a spreadsheet will work as well, or you can write the program from scratch.

What we're doing here is saying that the thermometer's actual reading is the true temperature it should have given you, plus a random error.  We'll use the coins to tell us how large the error was.

Toss the coins.  Heads means 'too warm by 0.1 degrees', and tails means 'too cold by 0.1 degrees'.  If all 8 coins come down heads, or all come down tails, you've got the 0.8 degree error that can indeed sometimes happen with a real thermometer.  If it's 7 heads and 1 tail, then the error is 0.6 degrees too warm.  And so on.  Just count up how many more heads than tails, and that's how many tenths of a degree too warm your thermometer was.  If it's more tails than heads, the extra tails tell you how much too cold it was.
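If you'd rather let the computer toss the coins, a minimal sketch of a single simulated observation might look like the following.  This is my own illustration, not the coin.c program linked below, and it assumes rand() % 2 is a fair enough coin for the purpose.

#include <stdio.h>
#include <stdlib.h>

/* One simulated thermometer error: 8 coin flips,
   heads = +0.1 degrees, tails = -0.1 degrees. */
double one_observation_error(void)
{
    int heads = 0, i;
    for (i = 0; i < 8; i++) {
        if (rand() % 2)          /* one coin toss */
            heads++;
    }
    /* heads minus tails, at 0.1 degrees per excess coin */
    return 0.1 * (heads - (8 - heads));
}

int main(void)
{
    srand(1);                    /* fixed seed so the run is repeatable */
    printf("error = %.1f degrees\n", one_observation_error());
    return 0;
}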

That's for one day's temperature observation.  You'll see quite a few large errors, several tenths of a degree.  But also a fair number of zero errors -- really meaning errors less than a tenth of a degree.  Still, individual observations will show large fluctuations (errors).

Now repeat the observations many times, let's say 365.  For some reason that number came to mind.  Average every 4 errors.  Every one of the observations has some error.  But you'll see that the average of 4 observations shows smaller error magnitudes than the individual observations do.  Some large positive errors get balanced by large negative ones.  Or maybe they're all positive, but some are large and some are small -- so the average isn't as big.  Now average every 16 observations ('January 1' through 16, 17th through February 1, and so on).  These averaged errors are smaller still.  But do the experiment yourself; don't take my word.
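Here is one way the full experiment could be coded -- again a sketch of my own, not the linked coin.c, and the exact numbers you get will depend on your random seed.  It prints the typical (mean absolute) error of single days, of 4-day averages, of 16-day averages, and of the annual average.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NDAYS  365
#define NCOINS 8

/* One simulated observation error, in degrees. */
static double coin_error(void)
{
    int heads = 0, i;
    for (i = 0; i < NCOINS; i++)
        heads += rand() % 2;
    return 0.1 * (2 * heads - NCOINS);   /* heads minus tails, in tenths */
}

/* Mean absolute value of n numbers -- a 'typical error size'. */
static double mean_abs(const double *x, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += fabs(x[i]);
    return s / n;
}

int main(void)
{
    double err[NDAYS], avg4[NDAYS / 4], avg16[NDAYS / 16], annual = 0.0;
    int i, j;

    srand(42);
    for (i = 0; i < NDAYS; i++) {
        err[i] = coin_error();
        annual += err[i];
    }
    /* Averages of consecutive groups of 4 and of 16 (complete groups only). */
    for (i = 0; i < NDAYS / 4; i++) {
        avg4[i] = 0.0;
        for (j = 0; j < 4; j++)
            avg4[i] += err[4 * i + j];
        avg4[i] /= 4.0;
    }
    for (i = 0; i < NDAYS / 16; i++) {
        avg16[i] = 0.0;
        for (j = 0; j < 16; j++)
            avg16[i] += err[16 * i + j];
        avg16[i] /= 16.0;
    }

    printf("typical single-day error:     %.3f\n", mean_abs(err, NDAYS));
    printf("typical 4-day average error:  %.3f\n", mean_abs(avg4, NDAYS / 4));
    printf("typical 16-day average error: %.3f\n", mean_abs(avg16, NDAYS / 16));
    printf("annual average error:         %.4f\n", annual / NDAYS);
    return 0;
}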

I've worked out (details at the bottom for the math-lovers) the statistical expectation here.  Even though we see a lot of individual measurements with errors of 0.2 and 0.4, probably a few of 0.6 and even a couple 0.8s, when we add up all the errors, positive errors being cancelled by negative errors, the statistics tell us that our annual average has a standard error of 0.0148 degrees.  Far smaller than the individual observations -- errors in one direction cancel errors in the other, for the most part.  We still have some error expected, just far less.  That's for going from one observation with standard error of 0.28 (the 0.8 is a bad case outcome) to the average of 365 observations.

What do you think happens if we average 3000 records (the ballpark for number of surface stations), each of them having a standard error of 0.0148 degrees?  Same idea, our expected error goes down again.  In this example, to 0.00027 degrees.  Wow!  The real world is not quite as friendly as this example, but clearly by averaging a lot of observations, we can get to answers that are far closer to correct than any single observation.  [updated -- real world is not quite as friendly as this example.]
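A quick way to check those numbers yourself, using the s_0/sqrt(N) rule worked out in the math section below (this is just the quoted arithmetic, not anything from the linked program):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double s_single = sqrt(8 * 0.1 * 0.1);     /* 8 coins at 0.1 deg each: ~0.28 */
    double s_annual = s_single / sqrt(365.0);  /* one station, one year: ~0.0148 */
    double s_global = s_annual / sqrt(3000.0); /* ~3000 such records: ~0.00027 */

    printf("single observation: %.4f\n", s_single);
    printf("annual average:     %.4f\n", s_annual);
    printf("3000-station mean:  %.5f\n", s_global);
    return 0;
}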


My coin tossing program is available at http://www.radix.net/~bobg/blogsupport/coin.c.
The graph shows the observation errors for each of the 365 observations.  Notice that I did encounter 2 errors of 0.8 degrees.  Nevertheless, the error in the average was 0.0082, almost twice as good as I expected.


For a wildly different example -- looking at variable stars by eye -- see the article Horatio mentioned, Tamino's The Power of Large Numbers.  This is a very powerful, very general principle.

Re-Update:
See also False Precision – It Doesn’t Matter by Chad, for a demonstration using global model output sampled in a way similar to how the historical climate network data are collected.


For the math:
This is an example of binomial sampling -- repeated Bernoulli trials.  The expected number of heads is N*p, where N is the number of trials and p is the probability of the coin coming up heads.  Since our case is taking the difference between number of heads and number of tails, the expected difference is 0 -- p(head) = p(tail) = 0.5.  The expected standard deviation is sqrt(N*p*(1-p)).  N is 365*8, giving us an expected standard deviation in number of heads of 27.  Again, we're taking heads minus tails as the fundamental quantity, so 27 extra heads also means 27 fewer tails, and heads minus tails deviates by 54.  The expected total deviation is therefore 54*0.1 degrees (0.1 degrees per excess coin).  Now divide by 365, our number of trials, to get the average, and the expected standard error is 0.0148 degrees.

You can work it back through, and notice that the expected standard error is proportional to s_0/sqrt(N), where N is number of samples averaged, and s_0 is the standard error of an individual observation.  The more observations you have, the better the average can be known.
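As a check, here are both routes to the same number -- the coin-counting (binomial) route above and the s_0/sqrt(N) shortcut; both should print about 0.0148.  (My own verification, nothing from the linked program.)

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N        = 365.0 * 8.0;               /* coin flips in a year */
    double sd_heads = sqrt(N * 0.5 * 0.5);       /* ~27 excess heads */
    double route1   = 2.0 * sd_heads * 0.1 / 365.0;  /* heads-minus-tails route */

    double s0     = sqrt(8.0) * 0.1;             /* one observation: ~0.28 */
    double route2 = s0 / sqrt(365.0);            /* s_0/sqrt(N) shortcut */

    printf("binomial route:   %.4f\n", route1);
    printf("s0/sqrt(N) route: %.4f\n", route2);
    return 0;
}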



It would be better if the number of trials, number of coins, and error size (the 0.1 degrees I mention above) were all arguments to the program.  Better still to make the seed for the random number generator an argument, so that you can run many independent trials without having to edit the program.  Still, the program gives you a starting point.
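A hypothetical sketch of what that might look like -- not a patch to the actual coin.c -- with number of trials, number of coins, error step, and seed taken from the command line:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int ntrials = 365, ncoins = 8;     /* defaults matching the post */
    double step = 0.1;                 /* degrees per excess coin */
    unsigned seed = 1;
    int i, j, heads;
    double sum = 0.0;

    if (argc > 1) ntrials = atoi(argv[1]);
    if (argc > 2) ncoins  = atoi(argv[2]);
    if (argc > 3) step    = atof(argv[3]);
    if (argc > 4) seed    = (unsigned) atoi(argv[4]);

    srand(seed);
    for (i = 0; i < ntrials; i++) {
        heads = 0;
        for (j = 0; j < ncoins; j++)
            heads += rand() % 2;
        sum += step * (2 * heads - ncoins);   /* this trial's error */
    }
    printf("average error over %d trials: %.4f\n", ntrials, sum / ntrials);
    return 0;
}

Run it as, say, ./coin 365 8 0.1 12345, and vary the last argument to get independent trials.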

7 comments:

Chad said...

I've taken up this matter of false precision on my blog: http://treesfortheforest.wordpress.com/2009/12/14/false-precision-it-doesnt-matter/

carrot eater said...

"The real world is quite as friendly as this example,"

Is a 'not' missing in there?


By the way, I very much appreciate the concept of your blog.

Robert Grumbine said...

Chad:

Nice article, I'll update my main note with your link.

Carrot:
Also noted in email by Hank. D'Oh. Correcting the note now.

And thank you.

Chad said...

One note: my post doesn't use the US surface temperature observing network to make the point about false precision. It uses a climate model and looks at global averages.

Robert Grumbine said...

Chad:
Sorry. It's what I get for reading too casually (just checking that your note was on point). Updated mention now.

Michael Hauber said...

I think it's worthwhile delving into the topic of systematic vs random errors. Random errors are purely random, and if you know the actual error for one measurement, that will not give you any information on the error of the next measurement. Systematic errors are not random, and the same error can be repeated over many measurements.

If the error is purely random, then the overall error reduces as the sample number increases. But if the error is systematic, then the overall error does not always reduce as the sample size increases.

For example, a thermometer may have a calibration error so that the scale is slightly off and always reads 0.5 degrees too high. This error repeats for every single measurement, and if you took 10,000 measurements with the same thermometer and averaged them all, the actual error would be quite close to 0.5 degrees, and would not become small.

If you know nothing else, but the accuracy of a measurement, then it is not a bad rule of thumb to be pessimistic, and assume that the error could be 100% systematic. So if thermometers have an accuracy of +/- 1 degree, applying this rule of thumb would say that the error cannot be assumed to be any smaller than +/- 1 degree.

Trouble arises when someone is taught this reasonable rule of thumb and then insists that it is an unalterable law of measurement, without considering why the rule was a good idea in the first place, or the difference between good and bad places to apply it.

In the case of global temperature measurements the primary issue is an error in trend. So if the error is systematic it may not matter at all - if the thermometer is calibrated wrong, or there is a microsite bias, the temperature will be too high now, and too high 100 years ago, so the trend is still the same and has no error (due to this cause).

It is when a systematic error in temperature measurement changes that we could be in trouble. So if thermometers were all calibrated to give a cool reading 50 years ago, and are all calibrated differently today to give a warmer reading, we would have a big problem. It seems a reasonable assumption that, with many different thermometers, the biases between them will tend to act like random errors and at least partially cancel out.

Robert Grumbine said...

Michael:
Random vs. systematic error is part of why I mentioned in the article that nature is not as nice as pictured. The distinction is one that seemed worth a full post of its own.

On the other hand, it seems strikingly unrealistic to assume that all error in measurement might be systematic. You'd have to believe that your instruments had absolutely nothing that could contribute to random errors, which is unheard of for any real system.

In terms of effects on trends, a systematic bias disappears since the trend will be the difference between the true temperature in 1900 (plus its systematic bias) and the true temperature in 2000 (plus the systematic bias). When we do the subtraction, the systematic bias cancels.

At least it does, as you note, if the systematic bias is unchanged. Not so much if there's a systematic bias in 1900, and a different one in 2000. In that case, the change in bias itself could start to produce an apparent (but false) trend. But then you go back to the instruments themselves, their usage, and their calibration (and re-calibration), and examine how large the systematic biases are or can be versus the size of signal you are trying to examine.

This concern is very old, and had much to do with international standardization of the surface thermometer networks and their exposures in the 1800s and since. Instrument systematic biases are well under 1 degree and have been for a long time. Changes in biases are even smaller. A friend told me in graduate school that this played a role in selecting 30 years as the standard climate averaging period -- long enough that instrument system changes weren't a significant part of your signals. (But I never asked him for the reference, and I've never found it myself.)

In other news, the SST problem you mentioned was fixed a couple weeks ago. I hope that things look more sensible. If not, please email the addresses on the RTG web page.