The comments on what should be reproducible raise the question in the subject line -- since a given thermometer reading is only good to within, say, 0.8 degrees, how can we claim to know the annual average temperature for the globe to 0.01 degrees?
One thing to remember is that the 0.8 is not the size of the error on every single observation. Some will be extremely close to correct, and some will be 0.4 off -- large, but not the 0.8. The 0.8 is the range that we expect 95% of the observations to fall within. Still, with errors that large, how can we get a global average that's within 0.01?
I hit on an experimental way to demonstrate this, without requiring you all to set up thousands of meteorological stations around the world and then collect observations for a year. Namely, get yourself 8 coins. If you prefer making computers do things, a spreadsheet will work as well, or you can write the program from scratch.
What we're doing here is saying that the thermometer's actual reading is the true temperature it should have given you, plus a random error. We'll use the coins to tell us how large the error was.
Toss the coins. Heads means 'too warm by 0.1 degrees', and tails is 'too cold by 0.1 degrees'. If all 8 coins come down heads, or all 8 come down tails, you've got the 0.8 degree error that can indeed sometimes happen with a real thermometer. If it's 7 heads and 1 tail, then the error is 0.6 degrees too warm. And so on. Just count up how many more heads than tails, and that's how many tenths of a degree too warm your thermometer was. If it's more tails than heads, the number of extra tails is how many tenths too cold it was.
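If you'd rather let the computer toss the coins, one observation can be sketched in a few lines of Python. This is only an illustration of the rule above, not the coin.c program; the function name `thermometer_error` and the seed value are made up for the example.

```python
import random

def thermometer_error(rng, n_coins=8, step=0.1):
    """Error of one simulated reading: (heads - tails) * step, in degrees."""
    heads = sum(rng.random() < 0.5 for _ in range(n_coins))
    tails = n_coins - heads
    return (heads - tails) * step

rng = random.Random(42)  # arbitrary seed, fixed only so reruns match
errors = [thermometer_error(rng) for _ in range(10)]
print([round(e, 1) for e in errors])
```

With 8 coins the difference between heads and tails is always even, so the simulated errors come out as 0, 0.2, 0.4, 0.6, or 0.8 degrees, warm or cold.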
That's for one day's temperature observation. You'll see quite a few large errors, several tenths of a degree. But also a fair number of zero errors -- really meaning errors less than a tenth of a degree. Still, individual observations will show large fluctuations (errors).
Now repeat the observations many times, let's say 365. For some reason that number came to mind. Average every 4 errors. Every one of the observations has some error. But you'll see that the average of 4 observations shows smaller error magnitudes than the individual observations do. Some large positive errors get balanced by large negative ones. Or maybe they're all positive, but some are large and some are small -- so the average isn't as big. Now average every 16 observations ('January 1' through 16, 17th through February 1, and so on). The averages of these errors are even smaller. But do the experiment yourself; don't take my word.
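Here is a rough Python sketch of that experiment, again separate from coin.c. It generates 365 simulated errors and prints the root-mean-square size of the averages over groups of 1, 4, and 16 observations. The helper names and the seed are assumptions for illustration, and leftover observations that don't fill a whole group are simply dropped.

```python
import random

def thermometer_error(rng, n_coins=8, step=0.1):
    """Error of one simulated reading: (heads - tails) * step, in degrees."""
    heads = sum(rng.random() < 0.5 for _ in range(n_coins))
    return (2 * heads - n_coins) * step

def block_averages(values, size):
    """Average consecutive groups of `size` values; drop an incomplete last group."""
    n_full = len(values) // size
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(n_full)]

def rms(values):
    """Root-mean-square size of a list of errors."""
    return (sum(v * v for v in values) / len(values)) ** 0.5

rng = random.Random(2016)  # arbitrary seed
daily = [thermometer_error(rng) for _ in range(365)]
for size in (1, 4, 16):
    print(size, round(rms(block_averages(daily, size)), 3))
```

The exact numbers depend on the seed, but the typical error should shrink by roughly half each time the group size goes up by a factor of 4.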
I've worked out (details at the bottom for the math-lovers) the statistical expectation here. Even though we see a lot of individual measurements with errors of 0.2 and 0.4, probably a few of 0.6 and even a couple 0.8s, when we add up all the errors, positive errors being cancelled by negative errors, the statistics tell us that our annual average has a standard error of 0.0148 degrees. Far smaller than the individual observations -- errors in one direction cancel errors in the other, for the most part. We still have some error expected, just far less. That's for going from one observation with standard error of 0.28 (the 0.8 is a bad case outcome) to the average of 365 observations.
What do you think happens if we average 3000 records (the ballpark for number of surface stations), each of them having a standard error of 0.0148 degrees? Same idea, our expected error goes down again. In this example, to 0.00027 degrees. Wow! The real world is not quite as friendly as this example, but clearly by averaging a lot of observations, we can get to answers that are far closer to correct than any single observation. [updated -- real world is not quite as friendly as this example.]
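The arithmetic behind those three numbers fits in a few lines of Python. This assumes, as the coin model does, that every error is independent of every other one, which is the "friendly" part that the real world does not fully grant. The variable names are just for this sketch.

```python
import math

step = 0.1         # degrees per coin
n_coins = 8        # coins per observation
n_days = 365       # observations per station per year
n_stations = 3000  # ballpark number of surface stations

s_single = step * math.sqrt(n_coins)         # one observation
s_annual = s_single / math.sqrt(n_days)      # one station's annual average
s_global = s_annual / math.sqrt(n_stations)  # average over all stations

print(f"{s_single:.2f} {s_annual:.4f} {s_global:.5f}")
# prints: 0.28 0.0148 0.00027
```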
My coin tossing program is available at http://www.radix.net/~bobg/blogsupport/coin.c.
For a wildly different example -- looking at variable stars by eye -- see the article by Tamino that Horatio mentioned, The Power of Large Numbers. This is a very powerful, very general principle.
See also False Precision – It Doesn’t Matter by Chad, for a demonstration using global model output sampled in a way similar to how the historical climate network data are collected.
For the math:
This is an example of repeated Bernoulli trials, so the number of heads follows a binomial distribution. The expected number of heads is N*p, where N is the number of tosses and p is the probability of the coin coming up heads. Since our case is taking the difference between the number of heads and the number of tails, and p(head) = p(tail) = 0.5, the expected difference is 0. The standard deviation expected in the number of heads is sqrt(N*p*(1-p)). N is 365*8 = 2920, giving us an expected standard deviation in number of heads of about 27. Again, we're taking heads - tails as the fundamental quantity, and 27 excess heads means 27 fewer tails, so the standard deviation of heads - tails is 54. In degrees, the expected total deviation is 54*0.1 = 5.4 (the 0.1 degrees per excess). Now divide by 365, our number of observations, to get an average, and the expected standard error is 0.0148.
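Written out step by step, for those who prefer symbols (the sigma subscripts are labels chosen here, not notation from elsewhere in the post):

```latex
\begin{align*}
N &= 365 \times 8 = 2920, \qquad p = 0.5 \\
\sigma_{\text{heads}} &= \sqrt{N p (1 - p)} = \sqrt{2920 \times 0.25} = \sqrt{730} \approx 27.0 \\
\sigma_{\text{heads} - \text{tails}} &= 2\,\sigma_{\text{heads}} \approx 54.0
  \qquad \text{since heads} - \text{tails} = 2\,\text{heads} - N \\
\sigma_{\text{annual}} &= \frac{54.0 \times 0.1}{365} \approx 0.0148
\end{align*}
```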
You can work it back through, and notice that the expected standard error is s_0/sqrt(N), where N is now the number of observations averaged (365 here, not the 2920 coin tosses) and s_0 is the standard error of an individual observation: 0.28/sqrt(365) = 0.0148. The more observations you have, the better the average can be known.
It would be better if the number of trials, number of coins, and error size (the 0.1 degree I mention above) were all arguments to the program. Better still to make the seed for the random number generator an argument, so that you can run many independent trials without having to edit the program. Still, the program gives you a starting point.
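As one possible shape for such a parameterized version, here is a Python sketch. It is not a rewrite of coin.c; the flag names (`--obs`, `--coins`, `--step`, `--seed`, `--runs`) and function names are invented for this example. It runs many independent simulated years and compares the spread of their average errors with the s_0/sqrt(N) expectation.

```python
import argparse
import random

def annual_mean_error(rng, n_obs, n_coins, step):
    """Mean error over n_obs simulated readings of n_coins tosses each."""
    total = 0
    for _ in range(n_obs):
        heads = sum(rng.random() < 0.5 for _ in range(n_coins))
        total += 2 * heads - n_coins  # heads minus tails
    return total * step / n_obs

def spread_of_means(n_runs, n_obs, n_coins, step, seed):
    """RMS of the mean error across n_runs independent simulated years."""
    rng = random.Random(seed)
    means = [annual_mean_error(rng, n_obs, n_coins, step) for _ in range(n_runs)]
    return (sum(m * m for m in means) / n_runs) ** 0.5

def main(argv=None):
    parser = argparse.ArgumentParser(description="Coin-toss thermometer experiment")
    parser.add_argument("--obs", type=int, default=365, help="observations per year")
    parser.add_argument("--coins", type=int, default=8, help="coins per observation")
    parser.add_argument("--step", type=float, default=0.1, help="degrees per coin")
    parser.add_argument("--seed", type=int, default=0, help="random number seed")
    parser.add_argument("--runs", type=int, default=200, help="simulated years")
    args = parser.parse_args(argv)

    observed = spread_of_means(args.runs, args.obs, args.coins, args.step, args.seed)
    expected = args.step * (args.coins / args.obs) ** 0.5  # s_0 / sqrt(N)
    print(f"observed {observed:.4f}  expected {expected:.4f}")

if __name__ == "__main__":
    main()
```

Saved under a file name of your choosing, it could be run with different seeds (for example `--seed 7 --runs 1000`) to see how closely the observed spread tracks the expected 0.0148.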