07 January 2015

Edging towards a climatology

I say edging towards a climatology because the process of going from here, a state of not really knowing what the climatology is, to there, a state of having pretty solid knowledge, isn't one I like to take in a single jump.  Even though scientists in professional journals present their work as if it were done in one jump, we seldom actually work that way.  Plus, for our purposes here, it's more meaningful to proceed by successive approximations.

For data, I'm going to use the Climate Forecast System Reanalysis (v2).  I'll also be using the high resolution (in time and space) versions of the data.  This leads to some pretty big files (unpacked, about 2 GB per month, and remember there'll be 360 months for a 30 year climatology).  So you might want to go with the lower resolution for your own initial exploration.

To start with, let's look at the 2 meter air temperature, where I've converted temperatures to Celsius (from Kelvin).  30 C = 86 F, 0 C = 32 F.  The total planetary range is a bit over 70 C from the very coldest areas (Antarctic Plateau -- below -40 C) to the warmest (pretty much the whole tropics).

We also see in the Himalayas and Andes a good illustration that higher elevations are colder.  A different matter is that this map display (as all map projections do) distorts the areas of the earth.  In truth, half the earth's surface lies between 0 and 30 degrees latitude, a third between 30 and 60, and only a sixth between 60 and 90.  Handy figure, by the way: 10 million square kilometers is about the size of the US, Canada, Brazil, or China.  All are roughly that size, and it's a nice round number.  (Antarctica is somewhat larger.)
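
If you want to check those latitude fractions yourself, the area of a band on a sphere is proportional to the difference of the sines of its bounding latitudes.  A quick sketch in Python (nothing here is specific to the reanalysis, just the spherical geometry):

import math

# Fraction of the globe between two latitudes, counting both hemispheres:
# the band from lat1 to lat2 degrees north, plus its mirror in the south,
# covers sin(lat2) - sin(lat1) of the sphere's surface.
def band_fraction(lat1_deg, lat2_deg):
    return math.sin(math.radians(lat2_deg)) - math.sin(math.radians(lat1_deg))

for lo, hi in [(0, 30), (30, 60), (60, 90)]:
    print(f"{lo:2d}-{hi:2d} degrees latitude: {band_fraction(lo, hi):.3f} of the earth")
# Prints about 0.500, 0.366, and 0.134 -- roughly a half, a third, and a sixth.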

If we average the temperature properly, by area, the global mean in 1981 was 14.5 C.  Given the map, that's a little surprising.  So let's split out how much of the earth (area) is in different temperature bins, by degrees C:
So, although the global average is 14.5 C, the most common annual average bin is 27 C.  But things are extremely asymmetric: you're much more likely to land in an area whose annual average is colder than 27 C than one that's warmer.  Only about 1.6 million km^2 is warmer than 28 C, and a comparable area is colder than -44 C.  (1.6 million km^2 is about 0.3% of the earth's surface.)

This is a case where the mean (14.5) is far from the mode (27).  The median (half the area is warmer, half colder) is about 18.5 -- far from both the mean and the mode.  That shows why we want more than one way of describing a distribution, and why a normal curve (bell curve, Gaussian distribution) is not always the right way to describe your data.
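
If you'd like to reproduce this kind of summary yourself, here's a minimal sketch of the bookkeeping (Python with numpy).  The regular 1 degree grid and the random temperatures are stand-ins rather than the actual CFSR field, so the printed numbers won't match the ones above; the point is the cos(latitude) area weighting and the area-weighted median and histogram.

import numpy as np

# Stand-in annual-mean temperature field on a regular 1 degree lat-lon grid.
lats = np.arange(-89.5, 90.0, 1.0)
temp = np.random.uniform(-60.0, 30.0, size=(lats.size, 360))   # degrees C

# On a regular lat-lon grid, cell area is proportional to cos(latitude).
weights = np.cos(np.radians(lats))[:, np.newaxis] * np.ones_like(temp)
weights /= weights.sum()

# Area-weighted mean.
mean = np.sum(weights * temp)

# Area-weighted median: sort the temperatures and find where the
# accumulated area first passes one half.
order = np.argsort(temp, axis=None)
cum_area = np.cumsum(weights.ravel()[order])
median = temp.ravel()[order][np.searchsorted(cum_area, 0.5)]

# Most common 1 degree bin, by area (the peak of the area-weighted histogram).
hist, edges = np.histogram(temp, bins=np.arange(-70, 41, 1), weights=weights)
mode_bin = edges[np.argmax(hist)]

print(f"mean {mean:.1f} C, median {median:.1f} C, most common bin {mode_bin:.0f} C")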

I'm also from an area where we claim the temperatures change enormously day to day and winter to summer.  The usual single-number summary of that variability is the standard deviation, the square root of the variance -- though it only tells the whole story if the variability truly follows a normal (bell curve) distribution, which I doubt it does.  Here's the map for that:

That's an awful lot of area with extremely little variability -- most of the globe has less than a 4 C 'standard deviation', and a huge amount of it less than 2.  I grew up around Chicago, which is about 12 C 'standard deviation', putting it far out on the high end of the distribution of variability.  Our regional pride is intact!  Notice, please, that the intervals on the color bar are not even.  Each color represents about 10% of the earth's surface.

In general, oceans are areas of low variability and land of higher variability.  Higher latitudes show higher variability.  The farther you are to the east of a water-to-land transition, the more variable the climate (eastern Siberia, central North America -- limited by Hudson Bay).  High elevations can be high variability (the Himalayas).  The highest variability in the oceans is off the east coasts of continents.

Again, let's split things out by bins.  Global mean for 'standard deviation' is 4.5 C (woohoo, Chicago nearly triples that!):
So we have two peaks -- one near 0.5, and one around 2.5 -- and maybe a third around 8.  The surprising and striking thing, to me, is that most of the earth has rather modest variability.  Half the area of the earth has a 'standard deviation' below 2.7 C.
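
The variability map is the same sort of bookkeeping: at each grid point, take the year of hourly temperatures and compute their standard deviation, then summarize the resulting map with the same area weights.  Another rough sketch, on a coarse stand-in array rather than the real CFSR files (which are far too big to handle this casually):

import numpy as np

# Stand-in hourly temperatures for one year on a coarse 5 degree grid:
# shape is [hours, nlat, nlon].  The real processing reads the reanalysis
# files month by month rather than holding a whole year in memory.
nhours, nlat, nlon = 8760, 36, 72
lats = np.arange(-87.5, 90.0, 5.0)
hourly = 15.0 + 10.0 * np.random.randn(nhours, nlat, nlon)   # degrees C

# Per-gridpoint standard deviation across the year: this is the map.
std_map = hourly.std(axis=0)

# The same cos(latitude) area weighting as before gives the global mean of
# the map; the weighted-median trick above gives the half-the-area figure.
weights = np.cos(np.radians(lats))[:, np.newaxis] * np.ones((nlat, nlon))
weights /= weights.sum()
print(f"area-weighted mean of the 'standard deviation' map: "
      f"{np.sum(weights * std_map):.1f} C")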

I've recently seen people commenting that climate change of a few degrees is trivial because day to day and seasonal variation is so much larger.  It's starting to look like the variability over most of the globe is actually pretty small -- in much of the world even smaller than plausible climate sensitivities.

Armed with this exploration, I've got a better handle on how to proceed in making a 30 year climatology.  One major point: I need some pretty careful numerical analysis to preserve the rather subtle variations that most of the globe shows.



A bit of mathematics:
The first point is that the data files are in Kelvin.  That's good for universal understanding -- if a temperature is -35, you can't be sure whether it's Fahrenheit or Celsius -- but poor for doing arithmetic on computers.

The thing is, computers don't truly carry out real number arithmetic.  This is no surprise to the scientists who work on data analysis; their job includes understanding numerical analysis.  But I don't think I've ever seen it discussed in a climate blog before, so here goes.  I'm going to be adding up the hourly temperatures for all of 1981 (to start with).  That's 8760 hours*.  The temperatures have been saved to a precision of 0.001 K.  So we have numbers (8760 of them) to add up that look like:
273.151
and, innocent as that looks, adding up thousands of them is enough to make a numerical analyst rather concerned.
When we add up the numbers (to compute the average we add up all the numbers and then divide by how many we have), we wind up with a number like
2392802.76
Or, rather, that's what we'd get if computers had infinite precision, or if we did it by hand (any volunteers?).
Count this off -- there are 9 digits involved.  Ordinary arithmetic on computers uses a number representation that can only show 6-7 digits exactly.  So, although the original number, 273.151, fits within those 6-7 digits, after a while we lose digits.  Let's say this really is our average.  After 1000 hours of the average, our number is (say)
273151.0, and we add the next observation of 273.252 (a little different from the average, but we don't expect every number to equal the average).  Both have 6 digits, so you might think we're safe.  We're not.  When the computer goes to add them, it sees something like:
273151.0
   273.252

You and I know that this should add to 273424.252.  But the computer can only deal with 6 digits here, and will give you an answer of 273424.0** -- it will truncate the smaller number before adding it to the larger one.  Losing 0.252 of a degree may not seem like much.  But remember that the computer will be truncating every single addition from here on -- another 7760 of them.  After enough truncations, your accuracy is compromised.  (Note: the computer is actually working in binary, not decimal, but the principle holds.  Only so many bits are used to represent a number.)
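
You don't have to take my word for it.  Here's a tiny check in Python with numpy, where float32 plays the part of ordinary single precision and float64 plays double.  As the ** note below says, exactly which digits go depends on behind-the-scenes details, but some of the trailing digits will go:

import numpy as np

big = np.float32(273151.0)     # the running sum after about 1000 hours
small = np.float32(273.252)    # the next hourly observation

single = big + small                                   # single precision sum
double = np.float64(273151.0) + np.float64(273.252)    # double precision sum

print(single)                        # not 273424.252 -- trailing digits lost
print(double)                        # 273424.252
print(double - np.float64(single))   # the piece single precision dropped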

So ... as a first step, use Celsius instead.  In this case, the numbers vary from -60 (Antarctic) to +40 (hot areas).  Now we only have 4-5 digits in our temperatures (areas close to freezing will have 4 -- 1.012 C, for instance).  So there's more protection against numerical issues.  (In practice I'm using double precision arithmetic, but this only gets you so far.  Far enough for the average to be reliable even in K.  But finding the standard deviation requires computing and summing squares of numbers, which rapidly runs you out of digits again.)
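
To make that concrete, here's a small experiment along the same lines -- a sketch with synthetic numbers, not my actual processing code.  It takes a year of made-up hourly temperatures for one point, in single precision to make the effect easy to see, once in Kelvin and once in Celsius, computes the variance the naive one-pass way (sum and sum of squares), and compares against a two-pass calculation in double precision -- the compute-the-mean-first approach that also comes up in the comments below.

import numpy as np

rng = np.random.default_rng(0)
celsius = 1.0 + 0.5 * rng.standard_normal(8760)   # a point near freezing
kelvin = celsius + 273.15

def naive_variance(values, dtype):
    # One pass: accumulate the sum and the sum of squares in the given
    # precision, then use  variance = mean(x^2) - mean(x)^2.
    n = dtype(values.size)
    s = dtype(0.0)
    ss = dtype(0.0)
    for v in values.astype(dtype):
        s += v
        ss += v * v
    return ss / n - (s / n) ** 2

def two_pass_variance(values):
    # Two passes in double precision: mean first, then squared deviations.
    x = values.astype(np.float64)
    return np.mean((x - x.mean()) ** 2)

print("two-pass, double precision      :", two_pass_variance(celsius))
print("naive, single precision, Celsius:", naive_variance(celsius, np.float32))
print("naive, single precision, Kelvin :", naive_variance(kelvin, np.float32))

In runs like this, the Celsius version stays close to the two-pass answer while the Kelvin version usually drifts visibly, sometimes even going negative.  Either way, the safer route is the two-pass calculation: mean first, then the squared deviations.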

* Numbers stick in my mind.  This one is from when I was working on my master's degree doing tidal analysis.  The length of a year is an important number for tides :-)

** It may or may not do so in practice.  It depends on some behind-the-scenes things which I can't guarantee.  But the principle holds.  Try adding up 273.151 a zillion+ times and printing out the results (a sketch of exactly that experiment follows these notes).  Sooner or later, you will see numbers different from what you should get.

+ Not sure zillion is globally known.  Think of a really big number.  A zillion is bigger.  It isn't rigorously defined, obviously, but it's handy.
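
The experiment suggested in the ** note is easy to run.  A sketch, with numpy's float32 again standing in for ordinary single precision:

import numpy as np

value = np.float32(273.151)
total = np.float32(0.0)

for i in range(1, 1_000_001):          # a million stands in for "a zillion"
    total += value                     # single precision running sum
    if i % 200_000 == 0:
        exact = 273.151 * i            # what exact arithmetic would give
        print(f"{i:9d} additions: sum {float(total):14.1f}, "
              f"exact {exact:14.1f}, drift {float(total) - exact:10.1f}")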

9 comments:

Everett F Sargent said...

Well you did mention double precision.

I can't even remember the last time I used single precision code.

Lately, I've been using quad precision (exponent somewhat north of +/-4400 (~33 digits), double precision somewhat north of +/-300 (~15 digits)).

And, if you want to go there, there is always arbitrary precision.

Of course, the easiest thing to do is calculate x bar first, then do your variance calculation.

Also, I have a tendency to add random white noise to numbers reported with fewer significant digits (i. e. the internal binary floating point precision).

Doing an overflow/underflow calculation seems rather obvious, in that you don't ever want to do such a calculation if it can be avoided. Ergo arbitrary precision?

So seeing as you are going somewhere with this, I await your insight.

Robert Grumbine said...

Thanks Everett.

I'm not going anywhere terribly profound. Just a minor warning to readers who haven't been through numerical analysis that we can't just trust the computer to do the arithmetic the way we expect. Given the amount of single precision programming I see, some older hands might also make use of the reminder.

Like you, I normally compute the mean first, and then consider the higher powers -- variance (x^2), skew (x^3), and kurtosis (x^4).

Now that quad precision is available, I've started moving towards it for this kind of calculation.

In the meantime, could you explain the benefit of your white noise addition?

Everett F Sargent said...

Robert,

White noise, not useful in this particular case.

I usually employ it in discussions of averaging temperature time series that carry two or three digits of precision (past the decimal point) when the underlying data are only good to the nearest degree.

I've added arbitrary precision in at least two other cases, can't remember why, likely not necessary, but I've been known (or self aware) to go OC/AR in the past.

As to the assumption of a normal distribution: I think in the root mean square sense (your variance calculation) you can call it sigma, but I almost always check the underlying distribution (the higher moments being a good first check).

Victor Venema said...

Being a fan of variability and blogging at Variable-Variability, I feel an inner need to say: variability is not so simple.

:) Sorry.

Is the variance you show from the hourly data (including daily cycle) or from the daily averages, monthly averages or annual averages? What is the spatial resolution of the dataset?

For averages it is sufficient to mention the period and region; for variability, the spatial and temporal averaging scales also need to be mentioned.

If the data is hourly, is it an hourly average or the instantaneous value at the hour?

Robert Grumbine said...

No need to apologize, Victor! Variability, and climate, are never as simple as anyone would hope. I appreciate the discussion of these additional concerns.

The data come from the Climate Forecast System Reanalysis (citation above). Hourly data are instantaneous values once an hour -- from the CFS Reanalysis. They're area averages over the model grid cell -- approximately 1000-2000 km^2 (T382 grid, so variable across the globe).

So far, pretty good. But ...

The operating principle in the CFSR was to use the best available data at the time being considered. On one hand, well, it's the best available data for the time, so that's good news. On the other hand, it means that there are several discontinuities (lack of data homogenization) through the record used by CFSR.

In other words, I have a post or two coming that will point to some of your efforts. Your comments (or a guest post or two) welcome!

Bryan - oz4caster said...

Robert, quite a tough challenge to crunch through all that data, but being a meteorologist who has used that data source over the years, I believe it is a good way to assess global climate. It offers much better global coverage than other methods. I will be interested to see how your results compare with what has been posted at WeatherBell. http://models.weatherbell.com/temperature.php

Robert Grumbine said...

Thanks for the link Bryan. I know of weatherbell by way of Joe Bastardi, who says some remarkably wrong things about sea ice. Looks like he doesn't have anything to do with this page, and it looks reasonable.

For some things, I think the CFSR (and other reanalyses) are the best way to go. Great time and space coverage, for instance. For long term trends, though, not so good because of the homogenization issues Victor and others work on.

For instance, if you work with the NCDC quarter degree sea surface temperature, you have two versions to choose from. One is AVHRR-only (the only satellite instrument used; surface data are also used). The other is AVHRR+AMSRE (plus the same surface data). It turns out that if you use the AVHRR+AMSRE, you get some odd trends and discontinuities. It's a better analysis on any given day it covers, but adding AMSRE changes the long term trends in ways you don't see without it.

Bryan - oz4caster said...

Robert, I'm not sure exactly what data were used in the WeatherBell analyses. Of particular interest are the three successive "Global Temperature Traces" showing the resulting global temperature anomalies over the period from 1979 through 2014 near the bottom of the page. The graphs appear to show daily anomalies along with a 90-day trailing mean. Interestingly, these results do not show much resemblance to the largely "homogenized" global temperature anomaly trends based on much more sparse surface measurements presented by NCDC, GISS, and HADCRUT. I have not seen any direct comparison of the WeatherBell vs the other three analyses, which would be interesting. The discrepancies among the WeatherBell, NCDC, GISS, HADCRUT, UAH MSU, and RSS MSU analyses may simply indicate that uncertainties are fairly large. My best guess is that uncertainties for annual average global temperature anomaly for all of these methods could be on the order of 0.5 C and possibly larger.

Bryan - oz4caster said...

Robert,

I discovered that the University of Maine Climate Change Institute has been working with CFSR as well as ERA data, and they have put together an excellent web tool for viewing some of the data here:
Climate Reanalyzer.

I used it to make a time series graph of CFSR global temperature anomalies for comparison with the estimates from NCDC, which I posted here in case you may be interested: Climate Concerns.

So far I have not delved much into how the CFSR was created nor how the University of Maine has updated it to 2013. I had to add the Weather Bell estimate for 2014 to complete the graph I made.