Following the second what is climate note and the second testing ideas note: how might one go about testing whether climate had changed and, more specifically, whether a prediction of change was supported or not? At hand is the question of whether the IPCC projection (not forecast) of 0.2 C/decade for the first two decades of the century could be rejected by observations currently in hand. I concluded in the testing ideas post that the methods used there were not useful, as they hadn't compared like to like in making the tests, nor been working with climate variables.

Now that the second what is climate note is in hand, we can make a better sort of test. Still not rigorous, but better than the one I discussed. That is, let us take the 20 year average from 1981 to 2000 (IPCC was allowing 20 year averages, though preferred 30) and compare to the 7.6 year average from January 2001 to July 2008. The latter period is only 91 months, which is short for a climate definition, per our experience in the what is climate note. See again how wobbly the averages are for averages that are only that short.

I took the NCDC data again and computed the averages. (At least I hope so, I might have been off in my start periods, but probably only by a month or two.) For 1981-2000, the average was 0.277 C, with 0.155 standard error (if we also assume the terms are normally distributed, which is a question at this point). For 2001-present, the average was 0.540 C with standard error 0.104. To compute the trend, we take the difference between those two averages (0.263) and divide by the number of years between the midpoints -- 1990.5 and 2004.8, respectively. We wind up with 0.18 degrees/decade for our slope. Does that represent a significant difference from 0.2? Well, for starters, the 0.2 was a round figure from the IPCC -- their actual words were 'about 0.2'. 0.18 is 'about' 0.2. Eyeballing the spread of the different projections, a difference of 0.02 is within the envelope. Then, to be serious, we'd also have to allow for the errors of measurement that go into computing the global averages.
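The arithmetic above is short enough to check directly. The sketch below uses only the averages and midpoints quoted in this post; the function name `two_period_trend` is made up for illustration, and nothing here recomputes the averages from the NCDC data.

```python
# Two-period trend estimate, using the figures quoted in the post.
early_mean = 0.277   # deg C anomaly, 1981-2000 average (from the post)
late_mean = 0.540    # deg C anomaly, Jan 2001 - Jul 2008 average (from the post)
early_mid = 1990.5   # midpoint year used in the post
late_mid = 2004.8    # midpoint year used in the post


def two_period_trend(mean_a, mean_b, mid_a, mid_b):
    """Slope in deg C per decade between two period averages."""
    return (mean_b - mean_a) / (mid_b - mid_a) * 10.0


difference = late_mean - early_mean
trend = two_period_trend(early_mean, late_mean, early_mid, late_mid)

print(f"difference = {difference:.3f} C")
print(f"trend = {trend:.2f} C/decade")
```

This is only the naive slope between two averages; it says nothing about the uncertainty of that slope, which is where a more rigorous test would have to start.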

The upshot is, to the extent that it's meaningful to test the IPCC AR4 projection 13 years too early, observations of global average surface air temperature are pretty close to what's expected from this consideration. Allow for the fact that the sun's been quiet (rather than warmer) since 2001, and there's even better agreement.

One can definitely do a more rigorous test than this. And, for professional publication, would have to.

## 28 August 2008


## 16 comments:

Assume, for a moment, that the GMST had a 1 degC/decade cooling trend since 2001. Would you still insist that the models cannot be falsified by such a short trend?

If your answer is no then your entire argument falls to pieces. If your answer is yes then you are saying that we cannot use GCMs to make policy decisions until at least 30 years have passed. This means we should ignore the AR4 projections until at least 2031.

Of course, the IPCC has the option of retracting the AR4 projections and using the TAR projections instead. However, that means we still need to wait till 2020. Going back to the TAR would also raise the sticky question of why the IPCC updated the projections in AR4 if the TAR projections were right.

It would be better to calculate the trend using ordinary least squares regression or some variant. I've done work very similar to Lucia's, but I've been comparing the temperature trends in model data on Jan-2001/July-2008 to see how well they compare to the surface temperature records. I know using such a small interval won't reveal any meaningful climatic trends, but it is useful for diagnosing how well the models perform on short, noisy time scales.
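For readers who haven't met it, an ordinary least squares slope can be computed in a few lines. This is a minimal sketch on a synthetic, noise-free series (not real temperature data, and not the commenter's own code); the helper name `ols_slope` is hypothetical.

```python
def ols_slope(times, values):
    """Ordinary least squares slope of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx


# Synthetic monthly series: an exact 0.02 C/year line over 91 months.
# This is made-up data, used only to show that the slope is recovered.
months = 91
times = [2001.0 + m / 12.0 for m in range(months)]
values = [0.02 * (t - 2001.0) for t in times]

slope_per_decade = ols_slope(times, values) * 10.0
print(f"{slope_per_decade:.3f} C/decade")
```

With real monthly anomalies the slope comes out just as easily; what it means over so short a span is the question the rest of this thread argues about.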

PD - Interesting post. Since I understand that this blog, in part, is targeted at middle school and high school students, perhaps this would be a good post to use to exercise critical thinking skills and to evaluate whether the chosen test is the right one. Therefore, I pose the following questions for any younger (or older) readers:

1. What is the one clear conclusion that can be drawn from comparing the average GMST for 1981-2000 to the average for 2001-present?

2. Is the calculated trend of 0.18C/decade the trend for the time period from 2001 to the present? If your answer is yes, why? If your answer is no, what is the time period for the calculated trend?

3. Did the test used in this post answer the question at hand as posed at the beginning of the post?

I think the answers to my questions can be found by a careful and thoughtful reading of the post. I will be back tomorrow to give what I believe are the correct answers.

Bob: It will be interesting to see what you think the answers are. I hope they also include consideration of the referenced what is climate note.

Chad: In the recent what is climate note, I looked at finding an average. That's because we normally consider climate to be the average (or other statistics) over a period of time. Computing trends of weather ... well, what exactly does that mean? I'm starting with, first we have to know what a thing *is* before we look at its trend. While it is awfully easy to compute trends and correlations (same problem fundamentally), it is often not meaningful to do so. It looks, from some browsing I did yesterday, like the trend computations stabilize in the same way, and same time scale, as computing the averages. But I'll make my own computation in the same way, as it does have an important difference in approach (at least philosophically).

Evaluating how well a model performs must be done in a context of what it was built to do. If a model is built to compute the right weather statistics over 20-30 year time scales, that's your thing to evaluate. If it isn't designed to predict hurricanes, then marking it down for not giving the right predictions about Fay and Gustav is wrong.

Folks do test models and data sources outside their original planned purposes. The MSU was not designed for climate purposes, but after 20 or so years of working on it, people have found ways to make it an interesting data source even for climate study.

Raven: I'm mystified as to what you think my 'argument' is. All I wrote was that a certain effort to test certain climate projections didn't make a meaningful test. I've also noted that it doesn't mean that there can't be a meaningful test, nor that one can conclude that the projections are correct given that test's failure. In the note you're responding to here, I make an illustration of a test that could be made meaningful. It happened, which surprised me, that the projections actually passed.

Next questions would include whether they'd pass if we made the test more rigorously, and what, exactly, it is that's tested when looking in this way.

As to the IPCC itself -- they're a starting point, not gospel. If you don't like something they do, or don't, well, that's no skin off my nose. I haven't been involved in a single report. I'm not in automatic agreement with what they say. On the other hand, I'm also not in automatic disagreement with what they say either. If someone makes a bad test (bad for a number of different reasons), then it's a bad test. If someone makes a good test that comes to the same conclusion as the bad test, then (finally) the conclusion is meaningful.

Honestly, what Lucia has done is EXACTLY what we need to be teaching children in junior and senior high school to do.

See a hypothesis. Get actual data collected after the hypothesis is made. Test the accuracy of the hypothesis using basic statistical methods.

I have no problem with advocacy, but you are setting a very poor example for secondary school students studying science (what I understood to be your target audience).

[*]If you don't think that there is a broad based consensus that irrespective of further emissions, the world will heat up by roughly .2 degrees centigrade per decade between 2000 and 2020, then you simply need to be better informed. Lucia's list of citations is but a tiny fraction of a percent of the places where such a claim has been made. It would frankly be difficult to find a model or consensus climate scientist who disagrees with this prediction.

No, not exactly, because she didn't exactly do what you are describing.

'See a hypothesis' does not include rewriting what your source said (roughly 0.2 is not 0.2 exactly, for instance). 'Test a hypothesis' does not include testing a hypothesis about kumquats with observations of pomegranates.

Weather and climate are not the same things. See my what is climate note from Wednesday for some of what happens when you ignore this.

It is, yet again, strange to see myself accused of 'advocacy', when all I did was say that a particular source didn't make a good test.

Since you're commenting here, what are the problems with my version of testing a hypothesis (her quote of IPCC) vs. observations (NCDC)? It does arrive at a very different conclusion. But such is life. What is wrong with the *method* that gives it that conclusion? It's a far simpler test than hers, more in keeping with your desire for tests using basic statistical methods.

PD -

As indicated, I am back to give what I believe are the correct answers to the questions I posed yesterday.

The first question was "What is the one clear conclusion that can be drawn from comparing the average GMST for 1981-2000 to the average for 2001-present?"

Simply put, the correct conclusion from this analysis is that it has been warmer since 2001 than during the preceding two decades. The *technically* correct way of stating the conclusion might be something like "Based on NCDC's estimates of the global mean temperature, the average global temperature anomaly for the period from 2001-July 2008 was greater than the average temperature anomaly for the period from 1981 through 2000."

The second question was "Is the calculated trend of 0.18C/decade the trend for the time period from 2001 to the present? If your answer is yes, why? If your answer is no, what is the time period for the calculated trend?"

The answer to the first part of the question is no, 0.18C/decade is not the trend from 2001 to the present. Although the method used is not a standard statistical method for calculating trends, it is not unreasonable and the result is not remarkably different from that of other methods such as linear regression. However, if I am interested in the trend since 2001, data prior to 2001 is irrelevant. Suppose, for example, I was interested in Derek Jeter's batting average for the period from 2001 to the present. Is there any reason to use any of his batting averages in the 90s for that calculation? Another easy-to-visualize analogy: suppose I climbed a Mayan pyramid (steep sides with a flat top). At this point in time, I have climbed to the top of the pyramid (let's say 300 steps up) and walked almost all the way across the flat top (let's say 100 steps). If I calculate a trend of how much I have risen using all 400 steps, it will clearly be up. However, that overall trend does not accurately characterize the flat trend of my last 100 steps.

I believe that, technically speaking, the calculated 0.18C/decade trend applies to the period from 1990.5 to 2004.8 although it is not markedly different from the trend calculated using linear regression for the entire period from 1981-2008.

My final question was "Did the test used in this post answer the question at hand as posed at the beginning of the post?"

That question was: "whether the IPCC projection (not forecast) of 0.2 C/decade for the first two decades of the century could be rejected by observations currently in hand."

The answer to my question is NO, the test used in this post did not answer the question at hand. Why? Because the test used in this post did not calculate a trend for the first 7 1/2 years of the first decade of this century which could then be compared to the IPCC projection of about 0.2C/decade for the first two decades of this century.

Now to the broader question of whether the IPCC projection of 0.2C/decade for the first two decades of the century can be rejected by the observations currently in hand. I believe that, based on the essentially flat trend since 2001, we can say that currently, the temperature trend is not consistent with the IPCC projection. Further, it does appear relatively unlikely that the global mean temperature will rise by 0.2C for the first decade of this century, or even the 0.1C/decade which IPCC indicated was expected even if GHG levels were capped at the 2000 levels. However, sharp temperature increases over the next two years could allow the actual temperature change to become consistent with the IPCC projections.

As to the followup comment about what is climate, as a geologist by training, I am willing to go with the rather fuzzy definition that climate is what you expect, weather is what you get. Insisting on using a 20 or 30 year average (what's the basis for that timeframe?!?) doesn't seem well supported, especially if your parameter of interest is consistently increasing or decreasing over that 30 year timeframe.

Cheers,

Bob North

> Since you're commenting here, what are the problems with my version of testing a hypothesis (her quote of IPCC) vs. observations (NCDC)?

There are three main problems with your test: 1) You aren't testing forecasting ability, 2) you aren't testing during the correct period and 3) you have chosen a test known to have low statistical power, when a test with higher power exists.

Here are details on each:

1) Your test of the model accuracy uses data that existed before the specific models used to create the forecast were created and run. Anyone could 'predict' JFK's assassination in, say, 1967. The hard thing would be to predict the assassination in 1960 or even just one day before it occurred.

Because it is easy to predict the past, it is standard to test predictive accuracy using out of sample data.

In the case of AR4 projections, that means data at least after 2001. (Model runs prior to 2001 were used in the TAR projections.)

2) Even in the AR4, the trend of 2C/century is not projected for the period 1990-2000. And, I'm not just quibbling about the meaning of "about". The 2 C/century is projected for the first few decades of *this* century. So, you are testing a trend that is projected to apply after 2001 using data from, when... 1990?

In contrast, I am testing the IPCC hypothesis against data collected in the period where the projections apply: the early part of this century.

3) The other thing that is wrong with your *method* is that you are selecting a method with unnecessarily low statistical power.

It is well known that failing to disprove a theory or prediction doesn't prove the theory or prediction. You recognize that in your comment. Most people do.

What is also fairly known is that *given the same amount of data* some tests have lower statistical power than others. At a chosen confidence level (say 95%), both tests will incorrectly "falsify" a null hypothesis at the same rate (i.e. 5%).

However, if you use a test with lower statistical power you will *fail* to falsify a theory or prediction that is *false* too often. That is: You will conclude there is not enough data to *disprove* a hypothesis that is wrong. Failing to disprove an incorrect hypothesis is always the major risk when we have little data. If this is due to lack of data, that's just the way it is. The remedy is to obtain more data.

But failing to falsify an incorrect null hypothesis because we picked a test with *low power* is the fault of the analysis.

To obtain correct answers at the highest possible rate, at a chosen confidence level, you want to pick the statistical test with the highest power.

At any given time, the t-test has higher power than the test you came up with. So, the t-test on the slopes is preferred over your test even if the two deficiencies above did not apply to your test. The fact that tests with low power exist, and don't falsify a theory, does not invalidate the result from a test with *higher power*.

You can read a little about statistical power at wikipedia. I discussed beta error here

On the kumquats issue: why you persist in ignoring the fact that I compare projections of surface temperature to observations of surface temperature (GISSTemp, HadCrut and NOAA NCDC) mystifies me. Why you persist with "about 2C/century" not meaning 2C/century also mystifies me. Reading chapter 10 of the WG1 report of the AR4 or calculating the values from the underlying runs makes it clear "about" in the AR4 means 2.1 C/century to 2.2 C/century. So, my rounding down *favors* the models. If I stuck with 2.1 C/century to 2.2 C/century, the projections would look worse. Of course you can keep repeating you found a sentence that said "about", but people who are interested in learning what the author actually said will read the IPCC AR4, my blog, Real Climate and many other sources, and recognize the author of the IPCC AR4 really did project 2 C/century for the period where I apply my hypothesis tests.

Bob, Lucia,

Same error from both of you, and neither seems to have read the note that I suggested be read -- http://moregrumbinescience.blogspot.com/2008/08/what-is-climate-2.html.

The error is that neither of you is making a test against a *climate* variable. Monthly averages aren't climate, nor is even an annual average.

To test whether a *climate* projection is good or bad, you have to test it against *climate* observations.

If you want to look at climate trends, then you first have to find out what the climate was at one time, then make your best estimate of what climate is at some other time, and find the trend line between the two *climate* figures.

In that previous note, I illustrated why it is you want at least 15 years, and preferably 20-30, to decide a value of climate. Using only the recent 7.6 years gives us an unreliable climate estimate. It is unreliable not for statistical reasons but for physical reasons. Whether or not it *looks* like the statistics are 'good' for this period -- ignoring what we know physically of the system -- we know that this is a short period.

If baseball statistics were better-known, the Jeter example could be particularly good, except that you got the wrong end of the stick. Someone could hypothesize, for instance, that hitters' batting averages tail off in the second half of their career. (Need some fine-tuning on this to make it more directly analogous.) In normal baseball-speak, this means that their averages (climate) in the second half would be generally lower than in the first half. You don't test this statement meaningfully by only looking at the second half averages for whether the 'trend' is going down.

In that vein, it would be quite remarkable for a hitter to slump in the last quarter of the season and for his average *not* to drop. Yet you note that you believe (I haven't computed it either, but for eyeball purposes it seems ok) that climate has 'slumped' (rate of increase in temperature has dropped) these past 7.6 years, but, nevertheless, if the last 27.6 years were used, you'd get the same trend as my naive method here (which also turns out to give the same trend as a least squares through the first 20 years only). Adding over 1/3rd new data that has a different character *ought* to change the result. If it doesn't, that suggests it doesn't really have a different character. Your Mayan example shows this. 300 steps up (steep steps! let's say a slope of 1), and then 100 steps flat. After those 100 steps flat, your average slope for the trip has dropped to 3/4. Different character than the first 300, so different average over the longer period. Conversely, if your slope is 1 for the first 200 steps, and still 1 after 276, I'll have to conclude that nothing much was different in the last 76, at least on average.

Last Bob note: In the note you haven't read yet, I illustrated why 20-30 years. My illustration looked only at average, but the program I wrote has since been modified to show the variance as well, and it seems to have the same character. You (being a technical guy) can easily write up a version to apply that concept to what period is needed for the slope to stabilize. I'll be doing it myself one of these days. From some casual looking about, it seems that it'll wind up also being 15 or so years for the low bound.
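The stair-step arithmetic in the pyramid example above can be checked with a short sketch. The step counts and slopes are the ones used in the comment; the function name `average_slope` is made up for illustration.

```python
def average_slope(segments):
    """Average rise per step over a list of (steps, rise_per_step) segments."""
    total_steps = sum(steps for steps, _ in segments)
    total_rise = sum(steps * rise for steps, rise in segments)
    return total_rise / total_steps


# Pyramid: 300 steps at slope 1, then 100 flat steps across the top.
pyramid = average_slope([(300, 1.0), (100, 0.0)])

# Converse case: slope 1 for 200 steps, and the overall slope is still 1
# after 276, which requires the last 76 steps to average slope 1 as well.
steady = average_slope([(200, 1.0), (76, 1.0)])

print(pyramid, steady)
```

The point of the comparison is that a genuinely different last segment has to show up in the longer average; if the longer average doesn't move, the last segment wasn't very different.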

Lucia: Statistical power is indeed an important concept, thanks for giving folks some links. To apply the t-test, however, you have to know how many degrees of freedom you have. One thing we (on the science) know is that there are far fewer degrees of freedom than number of months. Applying a t-test with the wrong degrees of freedom will give you misleading answers, whether to accept a hypothesis that you should have rejected because you overestimated the degrees of freedom, or to reject one you should accept because you underestimated. This is yet another part of why my statistician friend said "First, understand your system."
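To make the degrees-of-freedom point concrete, here is a minimal sketch of one common textbook adjustment for a series with lag-1 autocorrelation: n_eff = n(1 - r1)/(1 + r1). This is an assumption chosen for illustration, not necessarily the adjustment anyone in this thread used, and the autocorrelation value of 0.5 is hypothetical rather than measured from any temperature record.

```python
import math


def effective_sample_size(n, lag1_autocorr):
    """Approximate number of independent samples in an AR(1)-like series."""
    return n * (1.0 - lag1_autocorr) / (1.0 + lag1_autocorr)


n_months = 91   # Jan 2001 - Jul 2008, as in the post
r1 = 0.5        # HYPOTHETICAL lag-1 autocorrelation, for illustration only

n_eff = effective_sample_size(n_months, r1)

# Factor by which the standard error of a mean grows when n_eff replaces n.
inflation = math.sqrt(n_months / n_eff)

print(f"effective samples: {n_eff:.1f} of {n_months}")
print(f"standard error inflation: {inflation:.2f}x")
```

Under that assumed autocorrelation, 91 months behave like roughly 30 independent values, and the uncertainty on the mean is correspondingly wider than a naive count of months would suggest.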

There are many varieties of kumquat. The one at hand is that you compare non-climate observations to climate projections. Did the AR4 models predict the temperature in my back yard yesterday afternoon? No. (At least I doubt it -- we were well below normal yesterday.) Does this 'falsify' the models? No -- noontime temperatures in my back yard aren't what they're trying to predict, they're trying to do climate. Did they predict hurricane Katrina? No. Do you 'falsify' them for that? No, they're not hurricane forecast models. And so on.

Pinatubo, on the other hand, provided a test that could have sent the modellers back to the drawing board. The models *are* supposed to be able to do a fair job of coping with volcanic eruptions. Shortly after it erupted, GISS published a paper with their projections of the next 2-3 years, as affected by the eruption. They did well in projecting the timing, magnitude, and duration of the Pinatubo effects.

> One thing we (on the science) know is that there are far fewer degrees of freedom than number of months. Applying a t-test with the wrong degrees of freedom will give you misleading answers, whether to accept a hypothesis that you should have rejected because you overestimated the degrees of freedom, or to reject one you should accept because you underestimated. This is yet another part of why my statistician friend said "First, understand your system."

Of course using the wrong number of degrees of freedom gives the wrong answer. Had you actually *read* the article you criticized, you would be perfectly aware I did not use the number of months as the degrees of freedom. I reduced the degrees of freedom using four separate methods:

1&2) I used two different standard methods to correct for red (AR1) noise. (This method was previously discussed by Tamino here. Note he tests a theory about a trend in GMST using 91 months of data -- exactly the number I use in my test.)

3&4) I applied a new adjustment to account for white measurement noise overlaid on red noise. These are special cases, and I have not fully documented them. (They widen the uncertainty intervals.)

As to your closing statement in that paragraph: I advise you take your statistician friend's advice to heart, read a bit and gain some understanding of the system yourself. Also, you might wish to learn what my high school teachers told me: "Read and understand the contents of an article before you criticize it".

I'm not going to address the rest of what you said. It's misguided and foolish, but a full discussion buried in blog comments is a waste of time. Discussion with someone criticizing what they clearly have not read is an even bigger waste. Good luck with your blog.

Lucia: the article whose method was being criticized here is mine. One thing which I didn't do here was to apply the t-test you suggested. Another thing I didn't do, which is required to apply the t-test, is to find meaningful estimates of the number of degrees of freedom. There are other points of non-rigor in my note.

Again, you misread.

Raven: (not posted) Please read a text on climate before making assertions about weather and climate being indistinguishable. One good text, if somewhat advanced (you should be comfortable with at least vector calculus and preferably partial differential equations) is *Physics of Climate* by J. P. Peixoto and A. H. Oort, AIP, 1992. As you mention chaos, you'll appreciate the fact that the foreword was by Edward Lorenz.

Penguindreams,

I think you are reading too much into whether Lucia's test is a "good test".

She is comparing a trend for 7 or 8 (whichever, I can't subtract) years from an ensemble model used by the IPCC, with the temperature data over 7 or 8 years. This is a fair test of the ensemble.

I would direct your attention to a post on RealClimate which addresses what the models tell us. If you direct your attention to the update at the bottom of the post, you will see that Gavin points out that the period can be tested.

Anon, thanks for the link to Gavin's article. After reading it, and the 400+ comments, though, I don't see him saying that something like the rankexploits test would be good.

Since I was given to believe that rankexploits was making a serious, rigorous test, I looked at it against that standard. The more so as it talks about 'falsifying' peer-reviewed research.

There are indeed some good tests that can be done with only 7-8 years of data. rankexploits doesn't present one such. Nor is my simple example here a *good* test. If you're willing to accept things that aren't good tests, then one such is necessarily on par with another one. Two not-good tests give wildly different results. Maybe we should look for a good test?

One thing Gavin does, which I said (over at rankexploits) needed to be done for a good test, is to make the test while acknowledging that the models themselves have variability -- both within-model and between models. The IPCC figures show this, and there are 55 ensemble members to work with. Their output is available, links given at Realclimate. Rankexploits told me to do it myself instead.

From Gavin's update:

> For 7 year trends (beginning of 2001 to end of 2007), the model spread is approximately N(0.2,0.24) in deg C/dec - a little wider than the 8 year trends seen in the figure and there are 10 model simulations with negative trends.

Gavin's notation is to say that the model trend, all models and model runs considered, over that period has a mean of about 0.2 and a standard deviation of 0.24 deg C/decade. Rankexploits did not test that hypothesis, instead testing a hypothesis of N(0.2,0.000).
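To see what a spread of N(0.2, 0.24) implies, here is a rough sketch that treats the model trends as exactly normal, which is an idealization. The mean and standard deviation are from Gavin's update; the count of 55 runs is the ensemble size mentioned above and may not be the exact set his figure used; the observed trend of 0.0 is a hypothetical stand-in for an "essentially flat" period, not a computed value.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


model_mean = 0.2   # deg C/decade, from Gavin's update
model_sd = 0.24    # deg C/decade, from Gavin's update
n_runs = 55        # ensemble members mentioned in this thread (assumed applicable)

# Fraction of runs expected to show a negative 7-year trend.
p_negative = normal_cdf(0.0, model_mean, model_sd)
expected_negative = p_negative * n_runs

# How far a HYPOTHETICAL flat observed trend sits from the model mean.
observed_trend = 0.0
z = (observed_trend - model_mean) / model_sd

print(f"P(trend < 0) = {p_negative:.2f}")
print(f"expected negative runs = {expected_negative:.1f}")
print(f"z for a flat trend = {z:.2f}")
```

Under these assumptions about a fifth of the runs come out negative, in line with the 10 negative simulations quoted, and a flat observed trend sits less than one standard deviation below the model mean.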

To translate to something more familiar, consider professional basketball players. Suppose I said that they averaged 6'8" (2.00 m) with a standard deviation of 3" (7.5 cm). You then come along and get a sample of 16 professional basketball players, and your sample has an average of 6'4" (1.90 m) and standard deviation of 2" (5.0 cm). (Your sample was heavy on guards and short forwards.) If you do as rankexploits did, you ignore that there's a range of heights in the pool and reject the hypothesis that the population average is 6'8" (2.00 m). If you allow for the fact that there is variability in the population, you look at it and conclude that you may have pulled out a sample heavy in guards. Maybe the hypothesis (my figure for pro basketball player heights) really is wrong. But you need more data to make that conclusion.

More data to make that conclusion. Maybe, say, use 20 years for your climate test rather than 7? At 20 years, the standard deviation in Gavin's note has dropped from 0.2 or so (C per decade) to 0.09. It isn't zero, and never will be. But by taking the longer period, the uncertainty in both the model projections and in the observations drops. The more it drops, the stronger your conclusions can be.

Take another look at http://moregrumbinescience.blogspot.com/what-is-climate-2.html and see what happens to the averages as we change the length of the averaging period. At 7 years (total period, 42 months before and after) the average can be quite different from what you'll see at 20 years. But 20 years gives you pretty similar results to 15 or 30 or things in between.

n.b.: If someone does the detailed calculation on my basketball player example and finds that the exact numbers I made up don't give the conclusion I did, juggle the standard deviations until they do.

Aside: n.b. is from the Latin "nota bene", 'note well'. Something of a heads up, in other words.

Thank you for your reply.

There are two issues in your response.

1. Is the sample size large enough? I don't much care about whether we are discussing climate, player height, or the price of tea in China. The size of the sample does not a priori cause a comparison to be invalid. You can however use the type two (beta) error to determine the probability that the comparison is invalid.

2. Is the only valid test one which compares the precision and accuracy of two samples?

I'll take a cynical approach to the question. Let's assume that the answer is yes. If I were modelling or assembling a model ensemble, being unable to increase the accuracy I could decrease precision and lower the probability of the model or ensemble being falsified.

Now my turn for an example: Say I give a piece of paper to a group of children and tell each to draw freehand a circle between 4 and 6 inches in diameter. The less skillful the child, the more variance their model of the circle will have.

Say one child scribbles over the entire page. Is his model of a circle consistent with a real world circle with a 4 to 6 inch diameter? If we require a test which uses the variance of the child's model, it is. In fact, it is impossible to say it is not a circle.

See how many things can be involved in making a good test? And we're only scratching the surface here.

1) Whether the sample size is large enough is certainly a problem. I've been saying that for some time. In making your beta test, though, you're assuming that we know what the noise characteristics are. That's rather an assumption. Now you can make several different assumptions and hope that one of them represents reality, but you've still got the outstanding question of whether any of them *do* represent reality. (And if so, how well, etc, etc.)

2) Your example is no help at all. Staying with the cynical side, however, you suggest that it's possible that the modellers could simply avoid rejection vs. observation by ensuring that the model variability was high enough to avoid such rejection.

Actually, this wouldn't work even if we were paranoid enough to think they were trying to. Same as you can test averages, you can test variations. Reality has interannual variation, as do the models. So you look to see whether the model variances are unreasonable vs. the observed variances. (And now you have yet another batch of work to do to figure out how long a series is enough and how confident you are about the observations, etc. It looks like 20-30 years is what you want for this, too.)

That's on the side of an individual model run. Ok, staying paranoid, how about the hypothesis that they cook the ensembles to ensure that the ensembles are widely strewn. Well, then you look at how the ensembles are constructed and whether unphysical things are present. Ensembles within a model are generally constructed by perturbing the initial conditions given to that model. (If the physics inside the model are changed, it's normally considered a different model; perhaps only a sub-type, but different.)

If you'd like to explore this latter possibility, the realclimate article linked to above has a link to the model runs.

I give this one an extraordinarily low likelihood of being true. It's too hard to make the models do what we want. And they're much too computationally expensive for would-be conspirators to simply run a huge batch and cherry-pick a selection that gives the 'desired' conclusion. Even if it *were* possible, scientists are the worst personality type possible for a conspiracy. Not for nothing the phrase 'herding cats' gets used so often.

> Would you still insist that the models cannot be falsified

That's not a casual misunderstanding or misquote of your point here. You'll need to teach rhetoric and logic along with climate to parse the comments.
