Probabilistic evaluation of competing climate models

Climate models produce output over decades or longer at high spatial and temporal resolution. Starting values, boundary conditions, greenhouse gas emissions, and so forth make the climate model an uncertain representation of the climate system. A standard paradigm for assessing the quality of climate model simulations is to compare what these models produce for past and present time periods, to observations of the past and present. Many of these comparisons are based on simple summary statistics called metrics. In this article, we propose an alternative: evaluation of competing climate models through probabilities derived from tests of the hypothesis that climate-model-simulated and observed time sequences share common climate-scale signals. The probabilities are based on the behavior of summary statistics of climate model output and observational data over ensembles of pseudo-realizations. These are obtained by partitioning the original time sequences into signal and noise components, and using a parametric bootstrap to create pseudo-realizations of the noise sequences. The statistics we choose come from working in the space of decorrelated and dimension-reduced wavelet coefficients. Here, we compare monthly sequences of CMIP5 model output of average global near-surface temperature anomalies to similar sequences obtained from the well-known HadCRUT4 data set as an illustration. Copyright statement. The author’s copyright for this publication is transferred to California Institute of Technology.

In principle, statistical inference about climate (both historical and future) using observations, is 122 based on the conditional distribution, P(Y Y Y |Z Z Z h ), which is, via Bayes' Rule, The right-hand side of Eq. (1) can be written as, where the first equality assumes, quite naturally, that historical data depend only on the historical 125 climate, not the future climate. from the ensemble of climate model runs is represented by the vector X X X † : X X X † f |X X X † h represents the probability distribution of Y Y Y f |Y Y Y h , and how well the probability distribution of It remains to determine how the relationship between X X X lh and Y Y Y h can be quantified in order to 158 model P(1 l = 1). One obvious way would be through D D D l = (X X X lh −Y Y Y h ), and to assign P(1 l = 1) 159 proportional to the probability that D D D l falls into some restricted region around the origin in high-160 dimensional space. Operationally, this would likely be difficult because of the high dimensionality 161 and the ad hoc choice of a restricted region. The distance D l = ||D D D l || (or some weighted version) 162 could be used instead, and we could assign P(1 l = 1) ∝ P(D l ≤ d), where d is a positive real num-163 ber. However, taking the (possibly weighted) norm is a huge simplification that allows bad fidelity 164 in one portion of the time sequence to offset good fidelity in another, which can lead to undesirable 165 results. Moreover, these sequences exhibit temporal dependence, and so any methodology and its 166 associated theory needs to incorporate this. 167 One way to account for temporal dependence is to transform the sequences so that the trans-168 formed values are decorrelated; in spectral analysis, this is sometimes called pre-whitening. In 169 wavelet analysis, the Discrete Wavelet Transform (DWT) can be used: where W is a square, orthonormal matrix (i.e., W W = I I I) that acts on a generic time sequence X X X 171 resulting in the wavelet coefficients C X X X (Percival and Walden 2006). The choice of wavelet basis 172 functions (father and mother wavelets) will determine the form of W .
Further, we assume that the noise terms, e lh (t) and e 0h (t), are all mutually independent with means 187 equal to zero but potentially unequal variances, E e 2 lh (t) = σ 2 l (t) and E e 2 0h (t) = σ 2 0 (t).

188
The wavelet decomposition is a decorrelator, just like the usual Fourier spectral decomposi-189 tion, but wavelets easily capture local behavior through functions that are of compact support,  Thus, our evaluation of model l will be through comparison of α l ,β l to the null value (0, 1) for 208 each l = 1, . . . , L.

209
Of course, we would prefer to compare {X X X lh } directly to the true climate Y Y Y h , but a noisy version C e e e lh , and C e e e 0h are the vectors of wavelet coefficients of Y Y Y s h , Y Y Y n h , e e e lh , and e e e 0h , 214 respectively. The terms in parentheses in Eq. (13) cannot be separately identified, so we consider 215 them to be residual errors.

216
The key assumption that we shall make is that Z Z Z h can be denoised to leave behind only the 217 wavelet coefficients associated with climate signal, C Y Y Y h . LetJ be a constant,J ≤ J, that specifies 218 the number of coarse-scale wavelet-decomposition levels that define climate signal in the wavelet-219 level hierarchy. Let S (C X X X ,J) be a smoothing function that operates on C X X X by setting elements 220 corresponding to levels greater thanJ, to zero. So, and the corresponding smoothed time sequence is S(X X X,J) = W S (C X X X ,J). Our assumption is that 222 after smoothing, S (C Z Z Z h ,J) = C Y Y Y s h , the wavelet coefficients of the climate signal.

223
Climate model sequences {X X X lh } can be evaluated according to how well their wavelet coeffi-224 cients corresponding to levels 1, . . . ,J, reproduce those of C Z Z Z h . Define T (C X X X ,J) as a truncation 225 operator that deletes all elements of C X X X that correspond to levels greater thanJ. Then,
motivated by simple linear regression. Define, In what follows, we shall consider a test statistic based onα l andβ l . It is crucial to obtain Thus, the detrended series are: To prepare for the DWT, we padX X X l andZ Z Z so that they have lengths equal to T = 2 log 2 N , where 263 · is the ceiling function that returns the smallest integer greater than or equal to its argument.

264
To do this, we reflect the appropriate subsequences of components at the beginning and end of b. Estimating summary statistics

269
We perform wavelet decompositions, with J levels, onX X X l andZ Z Z h using a common wavelet basis.

283
This results in a p-value, which we interpret as a measure of compatibility of the model output with hypotheses, H l0 : (α l , β l ) = (0, 1), for l = 1, . . . , L. We use the test statistic, where K K K is the bootstrap covariance matrix of α * bl ,β * bl : b = 1, . . . , B , namely Specifically, the p-value associated with our test is estimated by We call this the "compatibility" of model l's output time sequence with the observational sequence  In this section, we demonstrate the methodology described in Section 3 by applying it to the 306 evaluation of monthly global average near-surface temperatures produced by 44 CMIP5 models. 307 We evaluate these against a benchmark observational data set used in a similar comparison pre-   Table 1 lists the 44 models used in this study and 342 the modeling centers that are responsible for them.  The second preprocessing step is to pad the sequences so that they have lengths equal to the next- . 409 We computed estimated regression coefficients using formulas given in in Section 3b. Results To generate an approximation to the sampling distribution of (α l ,β l ) under H 0l : (α l , β l ) = (0, 1), 417 we follow the prescription of Section 3c. We fit a wavelet model using J = 11 to the detrended,

447
To evaluate the l-th model, we require the proportion of resampled points that are further away, 448 in terms of the scaled squared distance Q l (given by Eq. (37)), from the point (0,1), than the red 449 point at (α l ,β l ) is from (0, 1). This is depicted in the right panel of Figure 5 by the proportion of 450 the histogram that is to the right of the red vertical line, which is the bootstrapped p-value.

457
In Table 3 and define 479 Figures 8 and 9 show the same information as that contained in Table 3, but in the form of sequence that would perform well against HadCRUT4.

504
To explore this, we computed both a differentially weighted model mean,X X X D = (X D (1),X D (2),

505
. . . ,X D (N)) , with normalized weights, and a uniformly weighted model mean,X X X U = (X U (1),X U (2), . . . ,X U (N)) with weights P U (1 l = 507 1) = 1/44. That is, Conclusions about the CMIP5 models themselves are as follows. Table 3 shows that, accord-  Finally, we are pursuing extensions in several areas. We believe that WiSE bootstrap could 558 provide a basis for a weighting scheme for multi-model ensembles or perturbed physics ensembles.

559
The probabilities P D (1 l = 1) can be used as marginal selection probabilities when drawing time 560 sequences from an ensemble in order to form a mean sequence. However, joint probabilities 561 29 would be required to define a probability structure that captures the dependence between models.

562
A more complex multiple-testing framework will be required to assign joint probabilities rather 563 than simple marginal probabilities.

564
There are natural extensions of the WiSE bootstrap to spatial and spatio-temporal contexts. Mov-565 ing from one-dimensional to two-dimensional wavelets would allow us to use the same technology 566 on maps as we have used here on time sequences. However, moving to three spatial dimensions, 567 three spatial dimensions with time, and multivariate settings may not be straightforward since 568 wavelets may not be suitable basis functions for these more complex problems. We are investigat-

569
ing the use of other kinds of basis functions in ongoing research.

570
andJ (the number of levels in the wavelet decomposition that constitute climate signal).

594
Here we use τ 2 = logT , which satisfies the two conditions above. Note that the same values 599μ l (t) =μ 0 (t) are used in Eqs. (A1) and (A2). Using the same values is required to enforce 600 the null hypothesis. 601 6. For b = 1, . . . , B, and a fixed l, obtain α * lb ,β * lb from X X X * bl and Z Z Z * b as follows.

602
(a) ObtainX X X * bl andZ Z Z * b by preprocessing X X X * bl and Z Z Z * b as specified in Section 3a.  We conducted a simulation study to understand the performance of our proposed hypothesis testing 614 method. We generated data from the two processes, Y 1t = S 1t + ε 1t and Y 2t = S 2t + ε 2t , for t = 615 32 1, . . . , 2 J . Here, the first series Y 1t acts as the "observations" and the second series, Y 2t , acts as 616 the "model output". In all the cases we consider below, {ε 1t } and {ε 2t } are mutually independent 617 and identically distributed as N(0,V ). However, we have conducted studies with heteroskedastic 618 noise, and the results do not change in substance from those reported below.

619
The signal components of both the processes Y 1t and Y 2t satisfy the framework we adopt in this 620 paper, namely where {A j,k } are a fixed set of wavelet basis functions, and γ 2 jk = α +β γ 1 jk , which directly models false; thus, a higher power is desirable.

650
In Figure B1, we present a selection of the power curves that we obtained from our simulations.     Fig. 3. Detrended, padded time sequence for the HadCRUT4 observational sequence (light gray).

765
The reconstructed sequence using wavelet levels 1 throughJ = 6 is shown by the thin black  Table 3. The 45 • line is shown in gray.

786
Symbols and colors vary in order to differentiate visually among models. . . . . . . . 52 787 Fig. 9. Scatterplot of corr l versus p l ; values are given in Table 3. The o45 • line is shown in gray.   Table 3. The o45 • line is shown in gray. Symbols and colors vary in order to differentiate visually among models.