Climate models produce output over decades or longer at high spatial and temporal resolution. Uncertainties in starting values, boundary conditions, greenhouse gas emissions, and so forth make the climate model an uncertain representation of the climate system. A standard paradigm for assessing the quality of climate model simulations is to compare what these models produce for past and present time periods with observations of the past and present. Many of these comparisons are based on simple summary statistics called metrics. In this article, we propose an alternative: evaluation of competing climate models through probabilities derived from tests of the hypothesis that climate-model-simulated and observed time sequences share common climate-scale signals. The probabilities are based on the behavior of summary statistics of climate model output and observational data over ensembles of pseudo-realizations. These are obtained by partitioning the original time sequences into signal and noise components, and using a parametric bootstrap to create pseudo-realizations of the noise sequences. The statistics we choose come from working in the space of decorrelated and dimension-reduced wavelet coefficients. Here, we compare monthly sequences of CMIP5 model output of average global near-surface temperature anomalies to similar sequences obtained from the well-known HadCRUT4 data set as an illustration.

The author's copyright for this publication is transferred to California Institute of Technology.

Climate models are computational algorithms that represent the climate system. They simulate many complex and interdependent processes, yielding global or regional fields that evolve from the past to the present and into the future. The models allow scientists to understand the consequences of different assumptions about both the physics of the climate system and the forcings on it, including human influences. Climate models are also now viewed as decision-making tools because their projections of the future increasingly inform policy-making at the local, national, and international levels. The reliability of these future projections is central to both political and scientific debates about climate change.

Understanding climate and climate change is truly an international effort,
with modeling centers from around the world contributing model runs for the
most recent IPCC (Intergovernmental Panel on Climate Change) report. The
diversity of scientific opinion reflected by these multiple runs, which use
different initial conditions, parameterizations, and assumptions, is a key
strength of this very democratic approach to science. However, it also leads
to uncertainty because the results differ both across models and between runs
of the same model using different initial conditions and parameter settings.
To organize the effort, the Coupled Model Intercomparison Project (CMIP) was
established “to provide climate scientists with a database of coupled
GCM simulations under standardized boundary conditions,” and “to
attempt to discover why different models give different output in response to
the same input, or (more typically) to simply identify aspects of the
simulations in which “consensus” in model predictions or common problematic
features exists”

An enormous literature exists on the use of climate models, and ensembles of
model outputs, to make predictions of future climate conditions and quantify
reliabilities of those predictions. A basic strategy for quantifying
reliability of individual model runs is to assess their performance, over the
past and present, against observations.

Even if empirical accuracy is not sufficient to establish reliability of
future projections, there are other reasons why one might want to compare
climate model simulations to observations. First, there is diagnostic value
in understanding the ways in which climate model simulations agree or
disagree with observed conditions

Descriptive metrics are valuable as relative measures of the goodness of fit
of climate model simulations to observations. One can say that the RMSE,
against observations, of one model run is lower than that of another.
However, it is hard to know how to interpret metric values in an absolute
sense: how does the value of the metric relate to the probability that a
model is “right” in its representation of an observed physical process?
That question is malformed until we are precise about what right means. We
must articulate a specific hypothesis about the relationship between observed
and climate-model-simulated data; the model is deemed to be right if a formal
statistical test of that hypothesis is not rejected at an agreed-upon level
of significance. The

In this article, we present the statistical machinery for deriving compatibility measures between climate-model-simulated and observed time sequences. The null hypothesis we test is that the coarse-timescale coefficients of wavelet decompositions of the two sequences are the same. This allows for the possibility that, in the time domain, the sequences do not match exactly, but rather share longer-term, climate-scale behavior. Specifically, we break the time sequences of observations and climate-model-generated output into two components: low-frequency sequences described by coarse-level wavelet coefficients and high-frequency (possibly non-stationary) sequences described by an autoregressive integrated moving average (ARIMA) model. The coarse-level wavelet coefficients characterize decadal- and multi-decadal-scale oscillatory patterns, which we call “climate signals”, while the ARIMA processes characterize temporal dependence at finer timescales, which we call “climate noise”. Our measure of similarity is the squared Euclidean distance between vectors of climate-signal wavelet coefficients. The high-frequency climate noise might be interpreted as “weather” and does not contribute to this measure of similarity. To generate sampling distributions under the null hypothesis, we employ a parametric bootstrap in the time domain, based on the ARIMA model's fit to the climate noise. We demonstrate our method by computing the compatibilities of 139 CMIP5 historical model runs, from 44 different models, simulating monthly global near-surface temperature anomalies. We use the HadCRUT4 monthly global near-surface temperature anomaly data set as our observational benchmark.
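As a rough sketch of this decomposition, the pure-Python fragment below detrends two sequences, applies a multi-level Haar DWT, and computes an unweighted squared Euclidean distance between the coarse-level coefficients. The Haar family, the unweighted distance, and all function names are our illustrative assumptions here (the paper's wavelet choice, weighting, and boundary handling are not shown in this excerpt, and the actual analysis was carried out in R).

```python
import math

def haar_dwt(x, levels):
    """Multi-level orthonormal Haar DWT. Returns (coarse, details), where
    `coarse` holds the scaling coefficients at the coarsest level.
    Requires len(x) to be divisible by 2**levels."""
    s = list(x)
    details = []
    for _ in range(levels):
        half = len(s) // 2
        approx = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        detail = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details.append(detail)
        s = approx
    return s, details

def detrend(x):
    """Remove an ordinary-least-squares straight-line trend in time."""
    n = len(x)
    tbar, xbar = (n - 1) / 2, sum(x) / n
    b = sum((t - tbar) * (xi - xbar) for t, xi in enumerate(x)) \
        / sum((t - tbar) ** 2 for t in range(n))
    a = xbar - b * tbar
    return [xi - (a + b * t) for t, xi in enumerate(x)]

def signal_distance(obs, sim, levels=3):
    """Squared Euclidean distance between the coarse-level ("climate signal")
    wavelet coefficients of two detrended sequences."""
    c_obs, _ = haar_dwt(detrend(obs), levels)
    c_sim, _ = haar_dwt(detrend(sim), levels)
    return sum((a - b) ** 2 for a, b in zip(c_obs, c_sim))
```

With a sequence length of 1024 months, as in the case study, three levels of decomposition leave 128 coarse coefficients; identical sequences give a distance of exactly zero.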

The remainder of this paper is organized as follows. Section 2 describes the statistical model that relates model-generated output, observations, and true climate to one another. Section 3 defines the hypothesis testing framework that is crucial to our evaluation, along with the algorithm we use to implement it. In Sect. 4, we demonstrate our method and algorithm by evaluating the output of CMIP5 climate models against observations. Conclusions follow in Sect. 5.

Consider a single climate variable (e.g., global average near-surface
temperature) whose true value is generically denoted as

Assume that the true sequence

Direct comparison of

The wavelet decomposition is a decorrelator, just like the usual Fourier
spectral decomposition, but wavelets capture local behavior through functions
that have compact support, are multi-resolutional, and are translated within
each resolution level.
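For intuition, a single level of the orthonormal Haar DWT (used here purely as an illustrative wavelet family; the paper's choice is not shown in this excerpt) computes pairwise scaled averages and differences. Because the transform is orthogonal, it is exactly invertible and preserves energy:

```python
import math

def haar_step(x):
    """One level of the orthonormal Haar DWT: pairwise sums (scaling
    coefficients) and differences (wavelet coefficients), each / sqrt(2)."""
    half = len(x) // 2
    approx = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(half)]
    detail = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(half)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Invert one Haar level; exact because the transform is orthonormal."""
    x = []
    for a, d in zip(approx, detail):
        x.append((a + d) / math.sqrt(2))
        x.append((a - d) / math.sqrt(2))
    return x
```

The compact support is visible in the code: each output coefficient depends on only two adjacent input values, which is what lets the transform localize behavior in time as well as in scale.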

In wavelet analysis, the discrete wavelet transform (DWT) is

We augment the model given in Eq. (

The key assumption that we make is that

We now establish some important notation for further specifying the
statistical models. Write

We will apply the same wavelet transform to detrended versions of

To carry out a test of the hypothesis

The test statistics that we use are based on a weighted squared distance
between the climate-scale wavelet coefficients of

In what follows, it is crucial to obtain good estimates of the test
statistic's variance under

Starting with the original sequences,

Set

Obtain

Perform simple linear regression of

Perform simple linear regression of

Set

Set

If either

Set

Perform the

Compute
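The parametric bootstrap at the heart of these steps can be illustrated with an AR(1) stand-in for the paper's ARIMA noise model. This is an intentional simplification for exposition: the fitted model orders and the R routines used in the paper are not shown in this excerpt, and the helper names below are hypothetical.

```python
import random

def fit_ar1(noise):
    """Fit a mean-zero, stationary AR(1), x_t = phi * x_{t-1} + e_t, by the
    lag-1 autocorrelation (Yule-Walker); returns (phi, sigma_e). Assumes the
    input sequence has already had trend and climate signal removed."""
    n = len(noise)
    var = sum(v * v for v in noise) / n
    cov = sum(noise[t] * noise[t - 1] for t in range(1, n)) / n
    phi = cov / var
    sigma_e = (var * (1 - phi * phi)) ** 0.5
    return phi, sigma_e

def simulate_ar1(phi, sigma_e, n, rng):
    """Generate one pseudo-realization of the climate-noise sequence,
    starting from the stationary distribution (requires |phi| < 1)."""
    x = [rng.gauss(0.0, sigma_e / (1 - phi * phi) ** 0.5)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma_e))
    return x
```

Repeating the simulation many times, adding each pseudo-noise sequence back onto the appropriate trend and climate signal, and recomputing the test statistic on each result yields the null distribution against which the observed statistic is compared.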

For a given climate model

Simulation of

Define

Simulate the

For

Obtain

Obtain

Perform wavelet decompositions on

Compute the simulated values,

Recall from Eq. (

The quantile at
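One common convention for converting the bootstrapped statistics into an empirical upper-tail p-value is the add-one rule sketched below; whether the paper uses exactly this convention is not visible in this excerpt.

```python
def empirical_p_value(t_obs, t_boot):
    """Empirical upper-tail p-value of the observed statistic t_obs within
    the bootstrap null sample t_boot. The +1 in numerator and denominator
    keeps the estimate strictly positive even when no bootstrap value
    exceeds the observed one."""
    exceed = sum(1 for t in t_boot if t >= t_obs)
    return (1 + exceed) / (1 + len(t_boot))
```

For example, an observed statistic of 2.5 against the bootstrap sample [1, 2, 3, 4] gives (1 + 2) / (1 + 4) = 0.6.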

In this section, we demonstrate our methodology described in the previous
sections by applying it to the evaluation of monthly global average
near-surface temperatures produced by 44 CMIP5 models. We evaluate these
against a benchmark observational data set used in a similar comparison
presented in the 2013 IPCC report, specifically in chap. 9, “Evaluation of
Climate Models”

In this subsection, we describe both the climate model outputs from CMIP5 and the global average near-surface temperature anomaly observations against which the CMIP5 climate models can be evaluated.

The CMIP5 experiments are broadly divided into near-term and long-term experiments, with
the long-term experiments designed specifically for model evaluation

We obtained a total of 139 time sequences of global monthly mean near-surface
air temperature anomalies, generated by 44 CMIP5 models, from the
KNMI (Royal Netherlands Meteorological Institute) Climate Explorer website
(

The collection of sequences produced by a given model is called an ensemble;
some models produced just one ensemble member, while others produced as many
as 10. Most sequences cover the period 1850–2005, although some start as
late as 1861 and some end as late as 2015. The common period that we use in
this case study is May 1918 through August 2003, a sequence of exactly 1024
months. Table

The 44 CMIP5 models used in this study.

Following

Figure

Monthly global average near-surface temperature anomaly time sequence plots for the first ensemble member of each of the 44 CMIP5 models (colors), and the HadCRUT4 observational sequence (red), for May 1918–August 2003. The black line is a 12-month running mean computed from the HadCRUT4 data.

We performed the steps described in Sect.

Model evaluation results for 139 time sequences generated by CMIP5
models in the historical experiment. Different models correspond to positions
along the

The DWT was applied to the detrended time sequences shown in Fig.

We used R's

Figure

Excluding CSIRO-Mk3-6-0/7 and CSIRO-Mk3-6-0/9 (due to failure of their
residual sequences to pass the white noise test), the eight remaining members
of the CSIRO-Mk3-6-0 ensemble are shown in Fig.

It is quite clear from this figure that the climate-signal time sequence for
member 5 is closer to that of HadCRUT4 than is the climate-signal sequence
for member 10. This is a reflection of the fact that the vector of
climate-scale wavelet coefficients for member 5 is closer, in the metric

The other part of the story comes from the characteristics of the
climate-noise time sequences that are left behind after accounting for trend
and climate signal. To obtain the null distribution of

The bootstrapped model sequence is the sum of (a) the model's trend, (b) the
HadCRUT4 climate-signal time sequence, and (c) a bootstrapped realization
from a time series model fit to the climate model's climate noise
(
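In code, assembling one such pseudo-sequence is simply an element-wise sum of the three components (a hypothetical helper, shown in Python for illustration; the paper's analysis was carried out in R):

```python
def bootstrap_sequence(model_trend, obs_signal, noise_sim):
    """Assemble one null-hypothesis pseudo-sequence: the climate model's
    trend, plus the observed (HadCRUT4) climate signal, plus one simulated
    realization of the model's climate noise."""
    return [a + b + c for a, b, c in zip(model_trend, obs_signal, noise_sim)]
```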

Time sequences of the CSIRO-Mk3-6-0 ensemble. The best- and worst-performing runs are members 5 and 10, respectively. They are shown in color. Members 1, 2, 3, 4, 6, and 8 are shown in grey. Members 7 and 9 are excluded from the analysis because they failed to meet required assumptions for ARIMA simulation.

CSIRO-Mk3-6-0 ensemble members' (excluding members 7 and 9) climate-signal time sequences after detrending, estimating the wavelet coefficients for the three coarsest levels of the wavelet decomposition, and transforming back to the time domain. The HadCRUT4 climate signal, defined and computed in the same way, is superimposed in red.

Impact of internal variability on bootstrapped climate signals.
Panels

Figure

The second reason why CSIRO-Mk3-6-0/5 performs better in our evaluation than
CSIRO-Mk3-6-0/10 is now evident: there is more variation in the
climate-signal time sequences of member 5's bootstrapped realizations than
in member 10's. This is a consequence of differences in the structures of
their climate-noise sequences; these structures are quantified by
Eqs. (

Climate-noise portions (in grey) of 10 bootstrapped time sequences
corresponding to the climate signals shown in Fig.

Null distributions, obtained by parametric bootstrapping, of eight
members of the CSIRO-Mk3-6-0 model.

This conclusion is driven home in Fig.

For a single time sequence, generated either by a climate model or an observational data source, we regard climate noise as a proxy for internal variability, and our method uses a parametric bootstrap to create pseudo-realizations from it. When added to the appropriate trend and climate-signal sequences, we thus create pseudo-realizations of full time sequences having the same statistical characteristics as their original counterparts. When uncertainties on observational data are not available, this may be a viable strategy for mimicking the aggregated effects of natural variability and observational error. When only a single member of a climate model ensemble exists, as is the case for some of the CMIP5 models in the historical experiment, the method may present a way of representing internal model variability. In fact, even when multiple ensemble members do exist, we argue that they are the results of purposeful perturbations of initial conditions and model parameters, and should be regarded as a source of between-member variability rather than within-member variability.

We have introduced a method, based on a hypothesis testing framework, to
determine the degree to which climate-scale temporal-dependence structures in
an observational time sequence are reproduced by climate-model-simulated time
sequences. For a given climate model, the degree of agreement, or
compatibility, is quantified by an empirical

Of course, such conclusions are predicated on the assumptions of the
hypothesis-testing framework. These include the underlying statistical models
for the time sequences, how we define “climate scale” in the context of
those models, the choice of test statistic, and how the sampling distribution
of the test statistic is simulated under the null hypothesis. We have made
necessary choices in this work that we believe to be reasonable, but others
are certainly possible. The choice of the wavelet decomposition level that
constitutes the boundary between climate signal and climate noise is
particularly important, since experiments have shown that it can change the
results substantially. Users of this methodology are free to choose
differently in accordance with their own scientific questions and opinions.
In fact, one could test hypotheses about specific temporal scales based on
wavelet coefficients corresponding to individual wavelet decomposition
levels. Other test statistics besides our

A crucially important methodological question about this approach is whether our strategy creates variabilities that are reasonable proxies for internal variabilities of a climate model and of the natural climate system. This raises the question of what, exactly, “internal variability” means. We offer here an alternative, or perhaps a complement, to the usual and somewhat problematic definition that internal variability or uncertainty is captured by the spread of a multi-model or perturbed-physics ensemble. At the very least, we hope this work will stimulate discussion on the topic.

Finally, there are natural extensions of this method to spatial and spatiotemporal contexts. Moving from one-dimensional to two-dimensional wavelets would allow us to use the same ideas on spatial maps as we have used here on time sequences. However, moving to three spatial dimensions, to three spatial dimensions plus time, or to multivariate settings may not be straightforward, since wavelet models may not be suitable in all cases. We are investigating the use of other basis functions and bootstrapping methods for these more complex settings.

The code used in Sect.

The data used in Sect.

The hypothesis testing strategy and formulation of
compatibilities was a collaborative effort among all authors, as was the
formulation of the statistical model that underlies the method. MH and SC
developed the wavelet-based model for time sequences, and the bootstrapping
framework for generating null distributions. AB carried out the analysis
reported in Sect.

The authors declare that they have no conflict of interest.

This research was carried out partially at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. It was supported by NASA's Earth Science Data Records Uncertainty Analysis and Advanced Information Systems Technology programs. In addition, N. Cressie's research was partially supported by a 2015–2017 Australian Research Council Discovery Grant (no. DP150104576), and S. Chatterjee's research was partially supported by the National Science Foundation (NSF) under grant nos. IIS-1029711 and DMS-1622483. The authors would like to thank Huikyo Lee and Stephen Leroy for their thoughtful and thorough comments on this work. US Government sponsorship is acknowledged.

Edited by: Dan Cooley
Reviewed by: two anonymous referees