Comparing Climate Time Series. Part II: A Multivariate Test

. This paper proposes an objective criterion for deciding if two multivariate time series come from the same stochastic process. Importantly, the test accounts for correlations in both space and time. The basic idea is to ﬁt each multivariate time series to a Vector Autoregressive (VAR) Model, and then test the hypothesis that the parameters of the two models are equal. In the special case of a ﬁrst-order VAR model, the model is a Linear Inverse Model (LIM), and the test constitutes a difference-in-LIM test. One application of this test is to decide whether simulations from dynamical models are consistent observations. 5 This test is applied to the question of whether climate models generate realistic variability of annual mean North Atlantic Sea Surface Temperature (NASST). Given the disputed origin of multidecadal variability in NASST (e.g., it could be forced by anthropogenic aerosols or arise naturally from internal variability), the time series are ﬁltered in two different ways appropriate to the two driving mechanisms. In either case, only a few climate models out of three dozen are found to be consistent with the observed NASST. In fact, by comparing time series from every possible pair of climate models, it is shown that climate 10 models not only differ from observations, but also differ signiﬁcantly from each other, unless the models come from the same institution. These model-to-model and model-to-observation discrepancies raise considerable challenges for reaching a consensus on the relative role of internal variability and forcing in decadal ﬂuctuations of NASST.


15
A basic question in climate modeling is whether a given model realistically simulates observations. This question can be addressed only in a statistical sense, since climate variables contain unpredictable weather variability. In the special case of a single random variable and independent samples, this question can be addressed by applying standard tests of equality of distributions, such as the t-test, F-test, Kolmogorov-Smirnov test, etc. However, in many climate studies, multiple variables are of concern, and the associated time series are serially correlated. 20 The above question is particularly acute in the context of North Atlantic Sea Surface Temperature (NASST) variability. The North Atlantic is an area of enhanced decadal predictability and thus a prime candidate for skillful predictions on multi-year time scales (Kushnir, 1994;Griffies and Bryan, 1997;Marshall et al., 2001;Latif et al., 2004Latif et al., , 2006Keenlyside et al., 2008).
However, the dominant mechanisms of North Atlantic variability remain unclear. Some argue that North Atlantic variability is a manifestation of internal variability and that its multi-decadal swings may be misattributed to external forcing, leading to 25 overestimates of climate sensitivity (DelSole et al., 2011;Tung and Zhou, 2013). Others argue that these multi-decadal swings are forced mostly by anthropogenic aerosols, with internal variability playing a weak role (Booth et al., 2012). However, simulations in which external forcing drives multi-decadal North Atlantic variability exhibit significant inconsistencies with observations, in terms of upper-ocean content, surface salinity, and spatial SST patterns (Zhang et al., 2013). Such discrepancies cast doubt on the claim that aerosols are a prime driver of twentieth-century North Atlantic variability. On the other hand, the 30 alternative theory that North Atlantic multi-decadal variability is dominated by internal variability also exhibits inconsistencies.
Specifically, climate models give inconsistent estimates of the magnitude and spatial structure of predictability in the North Atlantic (Branstator et al., 2012). However, the latter estimates are based on univariate analyses and not accompanied by a significance statement. As a result, it has proven difficult to reject a model based on its predictability or multi-year behavior, and models with widely different levels of predictability continue to remain equally plausible. 35 One of the most well-developed techniques for comparing time series is optimal fingerprinting (Bindoff et al., 2013). However, optimal fingerprinting is concerned mostly with detecting and attributing forced variability. Whether the variability that remains after removing forced variability is consistent with the model's internal variability typically is assessed through the residual consistency test of Allen and Tett (1999), which is a test of total variance. As a result, each individual component could have the wrong variance yet the total sum could match observations. For this reason, the residual consistency check is 40 not a stringent test of consistency of internal variability.
To advance this topic, we propose an objective criterion for deciding if two multivariate time series come from the same stochastic process. To simplify the problem, we consider only second-order stationary processes, in which the mean and covariance function are invariant to translations in time. Although non-stationarity is important in climate, a rigorous framework based on stationarity provides a starting point for comparing non-stationary processes. Also, climate models exhibit large One approach to testing equality of covariance functions is to express the function as a covariance matrix, and then apply standard tests of equality of covariance matrices (Anderson, 1984). However, these tests assume independent data and hence do 50 not account for serial correlation. This paper proposes an approach to testing equality of covariance functions that accounts for serial correlation. Our basic idea is to fit data to a Vector Autoregressive Model (VAR) model, then test equality of parameters.
The resulting test is the multivariate generalization of the test proposed by , which is part one of this paper series. Techniques for diagnosing differences in VAR models will be discussed in part three of this paper series.
2 Derivation of the Test 55 We first derive an exact test for equality of parameters for multivariate regression models, and then exploit asymptotic theory to derive a test for equality of multivariate AR processes.
We begin by considering the two multivariate regression models

60
where N 1 and N 2 are the respective sample sizes, S is the number of time series (variables), M the number of random predictors, and j 1 and j 2 are N 1 -and N 2 -dimensional vectors of ones. The dimensions of the above quantities are The matrices E 1 and E 2 are independent and distributed as In this paper, we are interested in comparing variability, hence we test hypotheses without restricting the intercept terms µ 1 and µ 2 . Accordingly, our null hypothesis is Let B 0 and Γ 0 denote the common regression parameters and noise covariance matrix under H 0 , respectively.
It turns out that inferences based on models (1)-(2) are identical to inferences based on models where the intercept terms µ 1 and µ 2 have been dropped and X, Y contain centered variables (i.e., the mean of each column 75 is subtracted from that column). For simplicity then, we hereafter consider models (3)-(4), with the understanding that X, Y refer to centered variables, and one extra degree of freedom is subtracted from each model to account for dropping µ 1 and µ 2 .
It proves convenient to define Maximum Likelihood Estimates (MLEs) of the regression parameters are (Anderson, 1984, chapter Regression models (3) and (4) involve a total of 2S + 2M S + S(S + 1) parameters. The model under H 0 involves 2S + M S + S(S + 1)/2 parameters. If H 0 is true, then standard asymptotic theory indicates that the deviance statistic D(1, 2) has a 105 chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between models: This distribution holds for both −2 log Λ total and −2 log(l 0 /l 1 /l 2 ) because the two are asymptotically equivalent. A more accurate sampling distribution for D(1, 2) can be derived using Monte Carlo techniques, but for our data, the resulting significance thresholds differ from that of (13) by less than 4%. Since the difference is small and does not affect any of our conclusions, the 110 asymptotic distribution (13) is satisfactory for our purposes, and the Monte Carlo technique is not discussed further.
A Vector Autoregressive Model (VAR) for a random vector z t is of the form where A 1 , . . . , A p are constant S × S matrices, and t is a Gaussian white noise process with covariance Γ. For a realization of length N , the above model can be written in the form (3) or (4), or equivalently Y T = B T X T + E T , using the identifications The MLEs of B and Γ, and the associated maximized likelihood, have been investigated extensively (see Brockwell and Davis, 1991;Lütkepohl, 2005;Box et al., 2008). For finite samples of serially correlated processes, the exact sampling distribution of 120 these quantities are prohibitively complicated, even for Gaussian distributions (Brockwell and Davis, 1991, ch 6). However, for large sample sizes, these distributions are consistent with those derived from linear regression theory derived above (Brockwell and Davis, 1991, theorem 8.1.2, sec. 8.9). Accordingly, we use (13) to test equality of VAR models for large sample size. We now apply our test to compare North Atlantic sea surface temperature (NASST) variability between models and observations. In particular, we focus on comparing multi-year internal variability. The question arises as to how to extract internal variability from observations. There is considerable debate about the magnitude of forced variability in this region, particularly the contribution due to anthropogenic aerosols (Booth et al., 2012;Zhang et al., 2013). Accordingly, we consider two possibilities: that the forced response is well represented by (1) a second-order polynomial or (2) a ninth-order polynomial over 130 1854-2018. These two assumptions will be justified shortly.
If NASST were represented on a typical 1 • × 1 • grid, then the number of grid cells would far exceed the available sample size. Accordingly, some form of dimension reduction is necessary. Given our focus on multi-year predictability, we consider only large-scale patterns. Accordingly, we project annual-mean NASST onto the leading eigenvectors of the Laplacian over the the sensitivity of our conclusions to assumptions about the source of multi-decadal variability. Note that only Laplacians 1 and 2 are sensitive to this assumption; time series for Laplacians three or higher are nearly the same in the two cases.
To match the dimensions of the observational time series, time series of length 165 years are drawn from the control simula-160 tions. Then, the time series are split in half (82 and 83 years), and time series from the earlier half are compared to those in the later half. The idea is that if a time series comes from the same source, then the null hypothesis is true and the method should detect a difference at the expected rate. Thirty-six CMIP5 models have PI control simulations of this length.

Model Selection
The process of deciding which variables to include in a regression model is called model selection. Here, model selection This criterion is consistent with small-sample corrected versions of Akaike's Information Criterion (AIC) for multivariate 170 model selection. In terms of the regression model (3), the criterion is where Σ 1 is the MLE of the covariance matrix of Y 1 and The procedure is to select the subset of X-and Y-variables that minimize MIC.   redundant and can be dropped, the test remains exactly the same as for a model in which all regression parameters are non-zero.
The main adverse consequence of including more variables than necessary is a loss of power (i.e., the test is less able to detect a difference when such a difference exists). In this study, we are more concerned with capturing important dependency structures than with loss of power, hence our choice to include the maximum number of Laplacian eigenvectors selected by MIC.
To be clear, our method can be applied to arbitrary VAR models. Our choice of VAR model is merely a compromise designed 190 to capture significant large-scale spatial dependencies while also adequately modeling temporal correlations.

Results
The deviance between ERSST 1854-1936 and 82-year segments of pre-industrial control simulations are shown in fig. 4. Also shown is the deviance between ERSST 1854-1936and ERSST 1937-2018 (first item on x-axis). The latter deviance falls below the 5% threshold and hence indicates no significant difference between two halves of ERSST, regardless of polynomial fit. This 195 result is consistent with the hypothesis that ERSST is a stationary VAR process after removing either a second-or ninth-order polynomial. Only one CMIP5 model is consistent with ERSST when a 2nd-order polynomial removed, and only two CMIP5  models are consistent with ERSST when a 9th-order polynomial is removed. We conclude that the vast majority of CMIP5 models do not generate realistic simulations of ERSST.
To explore the sensitivity of the above results to the number of Laplacians, deviances based on 10 Laplacian eigenvectors are  crepancies between models, and between models and observations, become more detectable as smaller-scale spatial structures are taken into account.

205
It is instructive to change the reference time series used for comparison. For instance, instead of comparing to ERSST, we compare each time series to time series from the CanESM2 model. The result of comparing every time series from the first half to every time series in the second half is summarized in fig. 6. The plotted numerical value is the deviance divided by its 5% critical value. Light and dark grey shading indicate significant differences at the 5% and 1%, respectively. Note that the diagonal is unshaded, indicating that the test correctly concludes no difference in VAR model when time series come from the 210 same CMIP5 model. Some CMIP5 models differ significantly from all other models (e.g, MRI, inmcm4), indicating that these models not only are inconsistent with observations, but also inconsistent with other CMIP5 models.
Interestingly, models from the same institution tend to be indistinguishable from each other (e.g,. GISS, NCAR, MPI, CMCC). This result is consistent with previous studies indicating that models developed at the same institution show more similarities with each other than with models developed at different institutes (Knutti et al., 2013). The deviances for 10 215 Laplacian vectors is shown in fig. 7. With more Laplacians comes the ability to detect more differences, so that almost all models are found to differ significantly from each other, unless they come from the same institution.
An alternative approach to summarizing dissimilarities between models is through dendograms (Knutti et al., 2013;Izenman, 2013). An example is shown in fig. 8 and constructed according to the following iterative procedure. First, each model is assigned to its own cluster. Next, the pair of VAR models with the smallest deviance are merged together to form a new 220 cluster. This clustering is indicated by a two-pronged "leaf" whose length equals the deviance. After VAR models are linked, an "agglomeration" rule is used to measure dissimilarity relative to a cluster. There is no unique choice for the agglomeration rule. We choose a standard one called complete-linkage clustering, in which the "distance" between two sets of clusters A and B is defined to be the maximum deviance between elements of each cluster: If this distance exceeds the significance threshold, then at least one deviance between elements of the clusters is known to be significant. Then, the smallest distance between any pair of clusters is linked together again. This process repeats until all models have been merged into a single cluster. The x-axis shows the distance and the y-axis shows the data sources. Each source name occurs twice because two time series are drawn from that source. As can be seen, the shortest leaves are associated with time series from the same source, or from models developed at the same institution.

230
The broad conclusions drawn from the dendogram in fig. 8 are similar to those drawn by Knutti et al. (2013). However, the dendogram constructed in Knutti et al. (2013) was based on the Kullback-Leibler (KL) divergence. Importantly, KL divergence requires estimating covariance matrices, but a significance test for this measure was not examined in Knutti et al. (2013). The dendogram developed here is attractive in that it is based on a rigorous statistical measure of dissimilarity that has a well-defined significance threshold that accounts for sampling variability and dependencies in space and time.

235
A perennial question is whether models should be weighted equally when making multi-model projections of the future.
Such weighting schemes lie outside the scope of this paper, but a related question is whether there exists a relation between a model's past performance and its predictions of the future. To investigate this question, we plot a model's deviance from ERSST against that model's Equilibrium Climate Sensitivity (ECS). ECS is the equilibrium change in annual mean global surface temperature following a doubling of atmospheric CO2 concentration. The result is shown in fig. 9. The figure also 240 shows a least-square line fit to the data points. The slope is statistically significant (p = 0.011) under the standard assumptions of independent and identically distributed residuals, which of course are dubious for our problem. Nevertheless, the correlation is negative, indicating that models that best simulate the past tend to have larger ECS. Extrapolating to zero deviance yields an ECS of 4.8 • C, although the probabilistic meaning (e.g.,confidence interval) of this result is unclear.

245
This paper proposed an approach to deciding if two multivariate time series come from the same stochastic process. The basic idea is to fit each time series to a Vector Autoregressive Model, and then test if the parameters of the models are equal. The likelihood ratio test for this problem and the associated sampling distributions were derived. This derivation leads to a deviance statistic that measures the difference between VAR processes and can be used to rank models based on their "closeness" to the VAR process inferred from observations. The test accounts for correlations in time and correlations between variables. In the 250 special case of a first-order VAR model, the model is a LIM and the test is effectively a "difference-in-LIM" test.
The test was illustrated by using it to compare annual mean North Atlantic SST variability in models and observations. When compared to observations, almost every CMIP5 model differs significantly from ERSST. This conclusion holds regardless of whether a second-or ninth-order polynomial in time is regressed out. Thus, our conclusion does not depend on whether multidecadal NASST variability is assumed to be forced or internal. By applying a hierarchical clustering technique, we 255 showed that time series from the same model, or from models from the same institution, tend to be clustered together, and in many cases differ significantly from other clusters. Our results are consistent with previous claims (Pennell and Reichler, 2011;Knutti et al., 2013) that the effective number of independent models is smaller than the actual number of models in a multi-model ensemble.
Recently, Mann et al. (2021) argued that there exists no compelling evidence for internal multidecadal oscillations. In partic-260 ular, oscillations seen in proxies of pre-industrial temperature can be explained as an artifact of volcanic activity that happens to project on the multidecadal frequency band. Under this hypothesis, variability in NASST is due primarily to external forcing, and removing a ninth-order polynomial would remove forced variability better than a second-order polynomial. Nevertheless, even after removing a ninth-order polynomial, our results show that most models are inconsistent with observations. Thus, work remains to be done to improve model simulations of internal variability.