12 Oct 2020
12 Oct 2020
Comparing climate time series – Part 1: Univariate test
 ^{1}Department of Atmospheric, Oceanic, and Earth Sciences, George Mason University, Fairfax, Virginia, USA
 ^{2}Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, USA
 ^{1}Department of Atmospheric, Oceanic, and Earth Sciences, George Mason University, Fairfax, Virginia, USA
 ^{2}Department of Applied Physics and Applied Mathematics, Columbia University, New York, New York, USA
Correspondence: Timothy DelSole (tdelsole@gmu.edu)
Hide author detailsCorrespondence: Timothy DelSole (tdelsole@gmu.edu)
This paper proposes a new approach to detecting and describing differences in stationary processes. The approach is equivalent to comparing autocovariance functions or power spectra. The basic idea is to fit an autoregressive model to each time series and then test whether the model parameters are equal. The likelihood ratio test for this hypothesis has appeared in the statistics literature, but the resulting test depends on maximum likelihood estimates, which are biased, neglect differences in noise parameters, and utilize sampling distributions that are valid only for large sample sizes. This paper derives a likelihood ratio test that corrects for bias, detects differences in noise parameters, and can be applied to small samples. Furthermore, if a significant difference is detected, we propose new methods to diagnose and visualize those differences. Specifically, the test statistic can be used to define a “distance” between two autoregressive processes, which in turn can be used for clustering analysis in multimodel comparisons. A multidimensional scaling technique is used to visualize the similarities and differences between time series. We also propose diagnosing differences in stationary processes by identifying initial conditions that optimally separate predictable responses. The procedure is illustrated by comparing simulations of an Atlantic Meridional Overturning Circulation (AMOC) index from 10 climate models in Phase 5 of the Coupled Model Intercomparison Project (CMIP5). Significant differences between most AMOC time series are detected. The main exceptions are time series from CMIP models from the same institution. Differences in stationary processes are explained primarily by differences in the mean square error of 1year predictions and by differences in the predictability (i.e., Rsquare) of the associated autoregressive models.
 Article
(3508 KB)  Companion paper
 BibTeX
 EndNote
Climate scientists often confront questions of the following types.

Does a climate model realistically simulate a climate index?

Do two climate models generate similar temporal variability?

Did a climate index change its variability?

Are two power spectra consistent with each other?
Each of the above questions requires deciding whether two time series come from the same stochastic process. Although numerous papers in the weather and climate literature address questions of the above types, the conclusions often are based on visual comparison of estimated autocovariance functions or power spectra without a rigorous significance test. Lund et al. (2009) provide a lucid review of some objective tests for deciding whether two time series come from the same stationary process. An additional test that was not considered by Lund et al. (2009) is to fit autoregressive models to time series and then to test differences in parameters (Maharaj, 2000; Grant and Quinn, 2017). Grant and Quinn (2017) showed that this test has good power and performs well even when the underlying time series are not from an autoregressive process. The latter test has been applied to such problems as speech recognition but not to climate time series. The purpose of this paper is to further develop this test for climate applications.
The particular tests proposed by Maharaj (2000) and Grant and Quinn (2017) test equality of autocorrelation without regard to differences in variances. However, in climate applications, differences in variance often are of considerable importance. In addition, these tests employ a sampling distribution derived from asymptotic theory and therefore may be problematic for small sample sizes. In this paper, the likelihood ratio test is derived for the more restrictive case of equality of noise variances, which leads to considerable simplifications, including a test that is applicable for small sample sizes. Further comments about how our proposed test compares with previous tests will be discussed below.
If the time series are deemed to come from different processes, then it is desirable to characterize those differences in meaningful terms and to group time series into clusters. Piccolo (1990) and Maharaj (2000) propose such classification procedures. Following along these lines, our hypothesis test suggests a natural measure for measuring the distance between two stationary processes that can be used for clustering analysis. For multimodel studies, this distance measure can be used to give a graphical summary of the similarities and differences between time series generated by different models. In addition, we extend the interpretation of such summaries considerably by proposing a new approach to diagnosing differences in stationary processes based on finding the initial condition that optimally separates onestep predictions.
In this section we review previous methods for comparing two time series. These methods are based on the theory of stochastic processes and assume that the joint distribution of the values of the process at any collection of times is multivariate normal and stationary. Although nonstationarity is a prominent feature in many climate time series, a stationary framework is a natural starting point for nonstationary generalizations. Stationarity implies that the expectation at any time is constant, and the secondorder moments depend only on the difference times (more precisely, this is called weaksense stationarity). Accordingly, if X_{t} is a stationary process, then the mean is independent of time,
and the timelagged autocovariance function depends only on the difference in times,
Because a multivariate normal distribution is fully characterized by its first and second moments, the stochastic process is completely specified by μ_{X} and the autocovariance function c_{X}(τ). Stationarity further implies that the autocovariance function is an even function of the lag τ. Following standard practice, we consider discrete time series where values are available at N_{X} equally spaced time steps ${X}_{\mathrm{1}},{X}_{\mathrm{2}},\mathrm{\dots},{X}_{{N}_{X}}$.
Now consider another stationary process Y_{t}, with mean μ_{Y} and autocovariance function c_{Y}(τ). The problem we consider is this: given sample time series ${X}_{\mathrm{1}},\mathrm{\dots},{X}_{{N}_{X}}$ and ${Y}_{\mathrm{1}},\mathrm{\dots},{Y}_{{N}_{Y}}$, decide whether the two time series come from the same stationary process. For stationary processes, we often are not concerned with differences in means (e.g., in climate studies, these often are eliminated through “bias corrections”), hence we allow μ_{X}≠μ_{Y}. Therefore, our problem is equivalent to deciding whether the two time series have the same autocovariance function, i.e., deciding whether
This problem can be framed equivalently as deciding whether two stationary processes have the same power spectrum. Recall that the power spectrum is the Fourier transform of the autocovariance function. We define the power spectrum of X_{t} as
The spectrum of Y_{t} is defined similarly and denoted as p_{Y}(ω). Because the Fourier transform is invertible, equality of autocovariances is equivalent to equality of spectra. Thus, our problem can be framed equivalently as deciding whether
Estimates of the power spectrum are based on the periodogram (Box et al., 2008).
The above hypothesis differs from hypotheses about a single process that are commonly tested with autocovariance functions or power spectra. For instance, the hypothesis of vanishing autocorrelation often is assessed by comparing the sample autocorrelation function to $\pm \mathrm{1.96}/\sqrt{N}$, where N is the length of the time series (e.g., Brockwell and Davis, 2002, chap. 1). In the spectral domain, the hypothesis of white noise often is tested based on the Kolmogorov–Smirnov test, in which the standardized cumulative periodogram is compared to an appropriate set of lines (e.g., Jenkins and Watts, 1968, Sect. 6.3.2). These tests consider hypotheses about one process. In contrast, our hypothesis involves a comparison of two processes.
Coates and Diggle (1986) derived spectraldomain tests for equality of stationary processes. The underlying idea of these tests is that equality of power spectra implies that their ratio is independent of frequency and therefore indistinguishable from white noise. This fact suggests that standard tests for white noise can be adapted to periodogram ratios. Coates and Diggle (1986) derive a second parametric test that assumes that the log ratio of power spectra is a loworder polynomial of frequency.
Lund et al. (2009) explored the above tests in the context of station data for temperature. They found that spectral methods have relatively low statistical power – that is, the methods are unlikely to detect a difference in stationary processes when such a difference exists. Our own analysis using the data discussed in Sect. 5 is consistent with Lund et al. (2009) (not shown). The fact that spectraldomain tests have less statistical power than timedomain tests is not surprising. After all, spectral tests are based on comparing periodograms in which the number of unknown parameters grows with sample size. In particular, for a time series of length N, the periodogram combines coefficients for sines and cosines of a Fourier transform into N∕2 coefficients for the amplitude. Thus, a time series of length N=64 yields a periodogram with 32 amplitudes; a time series of length N=512 yields a periodogram with 256 amplitudes; and so on. Because the number of unknowns grows with sample size, the sampling error of the individual periodogram estimates does not decrease with sample size. Typically, sampling errors are reduced by smoothing periodogram estimates over frequencies, but the smoothing makes implicit assumptions about the shape of the underlying power spectrum. In the absence of hypotheses to constrain the power spectra, the large number of estimated parameters results in low statistical power.
Lund et al. (2009) also develop and discuss a timedomain test. This test is based on the sample estimate of the autocovariance function
where ${\widehat{\mathit{\mu}}}_{X}$ is the sample mean of X_{t} based on sample size N_{X}. The analogous estimate for the autocovariance function of Y_{t} is denoted ${\widehat{c}}_{Y}\left(\mathit{\tau}\right)$. The test is based on differences in autocovariance functions up to lag τ_{0}. That is, the test is based on
In the case of equal sample sizes ${N}_{X}={N}_{Y}=N$, Lund et al. (2009) propose the statistic
where $\widehat{\mathbf{W}}$ is an estimate of the covariance matrix between autocovariance estimates, namely
and $\widehat{c}\left(\mathit{\tau}\right)$ is the pooled autocovariance estimate
The reasoning behind this statistic is lucidly discussed in Lund et al. (2009) and follows from standard results in time series analysis (see proposition 7.3.2 in Brockwell and Davis, 1991). Under the null hypothesis of equal autocovariance functions, and for large N, the statistic χ^{2} has an approximate chisquare distribution with τ_{0}+1 degrees of freedom. A key parameter in this statistic is the cutoff parameter K in the sum for $\widehat{\mathbf{W}}$. Lund et al. (2009) propose using $K={N}^{\mathrm{1}/\mathrm{3}}$ but acknowledge that this rule needs further study.
We have applied Lund et al.'s test to the numerical examples discussed in Sect. 5 and find that $\widehat{\mathbf{W}}$ is not always positive definite. In such cases, χ^{2} is not guaranteed to be positive and therefore does not have a chisquare distribution. The lack of positive definiteness sometimes can be avoided by choosing a slightly different K, but in many of these cases the resulting χ^{2} depends sensitively on K. This sensitivity arises from the fact that $\widehat{\mathbf{W}}$ is close to singular, so changing K by one unit can change χ^{2} by more than an order of magnitude. Conceivably, some modification of $\widehat{\mathbf{W}}$ or some rule for choosing K can remove this sensitivity to K, but without such modification this test is considered unreliable. Further comments about this are given at the end of Sect. 5.
Several authors have proposed approaches to comparing stationary processes based on assuming that the time series come from autoregressive models of order p, denoted AR(p). Accordingly, we consider the two AR(p) models
where the ϕs are autoregressive parameters, the γs are constants that control the mean, and the ϵ_{t}s are independent Gaussian white noise processes with zero mean and variances
By construction, X_{t} and Y_{s} are independent for all t and s. Our method can be generalized to handle correlations between X_{t} and Y_{s}, specifically by using a vector autoregressive model with coupling between the two processes, but this generalization will not be considered here. For the above models, there exists a onetoone relation between the first p+1 autocovariances $c\left(\mathrm{0}\right),\mathrm{\dots},c\left(p\right)$ and the parameters ${\mathit{\varphi}}_{\mathrm{1}},\mathrm{\dots},{\mathit{\varphi}}_{p},{\mathit{\sigma}}^{\mathrm{2}}$. This relation is expressed through the Yule–Walker equations and a balance equation for variance (Box et al., 2008). Therefore, equality of the first p+1 autocovariances is equivalent to equality of the AR(p) parameters. Furthermore, estimates of the parameters depend only on the first p+1 autocovariances. The remaining autocovariances of an AR(p) process $c(p+\mathrm{1}),c(p+\mathrm{2}),\mathrm{\dots}$ are obtained through recursion formulas that depend only on the first p+1 autocovariances. Importantly, the autocovariance function of an AR(p) process does not depend on γ; rather, γ controls the mean of the process. In climate studies, the mean often is removed from time series before any analysis, hence we do not require the means to be equal; i.e., we allow γ_{X}≠γ_{Y}.
In previous tests, different authors make different assumptions about the noise variance. Maharaj (2000) allows the noise in the two models to differ in their variances and to be correlated at zero lag. Grant and Quinn (2017) allow the noise variances to differ but assume the noises are independent at zero lag. In the climate applications we have in mind, differences in variance are important, hence we are interested in detecting differences in noise variances. Accordingly, our null hypothesis of equivalent stationary processes is the following:
The alternative hypothesis is that at least one of the above parameters differs between the two processes. Constraining processes to be AR(p) means that each process is characterized by p+1 parameters, in contrast to the unconstrained case in which the autocovariance function or power spectrum is characterized by an infinite number of parameters.
The likelihood ratio test for hypothesis H_{0} is derived in the Appendix. The derivation differs from that of Grant and Quinn (2017) primarily by including the hypothesis σ_{X}=σ_{Y} in the null hypothesis, which leads to considerable simplifications in estimation and interpretation. Furthermore, we propose a modification to correct for biases and a Monte Carlo technique for determining significance thresholds. We describe the overall procedure here and direct the reader to the Appendix for details. The test relies on sample estimates of the above parameters, so we define these first. The autoregressive parameters in Eq. (11) are estimated using the method of least squares, yielding the least squares estimates ${\widehat{\mathit{\varphi}}}_{\mathrm{1}}^{X},\mathrm{\dots},{\widehat{\mathit{\varphi}}}_{p}^{X},{\widehat{\mathit{\gamma}}}^{X}$. (The least squares method arises when maximizing the conditional likelihood, which is conditional on the specific realization of ${X}_{\mathrm{1}},\mathrm{\dots},{X}_{p}$ in the data. For large sample sizes, the conditional least squares estimates will be close to the familiar maximum likelihood estimates.) The noise variance for X_{t} is estimated using the unbiased estimator
where ${\mathit{\nu}}_{X}={N}_{X}\mathrm{2}p\mathrm{1}$ is degrees of freedom, computed as the difference between the sample size N_{X}−p and the number of estimated parameters p+1. This noise variance estimate is merely the residual variance of the AR model and is readily available from standard time series analysis software. Similarly, the method of least squares is used to estimate the AR parameters for Y_{t} in Eq. (12), yielding the least squares estimates ${\widehat{\mathit{\varphi}}}_{\mathrm{1}}^{Y},\mathrm{\dots},{\widehat{\mathit{\varphi}}}_{p}^{Y},{\widehat{\mathit{\gamma}}}^{Y}$ and the noise variance estimate
where ${\mathit{\nu}}_{Y}={N}_{Y}\mathrm{2}p\mathrm{1}$. Then, the test of H_{0} is based on the statistic
where
and the remaining variables are defined in Table 1. Under the null hypothesis H_{0}, it is shown in the Appendix that F_{σ} and F_{ϕ∣σ} have the following approximate distributions:
Furthermore, the two statistics are independent. Critical values for D_{σ} and D_{ϕ∣σ} are obtained individually from the critical values of the Fdistribution, taking care to use the correct onetailed or twotailed test (see Appendix, particularly Eq. A31). In principle, the exact sampling distribution of D_{ϕ,σ} can be derived analytically because D_{ϕ,σ} is the sum of two random variables with known distributions. However, this analytic calculation is cumbersome, whereas the quantiles of ${D}_{\mathit{\sigma}}+{D}_{\mathit{\varphi}\mid \mathit{\sigma}}$ can be estimated accurately and quickly by Monte Carlo techniques. Essentially, one draws random samples from ${F}_{{\mathit{\nu}}_{X},{\mathit{\nu}}_{Y}}$ and ${F}_{p,{\mathit{\nu}}_{X}+{\mathit{\nu}}_{Y}}$, substitutes these into Eqs. (18)–(19), evaluates D_{ϕ,σ} in Eq. (17), and then repeats this many times (e.g., 10 000 times). Note that the required repetitions do not grow with sample size N_{X} and N_{Y} since the Fdistributions can be sampled directly. The 5 % significance threshold for D_{ϕ,σ}, denoted D_{crit}, is then obtained from the 95th percentile of the Monte Carlo samples. Although a Monte Carlo technique is used to obtain critical values, the test still constitutes a test for small sample sizes.
The above test assumes that time series come from an AR(p) process of known order and are excited by Gaussian noise. Grant and Quinn (2017) use Monte Carlo simulations to show that the test is robust to nonGaussianity. Since our proposed method assumes normal distributions through Eqs. (20) and (21), its robustness to departures to Gaussian remains an open question and is a topic for future study. If the order of the process is unknown and has to be selected or the time series does not come from an autoregressive model (e.g., the process is a moving average process), then the type I error rate does not match its nominal value. In such cases, Grant and Quinn (2017) propose using some sufficiently large value of p, such as
where ⌊k⌋ denotes the largest integer smaller than or equal to k. The resulting AR(p^{*}) model can capture the first ${p}^{*}+\mathrm{1}$ autocovariances regardless of their true origin. Of course, there will be some loss of power when a test based on p^{*} is applied to a time series that comes from an AR(p) model with $p<{p}^{*}$. This criterion seems to work well in our examples, as discussed in more detail in Sect. 5.
If a difference in the AR process is detected, then we would like to interpret those differences. After all, such comparisons often are motivated to validate climate models, so if a discrepancy is found we would like to describe those differences in meaningful terms. The fact that the statistic D_{ϕ,σ} can be expressed as the sum of two independent terms suggests that it is natural to consider the two terms separately. The first term D_{σ} depends only on the difference in noise variance estimates ${\widehat{\mathit{\sigma}}}_{X}^{\mathrm{2}}$ and ${\widehat{\mathit{\sigma}}}_{Y}^{\mathrm{2}}$. The noise variance ${\mathit{\sigma}}_{X}^{\mathrm{2}}$ is the onestep prediction error of the AR model. After all, if the AR parameters in Eq. (11) were known exactly, then the conditional mean would be
where “S” stands for “signal”, and the onestep prediction error would be
Comparing the variance of onestep prediction errors is more statistically straightforward than comparing variances of the original time series because prediction errors are approximately uncorrelated, since they are the residuals of the AR model, which acts as a prewhitening transformation. In contrast, comparing variances of time series is not straightforward because of serial correlation. In practice, the residuals are only approximately uncorrelated because the parameter values are only approximate.
The second term D_{ϕ∣σ} vanishes when ${\widehat{\mathit{\varphi}}}_{X}={\widehat{\mathit{\varphi}}}_{Y}$ and is positive otherwise, hence it clearly measures the difference in AR parameters. To further interpret this term, it is helpful to partition the AR model for X_{t} into two parts, an unpredictable part associated with the noise ${\mathit{\u03f5}}_{t}^{X}$ and a predictable part S_{X}. The predictable part is the response to the “initial condition”
With this notation, ${\widehat{S}}_{X}={\mathit{u}}^{T}{\widehat{\mathit{\varphi}}}_{X}$ is the estimated predictable response of X_{t}. The estimated predictable response of Y_{t} can be defined analogously and denoted ${\widehat{S}}_{Y}$. The initial conditions u may be drawn from the stationary distribution of X_{t}, the stationary distribution Y_{t}, or some mixture of the two. In fact, the covariance matrix Σ_{HM} defined in Eq. (A40) is proportional to the “harmonic mean” of the sample timelagged covariance matrices for X_{t} and Y_{t}. Assuming that the initial condition u is drawn from a distribution with covariance matrix Σ_{HM} independently of u, then
If the initial condition for X_{t} and Y_{t} is contrived to be the same u, then
Comparing this expression to F_{ϕ∣σ} in Table 1 suggests that F_{ϕ∣σ} is a kind of signaltonoise ratio, where the “signal” is a difference in predictable responses for the same initial condition.
This difference in predictable responses, and hence F_{ϕ∣σ}, can be related to the difference in Rsquares of the individual time series. To see this, expand Eq. (27) as
The Cauchy–Schwarz inequality implies that
Hence, the above two expressions imply that
In the special case of equal noise variances, the above inequality becomes
where SNR_{X} and SNR_{Y} are the signaltonoise ratios of the two AR models:
Note that pF_{ϕ∣σ} is an estimate of the lefthand side of Eq. (31). Recall that Rsquare is defined as one minus the ratio of the noise variance to the total variance:
Because Rsquare and signaltonoise ratio are onetoone, inequality Eq. (31) implies the following: if the noise variances are identical, then a large difference in predictabilities (i.e., Rsquares) necessarily implies a tendency for F_{ϕ∣σ} to be large. This suggests that it may be informative to compare Rsquares. We use the sample estimate
which is always between 0 and 1.
It should be recognized that a difference in Rsquare is a sufficient, but not necessary, condition for a difference in predictable responses. This fact can be seen from the fact that Rsquare for an AR(p) model is
In particular, two processes may have the same Rsquare because the combination of ϕs yields the same Rsquare, but the specific values of the ϕs may differ. Although a difference in Rsquares is not a perfect indicator of F_{ϕ∣σ}, it is at least a sufficient condition and therefore worth examination.
On the other hand, if the Rsquares are the same, a difference in process still could be detected and should be explained. Simply identifying differences in ϕs would be unsatisfying because those differences have a complicated relation to the statistical characteristics of the process when p>1. Accordingly, we propose the following approach, which despite its limitations may still be insightful. Our basic idea is to choose initial conditions that maximize the mean square difference in predictable responses. To be most useful, we want this choice of initial condition to account for a multimodel comparison over M models. Accordingly, we define the mean square difference between predictable responses as
where
Our goal is to choose the initial condition u that maximizes Γ[u]. The initial condition u must be constrained in some way, otherwise Γ[u] is unbounded and there is no maximum. If the initial conditions are drawn from a distribution with covariance matrix Σ_{M}, then an appropriate constraint is to fix the associated Mahalanobis distance:
The problem is now to maximize u^{T}Au subject to the constraint ${\mathit{u}}^{T}{\mathbf{\Sigma}}_{M}^{\mathrm{1}}\mathit{u}=\mathrm{1}$. This is a standard optimization problem that is merely a generalization of principal component analysis (also called empirical orthogonal function analysis). The solution is obtained by solving the generalized eigenvalue problem
This eigenvalue problem yields p eigenvalues and p eigenvectors. The eigenvalues can be ordered from largest to smallest, ${\mathit{\lambda}}_{\mathrm{1}}\ge {\mathit{\lambda}}_{\mathrm{2}}\ge \mathrm{\dots}\ge {\mathit{\lambda}}_{p}$, and the corresponding eigenvalues can be denoted ${\mathit{u}}_{\mathrm{1}},{\mathit{u}}_{\mathrm{2}},\mathrm{\dots},{\mathit{u}}_{p}$. The first eigenvector u_{1} gives the initial condition that maximizes the sum square difference in predictable responses; the second eigenvector u_{2} gives the initial condition that maximizes Γ subject to the condition that ${\mathit{u}}_{\mathrm{2}}^{T}{\mathbf{\Sigma}}_{M}^{\mathrm{1}}{\mathit{u}}_{\mathrm{1}}=\mathrm{0}$; and so on. The eigenvalues ${\mathit{\lambda}}_{\mathrm{1}},\mathrm{\dots},{\mathit{\lambda}}_{p}$ give the corresponding values of Γ. The eigenvectors can be collected into the p×p matrix
Because the matrices A and Σ_{M} are symmetric, the eigenvectors can be chosen to satisfy the orthogonality property
Summing Γ over all eigenvectors gives
Note the similarity of the final expression to Eq. (27). This similarity implies that if only two models are compared and Σ_{M}=Σ_{HM}, then the solution exactly matches the variance of signal differences (27). Unfortunately, Σ_{HM} and the noise variance ${\widehat{\mathit{\sigma}}}^{\mathrm{2}}$ depend on $(m,{m}^{\prime})$, so it is difficult to generalize the optimal initial condition to explain all pairwise values of F_{ϕ∣σ}. We suggest using a covariance matrix that is the harmonic mean across all M time series:
where ${\widehat{\mathbf{\Sigma}}}_{m}$ is the sample covariance matrix of the initial condition Eq. (25) for the mth time series. Finally, we note that
Thus, λ_{k}∕Γ_{sum} is the fraction of sum total variance of signal differences explained by the kth eigenvector. Conceptually, each eigenvector u_{k} can be interpreted as an initial condition that has been optimized to separate predictable responses.
To illustrate the proposed method, we apply it to an index of the Atlantic Meridional Overturning Circulation (AMOC). This variable is chosen because it is considered to be a major source of decadal variability and predictability (Buckley and Marshall, 2016; Zhang et al., 2019). However, the AMOC is not observed directly with sufficient frequency and consistency to constrain its variability on decadal timescales. As a result, there has been a heavy reliance on coupled atmosphere–ocean models to characterize AMOC variability and predictability. The question arises as to whether simulations of the AMOC by different models can be distinguished and, if so, how they differ.
The data used in this study come from preindustrial control runs from Phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al., 2010). These control simulations lack yeartoyear changes in forcing and thereby permit a focus on internal variability without confounding effects due to anthropogenic climate change. We consider only models that have preindustrial control simulations spanning at least 500 years and contain meridional overturning circulation as an output variable. Only 10 models meet these criteria. An AMOC index is defined as the annual mean of the maximum meridional overturning streamfunction at 40^{∘} N in the Atlantic. A thirdorder polynomial over the 500 years is removed to eliminate climate drift. The AMOC index from the 10 models is shown in Fig. 1. Based on visual comparisons, one might perceive differences in amplitude (e.g., the MPI models tend to have larger amplitudes than other models) and differences in the degree of persistence (e.g., highfrequency variability is more evident in INMCM4 than in other models), but whether these differences are statistically significant remains to be established.
To compare autoregressive processes, it is necessary to select the order of the autoregressive model. As discussed in Sect. 3, we use Eq. (22), which for ν=1.0 gives ${p}^{*}=\mathrm{5}$. One check on this choice is whether the residuals are white noise. We find that AR(5) is adequate for all models except MRI, in the sense that the residuals reveal no serious departures from white noise according to visual inspection and by the Ljung–Box test (Ljung and Box, 1978). Another check is to compute Akaike's information criterion (AIC; Box et al., 2008) for the two halves of the data (i.e., 250 years). Some representative examples are shown in Fig. 2. As can be seen, AIC is a relatively flat function of model order, hence small sampling variations can lead to large changes in order selection. Indeed, the order selected by AIC is sensitive to which half of the data is used. Nevertheless, because AIC is nearly a flat function of order, virtually any choice of order beyond AR(2) can be defended. The highest order selected by AIC is p=11. While we have performed our test for AR(11), this order would be a misleading illustration of the method because one might assume that 11 lags are necessary to identify model differences. Thus, in the results that follow, we choose AR(5) for all cases and bear in mind that comparisons with MRI may be affected by model inadequacy.
The choice of AR(5) means that all information relevant to deciding differences in stationary processes is contained in the first six values of the sample autocovariance function $\widehat{c}\left(\mathrm{0}\right),\mathrm{\dots},\widehat{c}\left(\mathrm{5}\right)$. The sample autocovariance function for each AMOC index is shown in Fig. 3. Recall that the zerolag autocovariance $\widehat{c}\left(\mathrm{0}\right)$ is the sample variance. The figure suggests that the time series have different variances and different decay rates, but it is unclear whether these differences are significant. Note that a standard differenceinvariance test cannot be performed here because the time series are serially correlated. One might try to modify the standard Ftest by adjusting the degrees of freedom, as is sometimes advocated in the ttest (Zwiers and von Storch, 1995), but the adjustment depends on the autocorrelation function that we are trying to compare. An alternative approach is to prewhiten the data based on the AR fit and then test differences in variance, but this is exactly equivalent to our test based on F_{σ}.
An alternative approach to comparing stationary processes is to compare power spectra. Power spectra of the AMOC time series are shown in Fig. 4. The spectra have been offset by a multiplicative constant to reduce overlap (otherwise the different curves would obscure each other). While many differences can be seen, the question is whether those differences are significant after accounting for sampling uncertainties. The two spectral tests discussed in Sect. 2 find relatively few differences between CMIP5 models, although they do indicate that time series from INMCM4 and MPI differ from those of other CMIP5 models (not shown). Presumably, these differences arise from the fact that the INMCM4 time series is closer to white noise and that the MPI time series have larger total variance than those from other CMIP5 models.
To illustrate our proposed method, we perform the following analysis. First, the AMOC index from each CMIP5 model is split into equal halves, each 250 years long. Then, each time series from the first half is compared to each time series in the second half. Some of these comparisons will involve time series from the same CMIP5 model. Our expectation is that no difference should be detected when time series from two different halves of the same CMIP5 model are compared. To summarize the comparisons, we show in Fig. 5 a matrix of the biascorrected deviance statistic D_{ϕ,σ} for every possible model comparison. This statistic is a measure of the “distance” between stationary process. The two shades of gray indicate a significant difference in AR process at the 5 % or 1 % level. Values along the diagonal correspond to comparisons of AMOC time series from the same CMIP5 model. No significant difference is detected when time series come from the same model. In contrast, significant differences are detected in most of the cases when the time series come from different CMIP5 models. Interestingly, models from the same institution tend to be indistinguishable from each other. For instance, the MPI models are mostly indistinguishable from each other, and the CCSM and CESMBGC models (from the National Center for Atmospheric Research, NCAR) are mostly indistinguishable from each other. We say “mostly” because the difference depends on which half of the simulation is used in the comparison. For instance, a difference is detected when comparing the first half of the CESM1BGC time series to the second half to the CCSM4 time series, but not vice versa. The models have been ordered to make natural clusters easy to identify in this figure. Aside from a few exceptions, the pattern of shading is the same for AR(2) and AR(10) models (not shown), demonstrating that our conclusion regarding significant differences in AR process is not sensitive to model order. Also, the pattern of shading is similar if the comparison is based on only from the first half of the simulations, or only from the second half of the simulations (not shown), indicating that our results are not sensitive to sampling errors.
To visualize natural clusters more readily, we identify a set of points in a cartesian plane whose pairwise distances match the above distances as closely as possible. These points can be identified using a procedure called multidimensional scaling (Izenman, 2013). The procedure is to compute the deviance statistic between every possible pair of time series. Because there are 10 CMIP models and time series from each model is split in half, there are 20 time series being compared. Thus, the deviance statistic for every possible pair can be summarized in a 20×20 distance matrix. From this matrix, multidimensional scaling can find the set of points in a 20dimensional space that has this distance matrix. Moreover, it can find the points in a twodimensional space whose distance matrix most closely approximates the original distance matrix in some objective sense. In our case, 87 % of original distance matrix can be represented by twodimensional points. These points are shown in Fig. 6. Although 87 % of the distance matrix is represented in this figure, isolated points may have relatively large discrepancies. The average discrepancy is about 2 units, with the largest discrepancy being about 3.6 units between MRI and MPILR. An attractive feature of this representation is that the decision rule for statistical significance, ${D}_{\mathit{\varphi},\mathit{\sigma}}>{D}_{\text{crit}}$, is approximately equivalent to drawing a circle of radius $\sqrt{{D}_{\text{crit}}}$ around a point and identifying all the points that lie outside that circle. This geometric procedure would be exact in a 20dimensional space, but is only approximate in the 2dimensional space shown in Fig. 6. The figure suggests that the MPI models and INMCM4 form their own natural clusters. For other models, a particular choice of clusters is shown in the figure, but this is merely an example, and different clusters with different groupings could be justified. Note that all line segments connecting results from the same CMIP model are shorter than the circle radius, indicating no significant difference between time series from the same CMIP model.
It is interesting to relate the above differences to the autocovariance functions shown in Fig. 3. It is likely that INMCM4 differs from other models because its autocovariance function decays most quickly to zero. The MPI models are distinguished from the others by their large variances. Interestingly, note that the autocovariance function for NorESM1M is intermediate between that of two MPI models, yet the test indicates that NorESM1M differs from the MPI models. Presumably, the MPI models are indistinguishable because their autocovariance functions have the same shape, including a kink at lag 1, whereas the NorESM1M model has no kink at lag 1. This example illustrates that the test does not behave like a mean square difference between autocovariance functions, which would cluster NorESM1M with the MPI models.
It is worth clarifying how the above clustering technique differs from those in previous studies. Piccolo (1990) proposed a distance measure based on the Euclidean norm of the difference in AR parameters, namely
In contrast, our hypothesis test uses a Mahalanobis norm for measuring differences in AR parameters, where the covariance matrix is based on the sample covariance matrices of the timelagged data (see Eq. A44). While the Euclidean norm Eq. (45) does have some attractive properties as discussed in Piccolo (1990), it is inconsistent with the corresponding hypothesis test for differences in AR parameters. As a result, the resulting cluster may emphasize differences with large Euclidean norms that are insignificant, or may downplay differences in small Euclidean norms that are significant. In contrast, the distance measure used in our study is consistent with a rigorous hypothesis test. Maharaj (2000) proposes a classification procedure that is consistent with hypothesis testing, but that hypothesis test does not account for differences in noise variances.
Having identified differences between stationary processes, it is of interest to relate those differences to standard properties of the time series. Recall that our measure of the difference between stationary processes D_{ϕ,σ} is the sum of two other measures, namely D_{σ} and D_{ϕ∣σ}. Measure D_{σ} depends only on the noise variances, that is, it depends on the mean square error of a 1year prediction. In contrast, D_{ϕ∣σ} measures the difference in predictable responses of the process. As discussed in Sect. 4, we suggest examining differences in Rsquare. A graph of the noise variance plotted against Rsquare of each autoregressive model is shown in Fig. 7. The error bars show the critical distance for a significance difference in noise variance (lower left) and for a difference in Rsquare (upper right). First note that the projections of line segments onto the x axis or y axis are shorter than the respective error bars, indicating that differences in noise and differences in predictability are insignificant when estimated from the same CMIP5 model. Second, note that the relative positions of the dots are similar to those in Fig. 6. Thus, the noise variance and Rsquare appear to be approximate “principal coordinates” for distinguishing univariate autoregressive processes. Third, using noise variances alone, the MPI models would be grouped together, then INMCM4 would be grouped by itself, while at the bottom end CanESM2 and CCSM4 would be grouped together. These groupings are consistent with the clusters identified above using the full distance measure D_{ϕ,σ}, suggesting that differences in noise variances explain the major differences between stationary processes. Fourth, the AR models have Rsquare values mostly between 0.25 and 0.5. Time series from INMCM4 have the smallest Rsquare values while time series from NorESM1 have the largest Rsquare values.
An alternative approach to describing differences in AR parameters is to show differences in response to the same initial condition. We use the optimization method discussed in Sect. 4 to find the initial condition that maximizes the sum square difference in responses. The result is shown in Fig. 8. Essentially, the models separate by predictability – the order of the models from top to bottom closely tracks the order of the models based on Rsquare seen in Fig. 7. For this initial condition, INMCM4 damps nearly to zero in one time step, whereas NorESM1M decays the slowest among the models, consistent with expectations from Rsquare. This initial condition explains 82 % of the differences in response. Because optimal initial conditions form a complete set, an arbitrary initial condition can be represented as a linear combination of optimal initial conditions. If the AR models were used to predict an observational time series with covariance matrix Σ_{M}, then most differences between model predictions could be explained by the differences in response to a single optimal initial condition.
We end with a brief summary of our exploration of the χ^{2} statistic proposed by Lund et al. (2009). As mentioned earlier, the result is sensitive to the choice of K. However, instead of summing based on the cutoff parameter K, we summed over all possible lags, and set all sample autocovariances beyond lag p to zero. Using this approach, we find that the resulting χ^{2} statistic gives results very similar to ours for p=5, including clustering the MPI models with each other, and separating them from NorEMS1M. We have not investigated this procedure sufficiently to propose it as a general rule but mention it to suggest the possibility that some alternative rule for computing the sum Eq. (9) may yield reasonable results.
This paper examined tests for differences in stationary processes and proposed a new approach to characterizing those differences. The basic idea is to fit each time series to an autoregressive model and then test for differences in parameters. The likelihood ratio test for this comparison was derived in Maharaj (2000) and Grant and Quinn (2017). We have modified the test to correct for certain biases and to include a test for differences in noise variance. The latter test is of major interest in climate applications and leads to considerable simplifications in estimation and interpretation. Furthermore, the proposed test is applicable to small samples. In addition, we propose new approaches to interpreting and visualizing differences in stationary processes. The procedure was illustrated on AMOC time series from preindustrial control simulations of 10 models in the CMIP5 archive. Based on time series 250 years in length, the procedure was able to distinguish about four clusters of models, where time series from the same CMIP5 model are grouped together in the same cluster. These clusters are identified easily using a multidimensional technique. The clusters obtained by this method were not sensitive to the order of the autoregressive model, although the number of significant differences decreases with AR order due to the larger number of parameters being estimated. Further analysis shows that these clusters can be explained largely by differences in 1year prediction errors in the AR models and differences in Rsquare.
The proposed method can be used to compare any stationary time series that are well fit by an autoregressive model, which includes most climate time series. Thus, this method could be used to decide whether two climate models generate the same temporal variability. A natural question is whether this approach can be generalized to compare multivariate time series. This generalization will be developed in Part 2 of this paper. The method also could be used to compare model simulations to observations, provided that the stationarity assumption is satisfied. If nonstationarity is strong, then this method would need to be modified to account for such nonstationarity, such as adding exogenous terms to the AR model. The likelihoodratio framework can accommodate such extensions and will be developed in Part 3 of this paper.
In this Appendix, we derive a test for equality of parameters of autoregressive models based on the likelihood ratio test. The derivation is similar to that in Maharaj (2000) and Grant and Quinn (2017), except modified to test for differences in noise variance and to correct for known biases in maximum likelihood estimates. In addition, we show how the biascorrected likelihood ratio can be partitioned into two independent ratios and derive the sampling distributions for each.
We consider only the conditional likelihood, which is the likelihood function conditioned on the first p values of the process. The conditional likelihood approach is reasonable for large sample sizes and has the advantage that the estimates can be obtained straightforwardly from the method of least squares. Suppose we have a time series of length N_{X} for X_{t}. Then the conditional likelihood function for X_{t} is
(see Eq. A7.4.2b of Box et al., 2008). The maximum likelihood estimates of the parameters, denoted by ${\widehat{\mathit{\varphi}}}_{\mathrm{1}}^{X},\mathrm{\dots},{\widehat{\mathit{\varphi}}}_{p}^{X},{\widehat{\mathit{\gamma}}}^{X}$, are obtained from the method of least squares. The maximum likelihood estimate of ${\mathit{\sigma}}_{X}^{\mathrm{2}}$ is
Substituting this into the likelihood function Eq. (A1) and then taking −2 times the logarithm of the likelihood function gives
Similarly, suppose we have a time series of length N_{Y} for Y_{t}. Then, the corresponding likelihood function is
the maximum likelihood estimate of ${\mathit{\sigma}}_{Y}^{\mathrm{2}}$ is
and −2 times the logarithm of the likelihood function is
Because X_{t} and Y_{t} are independent, the likelihood function for all the data is the product L_{X}L_{Y}.
Under hypothesis H_{0}, the likelihood function has the same form as L_{X}L_{Y}, except there is only a single set of autoregressive parameters ${\mathit{\varphi}}_{\mathrm{1}},\mathrm{\dots},{\mathit{\varphi}}_{p}$ and a single noise variance σ. The corresponding likelihood function is therefore
The maximum likelihood estimates of ${\mathit{\varphi}}_{\mathrm{1}},\mathrm{\dots},{\mathit{\varphi}}_{p}$, denoted ${\widehat{\mathit{\varphi}}}_{\mathrm{1}},\mathrm{\dots},{\widehat{\mathit{\varphi}}}_{p}$, are obtained from the least squares estimates of the pooled sample. Again, these estimates are obtained easily by the method of least squares. The maximum likelihood estimate of the common variance σ^{2} is
The corresponding loglikelihood is
Finally, we compute the likelihood ratio, or equivalently, the difference in the loglikelihood functions. This difference (multiplied by −2) is called the deviance statistic and is
It is well known that maximum likelihood estimates of variance are biased. Accordingly, we define biascorrected deviance statistics before proceeding. This can be done by replacing sample sizes by degrees of freedom. Care must be exercised in such substitutions because replacing MLEs with unbiased versions will yield likelihoods that are no longer maximized, and therefore may yield deviance statistics that are negative. Our goal is to define a biascorrected deviance statistic that is nonnegative
To compute the degrees of freedom, note that the sample size of X_{t} is N_{X}−p, because the first p time steps are excluded from the conditional likelihood, and p+1 parameters are being estimated, so the degrees of freedom is the difference, ${N}_{X}\mathrm{2}p\mathrm{1}$. Similarly, the degrees of freedom for Y_{t} is ${N}_{Y}\mathrm{2}p\mathrm{1}$. Let the degrees of freedom be denoted
Accordingly, an unbiased estimate of ${\mathit{\sigma}}_{X}^{\mathrm{2}}$ is
and an unbiased estimate of ${\mathit{\sigma}}_{Y}^{\mathrm{2}}$ is
As will be shown below, the appropriate bias correction for ${\stackrel{\mathrm{\u203e}}{\stackrel{\mathrm{\u203e}}{\mathit{\sigma}}}}^{\mathrm{2}}$ is
Also, it proves helpful to define the unbiased estimate of the common variance σ^{2} as
Then, the biascorrected deviance statistic Eq. (A9) can be defined as
where
We now prove the D_{σ} and D_{ϕ∣σ} are nonnegative, independent, and have sampling distributions related to the Fdistribution. To show that D_{σ} is nonnegative, note that the ratio in Eq. (A16) is the weighted arithmetic mean over the weighted geometric mean of ${\widehat{\mathit{\sigma}}}_{X}^{\mathrm{2}}$ and ${\widehat{\mathit{\sigma}}}_{Y}^{\mathrm{2}}$. Consequently, we may invoke the AMGM inequality to show that D_{σ} is nonnegative and vanishes if and only if ${\widehat{\mathit{\sigma}}}_{X}={\widehat{\mathit{\sigma}}}_{Y}$. In this sense, D_{σ} measures the “distance” between ${\widehat{\mathit{\sigma}}}_{X}$ and ${\widehat{\mathit{\sigma}}}_{Y}$. Incidentally, had we used the uncorrected likelihoods, the resulting deviance statistic would vanish at ${\stackrel{\mathrm{\u203e}}{\mathit{\sigma}}}_{X}^{\mathrm{2}}={\stackrel{\mathrm{\u203e}}{\mathit{\sigma}}}_{Y}^{\mathrm{2}}$, which would give a biased measure of deviance.
To prove the remaining properties of D_{σ} and D_{ϕ∣σ}, we adopt a vector notation that is better suited to the task. Accordingly, let the AR(p) model Eq. (11) be denoted
where w_{X} is an (N_{X}−p)dimensional vector, Z_{X} is a $({N}_{X}p)\times p$ matrix, j is a (N_{X}−p)dimensional vector of ones, γ_{x} is a scalar, and the remaining terms have been defined previously.
Since H_{0} does not restrict γ_{X}, it proves convenient to eliminate this parameter by projecting onto the complement of j. Therefore, we multiply both sides of the equation by the projection matrix
This multiplication has the effect of eliminating the γ_{X} term and centering each column of w_{X} and Z_{X}. Henceforth, we assume that each column of w_{X} and Z_{X} has been centered. One should remember that the degrees of freedom associated with estimating the noise variance ${\mathit{\sigma}}_{X}^{\mathrm{2}}$ should be reduced by one to account for this precentering of the data. Similarly, the corresponding model for Y_{t} in Eq. (12) is written as
where w_{Y} is an (N_{Y}−p)dimensional vector, and the remaining terms are analogous to those in Eq. (A18). As for X, we assume that each column of w_{Y} and Z_{Y} is centered. The least squares estimates for ϕ_{X} and ϕ_{Y} are
and the corresponding sum square errors are
where $\Vert \mathit{a}\Vert ={\mathit{a}}^{T}\mathit{a}$ denotes the Euclidean norm of vector a. The sum square errors are related to the estimated variances as
Following the standard theory of least squares estimation for the general linear model, Brockwell and Davis (1991; Sect. 8.9) suggest that the sum square errors have the following approximate chisquared distributions:
Because X_{t} and Y_{t} are independent, the associated sum square errors are independent, hence under H_{0}, the statistic
has an Fdistribution with (ν_{X},ν_{Y}) degrees of freedom, as stated in Eq. (20). Furthermore, if H_{0} is true, then
has a (scaled) chisquared distribution with ν_{X}+ν_{Y} degrees of freedom; i.e.,
Straightforward algebra shows that D_{ϕ} in Eq. (A16) can be written as a function of F_{σ} as
This is a Ushaped function with a minimum value of zero at F_{σ}=1 and monotonic on either side of F_{σ}=1. Note that if X_{t} and Y_{t} were swapped, then the X and Y labels would be swapped and ${F}_{\mathit{\sigma}}\to \mathrm{1}/{F}_{\mathit{\sigma}}$, which yields the same value of D_{σ}. Let ${F}_{\mathit{\alpha},{\mathit{\nu}}_{X},{\mathit{\nu}}_{Y}}$ denote the critical value such that $F>{F}_{\mathit{\alpha},{\mathit{\nu}}_{X},{\mathit{\nu}}_{Y}}$ has probability α when F has an Fdistribution with (ν_{X},ν_{Y}) degrees of freedom. Then, the α100 % significance threshold for rejecting H_{0} is
where α on the righthand side is divided by 2.
The least squares estimate of the common regression parameters ϕ is
and the corresponding sum square error is
We now express this in terms of parameter estimates of the individual AR models. To do this, we invoke the standard orthogonality relation
It follows that
The difference between least squares estimates ${\widehat{\mathit{\varphi}}}_{X}$ and $\widehat{\mathit{\varphi}}$ is
while that between ${\widehat{\mathit{\varphi}}}_{Y}$ and $\widehat{\mathit{\varphi}}$ is
Substituting these expressions into Eq. (A36) and invoking Eq. (A28) yields
where
Σ_{HM} is proportional to the harmonic mean of two covariance matrices. This matrix arises naturally if we recall the fact that the sample estimates of the parameters have the following distributions:
Therefore, under H_{0},
hence
Again, by analogy with the standard theory of least squares estimation for the general linear model, SSE_{σ} and the quadratic form in Eq. (A39) are approximately independent of each other. Thus, if H_{0} is true, then
has an approximate Fdistribution with $(p,{\mathit{\nu}}_{X}+{\mathit{\nu}}_{Y})$ degrees of freedom, as indicated in Eq. (21). Using the identity
it follows that
and therefore
It is clear that F_{ϕ∣σ} is nonnegative, and therefore D_{ϕ∣σ} is nonnegative. Furthermore, D_{ϕ∣σ} is a monotonic function of F_{ϕσ}. Therefore, the α100 % significance threshold for rejecting H_{0} is
It remains to define the critical value for rejecting H_{0} based on D_{ϕ,σ}. D_{σ} and D_{ϕ∣σ} are independent because D_{σ} depends on the ratio of ${\mathit{\chi}}_{X}^{\mathrm{2}}$ and ${\mathit{\chi}}_{Y}^{\mathrm{2}}$, while D_{ϕ∣σ} depends on the sum ${\mathit{\chi}}_{X}^{\mathrm{2}}+{\mathit{\chi}}_{Y}^{\mathrm{2}}$, and the ratio and sum are independent as a consequence of Lukacs' proportionsum independence theorem (Lukacs, 1955). Therefore, we may sample from the Fdistributions (20) and (21) and then use Monte Carlo techniques to estimate the upper α100th percentile of ${D}_{\mathit{\varphi},\mathit{\sigma}}={D}_{\mathit{\sigma}}+{D}_{\mathit{\varphi}\mid \mathit{\sigma}}$ under H_{0}.
An R function for performing the test proposed in this paper is available from the authors upon request.
Both authors participated in the writing and editing of the manuscript. TD performed the numerical calculations.
The authors declare that they have no conflict of interest.
We thank Martha Buckley, Robert Lund, and two anonymous reviewers for comments that led to improvements in the presentation of this work. This research was supported primarily by the National Oceanic and Atmospheric Administration (NA16OAR4310175). The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.
This research has been supported by the NOAA (grant no. NA16OAR4310175).
This paper was edited by SeungKi Min and reviewed by two anonymous referees.
Box, G. E. P., Jenkins, G. M., and Reinsel, G. C.: Time Series Analysis: Forecasting and Control, WileyInterscience, 4th Edn., 2008. a, b, c, d
Brockwell, P. J. and Davis, R. A.: Time Series: Theory and Methods, Springer Verlag, 2nd Edn., 1991. a
Brockwell, P. J. and Davis, R. A.: Introduction to Time Series and Forecasting, Springer, 2002. a
Buckley, M. W. and Marshall, J.: Observations, inferences, and mechanisms of the Atlantic Meridional Overturning Circulation: A review, Rev. Geophys., 54, 5–63, https://doi.org/10.1002/2015RG000493, 2016. a
Coates, D. S. and Diggle, P. J.: Test for comparing two estimated spectral densities, J. Time Ser. Anal., 7, 7–20, 1986. a, b
Grant, A. J. and Quinn, B. G.: Parametric Spectral Discrimination, J. Time Ser. Anal., 38, 838–864, https://doi.org/10.1111/jtsa.12238, 2017. a, b, c, d, e, f, g, h, i
Izenman, A. J.: Modern Mutivariate Statistical Techniques: Regression, Classification, and Manifold Learning, Springer, corrected 2nd printing Edn., 2013. a
Jenkins, G. M. and Watts, D. G.: Spectral Analysis and its Applications, HoldenDay, 1968. a
Lee, Y.S.: Some Results on the Sampling Distribution of the Multiple Correlation Coefficient, J. Roy. Stat. Soc. Ser. B, 33, 117–130, 1971. a
Ljung, G. M. and Box, G. E. P.: On a measure of lack of fit in time series models, Biometrika, 65, 297–303, 1978. a
Lukacs, E.: A Characterization of the Gamma Distribution, Ann. Math. Statist., 26, 319–324, https://doi.org/10.1214/aoms/1177728549, 1955. a
Lund, R., Bassily, H., and Vidakovic, B.: Testing Equality of Stationary Autocovariances, J. Time Ser. Anal., 30, 332–348, 2009. a, b, c, d, e, f, g, h, i
Maharaj, E. A.: Cluster of Time Series, J. Classification, 17, 297–314, https://doi.org/10.1007/s003570000023, 2000. a, b, c, d, e, f, g
Piccolo, D.: A Distance Measure for Classifying ARIMA Models, J. Time Ser. Anal., 11, 153–164, https://doi.org/10.1111/j.14679892.1990.tb00048.x, 1990. a, b, c
Taylor, K. E., Stouffer, R. J., and Meehl, G. A.: A summary of the CMIP5 experimental design, available at: https://pcmdi.llnl.gov/mips/cmip5/experiment_design.html (last access: 6 October 2020), 2010. a
Zhang, R., Sutton, R., Danabasoglu, G., Kwon, Y.O., Marsh, R., Yeager, S. G., Amrhein, D. E., and Little, C. M.: A Review of the Role of the Atlantic Meridional Overturning Circulation in Atlantic Multidecadal Variability and Associated Climate Impacts, Rev. Geophys., 57, 316–375, https://doi.org/10.1029/2019RG000644, 2019. a
Zwiers, F. W. and von Storch, H.: Taking Serial Correlation into Account in Tests of the Mean, J. Climate, 8, 336–351, 1995. a
 Abstract
 Introduction
 Previous methods for comparing time series
 Comparing autoregressive models
 Interpretation of differences in AR processes
 Example: diagnosing differences in AMOC simulations
 Conclusions
 Appendix A: Derivation of the test
 Code availability
 Author contributions
 Competing interests
 Acknowledgements
 Financial support
 Review statement
 References
 Abstract
 Introduction
 Previous methods for comparing time series
 Comparing autoregressive models
 Interpretation of differences in AR processes
 Example: diagnosing differences in AMOC simulations
 Conclusions
 Appendix A: Derivation of the test
 Code availability
 Author contributions
 Competing interests
 Acknowledgements
 Financial support
 Review statement
 References