Comparing climate time series – Part 1: Univariate test

This paper proposes a new approach to detecting and describing differences in stationary processes. The approach is equivalent to comparing auto-covariance functions or power spectra. The basic idea is to fit an autoregressive model to each time series and then test whether the model parameters are equal. The likelihood ratio test for this hypothesis has appeared in the statistics literature, but the resulting test depends on maximum likelihood estimates, which are biased, neglect differences in noise parameters, and utilize sampling distributions that are valid only for large sample sizes. This paper derives a likelihood ratio test that corrects for bias, detects differences in noise parameters, and can be applied to small samples. Furthermore, if a significant difference is detected, we propose new methods to diagnose and visualize those differences. Specifically, the test statistic can be used to define a “distance” between two autoregressive processes, which in turn can be used for clustering analysis in multi-model comparisons. A multidimensional scaling technique is used to visualize the similarities and differences between time series. We also propose diagnosing differences in stationary processes by identifying initial conditions that optimally separate predictable responses. The procedure is illustrated by comparing simulations of an Atlantic Meridional Overturning Circulation (AMOC) index from 10 climate models in Phase 5 of the Coupled Model Intercomparison Project (CMIP5). Significant differences between most AMOC time series are detected. The main exceptions are time series from CMIP models from the same institution. Differences in stationary processes are explained primarily by differences in the mean square error of 1-year predictions and by differences in the predictability (i.e., R-square) of the associated autoregressive models.


Introduction
Climate scientists often confront questions of the following types.
1. Does a climate model realistically simulate a climate index?
2. Do two climate models generate similar temporal variability?
3. Did a climate index change its variability?
4. Are two power spectra consistent with each other?
Each of the above questions requires deciding whether two time series come from the same stochastic process. Although numerous papers in the weather and climate literature address questions of the above types, the conclusions often are based on visual comparison of estimated auto-covariance functions or power spectra without a rigorous significance test. Lund et al. (2009) provide a lucid review of some objective tests for deciding whether two time series come from the same stationary process. An additional test that was not considered by Lund et al. (2009) is to fit autoregressive models to the time series and then to test for differences in parameters (Maharaj, 2000; Grant and Quinn, 2017). Grant and Quinn (2017) showed that this test has good power and performs well even when the underlying time series are not from an autoregressive process. The latter test has been applied to such problems as speech recognition but not to climate time series. The purpose of this paper is to further develop this test for climate applications. The particular tests proposed by Maharaj (2000) and Grant and Quinn (2017) test equality of autocorrelation without regard to differences in variances. However, in climate applications, differences in variance often are of considerable importance. In addition, these tests employ a sampling distribution derived from asymptotic theory and therefore may be problematic for small sample sizes. In this paper, the likelihood ratio test is derived for the more restrictive case of equality of noise variances, which leads to considerable simplifications, including a test that is applicable for small sample sizes. Further comments about how our proposed test compares with previous tests are given below.
If the time series are deemed to come from different processes, then it is desirable to characterize those differences in meaningful terms and to group time series into clusters. Piccolo (1990) and Maharaj (2000) propose such classification procedures. Following along these lines, our hypothesis test suggests a natural measure of the distance between two stationary processes that can be used for clustering analysis. For multi-model studies, this distance measure can be used to give a graphical summary of the similarities and differences between time series generated by different models. In addition, we extend the interpretation of such summaries considerably by proposing a new approach to diagnosing differences in stationary processes based on finding the initial condition that optimally separates one-step predictions.

Previous methods for comparing time series
In this section we review previous methods for comparing two time series. These methods are based on the theory of stochastic processes and assume that the joint distribution of the values of the process at any collection of times is multivariate normal and stationary. Although non-stationarity is a prominent feature of many climate time series, a stationary framework is a natural starting point for non-stationary generalizations. Stationarity implies that the expectation at any time is constant and that the second-order moments depend only on the difference in times (more precisely, this is called weak-sense stationarity). Accordingly, if X_t is a stationary process, then the mean µ_X = E[X_t] is independent of time, and the time-lagged auto-covariance function c_X(τ) = cov[X_{t+τ}, X_t] depends only on the difference in times τ. Because a multivariate normal distribution is fully characterized by its first and second moments, the stochastic process is completely specified by µ_X and the auto-covariance function c_X(τ). Stationarity further implies that the auto-covariance function is an even function of the lag τ. Following standard practice, we consider discrete time series whose values are available at N_X equally spaced time steps, X_1, X_2, ..., X_{N_X}. Now consider another stationary process Y_t, with mean µ_Y and auto-covariance function c_Y(τ). The problem we consider is this: given sample time series X_1, ..., X_{N_X} and Y_1, ..., Y_{N_Y}, decide whether the two time series come from the same stationary process. For stationary processes, we often are not concerned with differences in means (e.g., in climate studies, these often are eliminated through "bias corrections"), hence we allow µ_X ≠ µ_Y. Therefore, our problem is equivalent to deciding whether the two time series have the same auto-covariance function, i.e., deciding whether c_X(τ) = c_Y(τ) for all lags τ. This problem can be framed equivalently as deciding whether two stationary processes have the same power spectrum.
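The sample auto-covariance function that underlies such comparisons is easy to compute directly. The following is a minimal sketch in Python with NumPy; the function name and the 1/N normalization convention are our own choices for illustration, not taken from the paper.

```python
import numpy as np

def sample_autocovariance(x, max_lag):
    """Sample auto-covariance estimates c_hat(0), ..., c_hat(max_lag).

    Uses the conventional 1/N normalization at every lag, which
    guarantees a positive semi-definite set of estimates.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.dot(xc[: n - tau], xc[tau:]) / n
                     for tau in range(max_lag + 1)])

# White-noise example: lagged auto-covariances should be near zero
rng = np.random.default_rng(0)
x = rng.standard_normal(500)
c = sample_autocovariance(x, 5)
# c[0] equals the sample variance
```

Note that c[0] is the sample variance, consistent with the zero-lag interpretation used later in the paper.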
Recall that the power spectrum is the Fourier transform of the auto-covariance function. We define the power spectrum of X_t as p_X(ω) = Σ_{τ=−∞}^{∞} c_X(τ) e^{−iωτ}. The spectrum of Y_t is defined similarly and denoted p_Y(ω). Because the Fourier transform is invertible, equality of auto-covariances is equivalent to equality of spectra. Thus, our problem can be framed equivalently as deciding whether p_X(ω) = p_Y(ω) for all frequencies ω. Estimates of the power spectrum are based on the periodogram (Box et al., 2008). The above hypothesis differs from hypotheses about a single process that are commonly tested with auto-covariance functions or power spectra. For instance, the hypothesis of vanishing auto-correlation often is assessed by comparing the sample auto-correlation function to ±1.96/√N, where N is the length of the time series (e.g., Brockwell and Davis, 2002, chap. 1). In the spectral domain, the hypothesis of white noise often is tested with the Kolmogorov-Smirnov test, in which the standardized cumulative periodogram is compared to an appropriate set of lines (e.g., Jenkins and Watts, 1968, Sect. 6.3.2). These tests consider hypotheses about one process. In contrast, our hypothesis involves a comparison of two processes. Coates and Diggle (1986) derived spectral-domain tests for equality of stationary processes. The underlying idea of these tests is that equality of power spectra implies that their ratio is independent of frequency and therefore indistinguishable from white noise. This fact suggests that standard tests for white noise can be adapted to periodogram ratios. Coates and Diggle (1986) also derive a second, parametric test that assumes that the log ratio of power spectra is a low-order polynomial of frequency. Lund et al. (2009) explored the above tests in the context of station data for temperature. They found that spectral methods have relatively low statistical power; that is, the methods are unlikely to detect a difference in stationary processes when such a difference exists.
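The periodogram-ratio idea of Coates and Diggle (1986) can be illustrated with a minimal sketch. Assuming two series of equal length with equal spectra, the ratio of their periodograms should show no systematic frequency dependence; the function below and its normalization are our own illustrative choices, not the paper's.

```python
import numpy as np

def periodogram(x):
    """Periodogram at the positive Fourier frequencies (a minimal sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    spec = np.abs(np.fft.rfft(xc)) ** 2 / n
    return spec[1:]  # drop the zero frequency

# Two independent white-noise series with identical (flat) spectra
rng = np.random.default_rng(1)
x = rng.standard_normal(512)
y = rng.standard_normal(512)
ratio = periodogram(x) / periodogram(y)
# Under equal spectra the ratio has no systematic frequency dependence,
# so a standard white-noise test can be applied to it (or to its
# cumulative version), as Coates and Diggle suggest.
```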
Our own analysis using the data discussed in Sect. 5 is consistent with Lund et al. (2009) (not shown). The fact that spectral-domain tests have less statistical power than time-domain tests is not surprising. After all, spectral tests are based on comparing periodograms, in which the number of unknown parameters grows with sample size. In particular, for a time series of length N, the periodogram combines the sine and cosine coefficients of a Fourier transform into N/2 coefficients for the amplitude. Thus, a time series of length N = 64 yields a periodogram with 32 amplitudes, a time series of length N = 512 yields a periodogram with 256 amplitudes, and so on. Because the number of unknowns grows with sample size, the sampling error of the individual periodogram estimates does not decrease with sample size. Typically, sampling errors are reduced by smoothing periodogram estimates over frequencies, but the smoothing makes implicit assumptions about the shape of the underlying power spectrum. In the absence of hypotheses to constrain the power spectra, the large number of estimated parameters results in low statistical power. Lund et al. (2009) also develop and discuss a time-domain test. This test is based on the sample estimate of the auto-covariance function, ĉ_X(τ) = (1/N_X) Σ_{t=1}^{N_X−τ} (X_{t+τ} − μ̂_X)(X_t − μ̂_X), where μ̂_X is the sample mean of X_t based on sample size N_X. The analogous estimate for the auto-covariance function of Y_t is denoted ĉ_Y(τ). The test is based on the differences in auto-covariance functions up to lag τ_0, that is, on ĉ_X(τ) − ĉ_Y(τ) for τ = 0, 1, ..., τ_0. In the case of equal sample sizes N_X = N_Y = N, Lund et al. (2009) propose a statistic χ² that is a quadratic form in these differences, where the weighting matrix W is an estimate of the covariance matrix between auto-covariance estimates, computed from the pooled auto-covariance estimate ĉ(τ). The reasoning behind this statistic is lucidly discussed in Lund et al. (2009) and follows from standard results in time series analysis (see proposition 7.3.2 in Brockwell and Davis, 1991).
Under the null hypothesis of equal auto-covariance functions, and for large N, the statistic χ² has an approximate chi-square distribution with τ_0 + 1 degrees of freedom.
A key parameter in this statistic is the cutoff parameter K in the sum for W. Lund et al. (2009) propose using K = N^{1/3} but acknowledge that this rule needs further study. We have applied Lund et al.'s test to the numerical examples discussed in Sect. 5 and find that W is not always positive definite. In such cases, χ² is not guaranteed to be positive and therefore does not have a chi-square distribution. The lack of positive definiteness sometimes can be avoided by choosing a slightly different K, but in many of these cases the resulting χ² depends sensitively on K. This sensitivity arises from the fact that W is close to singular, so changing K by one unit can change χ² by more than an order of magnitude. Conceivably, some modification of W or some rule for choosing K could remove this sensitivity, but without such a modification we consider this test unreliable. Further comments are given at the end of Sect. 5.

Comparing autoregressive models
Several authors have proposed approaches to comparing stationary processes based on assuming that the time series come from autoregressive models of order p, denoted AR(p). Accordingly, we consider the two AR(p) models

X_t = φ_1^X X_{t−1} + ... + φ_p^X X_{t−p} + γ_X + ε_t^X,  (11)
Y_t = φ_1^Y Y_{t−1} + ... + φ_p^Y Y_{t−p} + γ_Y + ε_t^Y,  (12)

where the φs are autoregressive parameters, the γs are constants that control the mean, and the ε_t's are independent Gaussian white noise processes with zero mean and variances σ_X² and σ_Y², respectively. By construction, X_t and Y_s are independent for all t and s. Our method can be generalized to handle correlations between X_t and Y_s, specifically by using a vector autoregressive model with coupling between the two processes, but this generalization will not be considered here. For the above models, there exists a one-to-one relation between the first p + 1 auto-covariances c(0), ..., c(p) and the parameters φ_1, ..., φ_p, σ². This relation is expressed through the Yule-Walker equations and a balance equation for variance (Box et al., 2008). Therefore, equality of the first p + 1 auto-covariances is equivalent to equality of the AR(p) parameters. Furthermore, estimates of the parameters depend only on the first p + 1 auto-covariances. The remaining auto-covariances of an AR(p) process, c(p + 1), c(p + 2), ..., are obtained through recursion formulas that depend only on the first p + 1 auto-covariances. Importantly, the auto-covariance function of an AR(p) process does not depend on γ; rather, γ controls the mean of the process. In climate studies, the mean often is removed from time series before any analysis, hence we do not require the means to be equal; i.e., we allow γ_X ≠ γ_Y.
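Fitting an AR(p) model of this form by least squares amounts to regressing X_t on its p lagged values plus a constant (for γ). The helper below is our own illustrative construction, not the paper's code.

```python
import numpy as np

def fit_ar_ls(x, p):
    """Least squares (conditional likelihood) fit of an AR(p) model.

    Returns (phi, gamma, resid): the AR coefficients phi_1..phi_p, the
    intercept gamma, and the residuals (one per usable time step).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Design matrix: column k holds the lag-k values, plus a constant column.
    X = np.column_stack([x[p - k: n - k] for k in range(1, p + 1)]
                        + [np.ones(n - p)])
    target = x[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return beta[:p], beta[p], resid

# Synthetic AR(1) example with phi = 0.6 and gamma = 0
rng = np.random.default_rng(2)
eps = rng.standard_normal(2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.6 * x[t - 1] + eps[t]
phi, gamma, resid = fit_ar_ls(x, 1)
```

For a long synthetic series, the estimated coefficient should be close to the true value of 0.6.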
In previous tests, different authors make different assumptions about the noise variance. Maharaj (2000) allows the noise in the two models to differ in variance and to be correlated at zero lag. Grant and Quinn (2017) allow the noise variances to differ but assume the noises are independent at zero lag. In the climate applications we have in mind, differences in variance are important, hence we are interested in detecting differences in noise variances. Accordingly, our null hypothesis of equivalent stationary processes is

H_0: φ_1^X = φ_1^Y, ..., φ_p^X = φ_p^Y, and σ_X² = σ_Y².

The alternative hypothesis is that at least one of the above parameters differs between the two processes. Constraining processes to be AR(p) means that each process is characterized by p + 1 parameters, in contrast to the unconstrained case in which the auto-covariance function or power spectrum is characterized by an infinite number of parameters. The likelihood ratio test for hypothesis H_0 is derived in the Appendix. The derivation differs from that of Grant and Quinn (2017) primarily by including the condition σ_X = σ_Y in the null hypothesis, which leads to considerable simplifications in estimation and interpretation. Furthermore, we propose a modification to correct for biases and a Monte Carlo technique for determining significance thresholds. We describe the overall procedure here and direct the reader to the Appendix for details. The test relies on sample estimates of the above parameters, so we define these first. The autoregressive parameters in Eq. (11) are estimated using the method of least squares, yielding the least squares estimates φ̂_1^X, ..., φ̂_p^X, γ̂_X. (The least squares method arises when maximizing the conditional likelihood, which is conditional on the specific realization of X_1, ..., X_p in the data. For large sample sizes, the conditional least squares estimates will be close to the familiar maximum likelihood estimates.)
The noise variance for X_t is estimated using the unbiased estimator σ̂_X² = SSE_X/ν_X, where SSE_X is the sum of squared residuals of the fitted AR model and ν_X = N_X − 2p − 1 is the degrees of freedom, computed as the difference between the effective sample size N_X − p and the number of estimated parameters p + 1. This noise variance estimate is merely the residual variance of the AR model and is readily available from standard time series analysis software. Similarly, the method of least squares is used to estimate the AR parameters for Y_t in Eq. (12), yielding the least squares estimates φ̂_1^Y, ..., φ̂_p^Y, γ̂_Y and the noise variance estimate σ̂_Y² = SSE_Y/ν_Y, where ν_Y = N_Y − 2p − 1 and the remaining variables are defined in Table 1. Under the null hypothesis H_0, it is shown in the Appendix that F_σ and F_{φ|σ} have the following approximate distributions:

F_σ ~ F_{ν_X, ν_Y},  (20)
F_{φ|σ} ~ F_{p, ν_X+ν_Y}.  (21)

Furthermore, the two statistics are independent. Critical values for D_σ and D_{φ|σ} are obtained individually from the critical values of the F-distribution, taking care to use the correct one-tailed or two-tailed test (see Appendix, particularly Eq. A31). In principle, the exact sampling distribution of D_{φ,σ} could be derived analytically because D_{φ,σ} is the sum of two random variables with known distributions. However, this analytic calculation is cumbersome, whereas the quantiles of D_σ + D_{φ|σ} can be estimated accurately and quickly by Monte Carlo techniques. Essentially, one draws random samples from F_{ν_X,ν_Y} and F_{p,ν_X+ν_Y}, substitutes these into Eqs. (18)-(19), evaluates D_{φ,σ} in Eq. (17), and then repeats this many times (e.g., 10 000 times). Note that the required number of repetitions does not grow with the sample sizes N_X and N_Y since the F-distributions can be sampled directly. The 5 % significance threshold for D_{φ,σ}, denoted D_crit, is then obtained from the 95th percentile of the Monte Carlo samples. Although a Monte Carlo technique is used to obtain critical values, the test remains applicable for small sample sizes.
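The Monte Carlo recipe for D_crit can be sketched as follows. The deviance components D_σ and D_{φ|σ} are functions of the F draws given by Eqs. (18)-(19) in the Appendix; here they are passed in as callables, and the two lambdas in the example are illustrative placeholders only, NOT the paper's actual formulas.

```python
import numpy as np

def monte_carlo_crit(nu_x, nu_y, p, d_sigma, d_phi,
                     n_draws=10_000, level=0.95, seed=0):
    """Monte Carlo threshold for D_{phi,sigma} = D_sigma + D_{phi|sigma}.

    d_sigma and d_phi map draws of F_{nu_x,nu_y} and F_{p,nu_x+nu_y}
    (the null distributions of F_sigma and F_{phi|sigma}) to the two
    deviance components; their exact forms are in the Appendix.
    """
    rng = np.random.default_rng(seed)
    f1 = rng.f(nu_x, nu_y, size=n_draws)       # draws of F_sigma under H0
    f2 = rng.f(p, nu_x + nu_y, size=n_draws)   # draws of F_{phi|sigma} under H0
    d = d_sigma(f1) + d_phi(f2)
    return np.quantile(d, level)

# Example with nu = N - 2p - 1 = 239 for N = 250, p = 5, and
# placeholder transforms (NOT Eqs. 18-19):
d_crit = monte_carlo_crit(239, 239, 5,
                          d_sigma=lambda f: np.log(f) ** 2,
                          d_phi=lambda f: f)
```

As noted in the text, the cost of this procedure is independent of the sample sizes because only the two F-distributions are sampled.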
The above test assumes that the time series come from an AR(p) process of known order and are excited by Gaussian noise. Grant and Quinn (2017) use Monte Carlo simulations to show that their test is robust to non-Gaussianity. Since our proposed method assumes normal distributions through Eqs. (20) and (21), its robustness to departures from Gaussianity remains an open question and is a topic for future study. If the order of the process is unknown and has to be selected, or if the time series does not come from an autoregressive model (e.g., the process is a moving average process), then the type I error rate does not match its nominal value. In such cases, Grant and Quinn (2017) propose using some sufficiently large value of p, denoted p* and defined in Eq. (22), where ⌊k⌋ denotes the largest integer smaller than or equal to k. The resulting AR(p*) model can capture the first p* + 1 auto-covariances regardless of their true origin. Of course, there will be some loss of power when a test based on p* is applied to a time series that comes from an AR(p) model with p < p*. This criterion seems to work well in our examples, as discussed in more detail in Sect. 5.

Interpretation of differences in AR processes
If a difference in the AR process is detected, then we would like to interpret that difference. After all, such comparisons often are motivated by the validation of climate models, so if a discrepancy is found we would like to describe it in meaningful terms. The fact that the statistic D_{φ,σ} can be expressed as the sum of two independent terms suggests that it is natural to consider the two terms separately. The first term, D_σ, depends only on the difference in the noise variance estimates σ̂_X² and σ̂_Y². The noise variance σ_X² is the one-step prediction error variance of the AR model. After all, if the AR parameters in Eq. (11) were known exactly, then the conditional mean would be S_X = φ_1^X X_{t−1} + ... + φ_p^X X_{t−p} + γ_X, where "S" stands for "signal", and the one-step prediction error would be X_t − S_X = ε_t^X. Comparing the variance of one-step prediction errors is statistically more straightforward than comparing variances of the original time series because prediction errors are approximately uncorrelated, since they are the residuals of the AR model, which acts as a pre-whitening transformation. In contrast, comparing variances of time series is not straightforward because of serial correlation. In practice, the residuals are only approximately uncorrelated because the parameter values are only approximate. The second term, D_{φ|σ}, vanishes when φ_X = φ_Y and is positive otherwise, hence it clearly measures the difference in AR parameters. To further interpret this term, it is helpful to partition the AR model for X_t into two parts: an unpredictable part associated with the noise ε_t^X and a predictable part S_X. The predictable part is the response to the "initial condition" u = (X_{t−1}, ..., X_{t−p})^T. With this notation, Ŝ_X = u^T φ̂_X is the estimated predictable response of X_t. The estimated predictable response of Y_t can be defined analogously and is denoted Ŝ_Y. The initial condition u may be drawn from the stationary distribution of X_t, the stationary distribution of Y_t, or some mixture of the two.
In fact, the covariance matrix Σ_HM defined in Eq. (A40) is proportional to the "harmonic mean" of the sample time-lagged covariance matrices for X_t and Y_t. Suppose the initial condition u is drawn from a distribution with covariance matrix Σ_HM, independently of the noise. If the initial condition for X_t and Y_t is contrived to be the same u, then the expected square difference in predictable responses is E[(Ŝ_X − Ŝ_Y)²] = (φ̂_X − φ̂_Y)^T Σ_HM (φ̂_X − φ̂_Y). Comparing this expression to F_{φ|σ} in Table 1 suggests that F_{φ|σ} is a kind of signal-to-noise ratio, where the "signal" is a difference in predictable responses for the same initial condition. This difference in predictable responses, and hence F_{φ|σ}, can be related to the difference in R-squares of the individual time series. To see this, one expands Eq. (27) and applies the Cauchy-Schwarz inequality. Hence, the above two expressions imply an inequality that bounds the difference in predictable responses from below in terms of the signal-to-noise ratios of the two AR models.

T. DelSole and M. K. Tippett: Comparing time series - Part 1

In the special case of equal noise variances, the above inequality becomes a bound involving SNR_X and SNR_Y, the signal-to-noise ratios of the two AR models (i.e., the ratio of the variance of the predictable part to the noise variance). Note that pF_{φ|σ} is an estimate of the left-hand side of Eq. (31). Recall that R-square is defined as one minus the ratio of the noise variance to the total variance: R² = 1 − σ²/c(0). Because R-square and signal-to-noise ratio are in one-to-one correspondence, inequality Eq. (31) implies the following: if the noise variances are identical, then a large difference in predictabilities (i.e., R-squares) necessarily implies a tendency for F_{φ|σ} to be large. This suggests that it may be informative to compare R-squares. We use the sample estimate R̂², computed from the least squares fit, which is always between 0 and 1. It should be recognized that a difference in R-square is a sufficient, but not necessary, condition for a difference in predictable responses. This fact can be seen from the expression for the R-square of an AR(p) model, which combines all of the φs: two processes may have the same R-square because their combinations of φs yield the same value, even though the specific values of the φs differ. Although a difference in R-squares is not a perfect indicator of F_{φ|σ}, it is at least a sufficient condition and therefore worth examination.
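A sample R-square of an AR fit can be computed directly from the least squares regression; by construction it lies between 0 and 1. The sketch below uses the standard regression estimator 1 − SSE/SST of the lag regression, which may differ in detail from the paper's exact R̂².

```python
import numpy as np

def ar_r_square(x, p):
    """Sample R-square of an AR(p) fit: 1 - SSE/SST of the lag regression.

    Because the regression includes an intercept, this always lies in
    [0, 1] (a sketch, not necessarily the paper's exact estimator).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    X = np.column_stack([x[p - k: n - k] for k in range(1, p + 1)]
                        + [np.ones(n - p)])
    target = x[p:]
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    sse = np.sum((target - X @ beta) ** 2)
    sst = np.sum((target - target.mean()) ** 2)
    return 1.0 - sse / sst

# Synthetic AR(1) with phi = 0.8, for which the population R-square
# is phi^2 = 0.64
rng = np.random.default_rng(3)
eps = rng.standard_normal(3000)
x = np.zeros(3000)
for t in range(1, 3000):
    x[t] = 0.8 * x[t - 1] + eps[t]
r2 = ar_r_square(x, 1)
```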
On the other hand, if the R-squares are the same, a difference in process still could be detected and should be explained. Simply identifying differences in the φs would be unsatisfying because those differences have a complicated relation to the statistical characteristics of the process when p > 1. Accordingly, we propose the following approach, which despite its limitations may still be insightful. Our basic idea is to choose initial conditions that maximize the mean square difference in predictable responses. To be most useful, this choice of initial condition should account for a multi-model comparison over M models. Accordingly, we define the mean square difference between predictable responses over all model pairs; this quantity is a quadratic form u^T A u in the initial condition, where the matrix A is built from the pair-wise differences in the estimated AR parameters. Our goal is to choose the initial condition u that maximizes this quadratic form. The initial condition u must be constrained in some way; otherwise the objective is unbounded and there is no maximum. If the initial conditions are drawn from a distribution with covariance matrix Σ_M, then an appropriate constraint is to fix the associated Mahalanobis distance: u^T Σ_M^{-1} u = 1. The problem is now to maximize u^T A u subject to the constraint u^T Σ_M^{-1} u = 1. This is a standard optimization problem that is merely a generalization of principal component analysis (also called empirical orthogonal function analysis). The solution is obtained by solving the generalized eigenvalue problem A u = λ Σ_M^{-1} u. This eigenvalue problem yields p eigenvalues and p eigenvectors. The eigenvalues can be ordered from largest to smallest, λ_1 ≥ λ_2 ≥ ... ≥ λ_p, and the corresponding eigenvectors can be denoted u_1, u_2, ..., u_p. The first eigenvector u_1 gives the initial condition that maximizes the sum square difference in predictable responses; the second eigenvector u_2 gives the initial condition that maximizes the objective subject to the condition u_2^T Σ_M^{-1} u_1 = 0; and so on. The eigenvalues λ_1, ..., λ_p give the corresponding values of the objective.
The eigenvectors can be collected into the p × p matrix U = [u_1, ..., u_p]. Because the matrices A and Σ_M are symmetric, the eigenvectors can be chosen to satisfy the orthogonality property u_j^T Σ_M^{-1} u_k = δ_jk. Summing the objective over all eigenvectors gives an expression whose final form is similar to Eq. (27). This similarity implies that if only two models are compared and Σ_M = Σ_HM, then the solution exactly matches the variance of signal differences in Eq. (27). Unfortunately, Σ_HM and the noise variance σ̂² depend on the model pair (m, m'), so it is difficult to generalize the optimal initial condition to explain all pair-wise values of F_{φ|σ}. We suggest using for Σ_M the harmonic mean across all M time series, i.e., the inverse of the average of the inverses Σ_m^{-1}, where Σ_m is the sample covariance matrix of the initial condition u (Eq. 25) for the mth time series. Finally, we note that the eigenvalues sum to the total variance of signal differences. Thus, λ_k divided by the sum of all eigenvalues is the fraction of the total variance of signal differences explained by the kth eigenvector. Conceptually, each eigenvector u_k can be interpreted as an initial condition that has been optimized to separate predictable responses.
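The constrained maximization can be solved numerically by whitening. The sketch below solves the generic generalized eigenvalue problem A u = λ B u for symmetric A and symmetric positive definite B (in the notation above, B plays the role of Σ_M^{-1}); the matrices in the example are arbitrary stand-ins, not the paper's A and Σ_M^{-1}.

```python
import numpy as np

def solve_gevp(A, B):
    """Solve A u = lambda B u for symmetric A and SPD B via whitening.

    Returns eigenvalues in decreasing order and eigenvectors (columns
    of U) normalized so that u^T B u = 1.
    """
    w, V = np.linalg.eigh(B)
    B_inv_half = V @ np.diag(w ** -0.5) @ V.T   # symmetric inverse square root
    lam, Q = np.linalg.eigh(B_inv_half @ A @ B_inv_half)
    order = np.argsort(lam)[::-1]               # largest eigenvalue first
    return lam[order], B_inv_half @ Q[:, order]

# Small synthetic example with p = 3
rng = np.random.default_rng(4)
G = rng.standard_normal((3, 3))
A = G @ G.T                              # symmetric "signal-difference" matrix
B = np.eye(3) + 0.1 * np.ones((3, 3))    # SPD constraint matrix
lam, U = solve_gevp(A, B)
```

The leading column of U is the optimized initial condition, and the eigenvalues give the corresponding values of the objective.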

Example: diagnosing differences in AMOC simulations
To illustrate the proposed method, we apply it to an index of the Atlantic Meridional Overturning Circulation (AMOC). This variable is chosen because it is considered to be a major source of decadal variability and predictability (Buckley and Marshall, 2016; Zhang et al., 2019). However, the AMOC is not observed directly with sufficient frequency and consistency to constrain its variability on decadal timescales. As a result, there has been a heavy reliance on coupled atmosphere-ocean models to characterize AMOC variability and predictability. The question arises as to whether simulations of the AMOC by different models can be distinguished and, if so, how they differ. The data used in this study come from pre-industrial control runs from Phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al., 2010). These control simulations lack year-to-year changes in forcing and thereby permit a focus on internal variability without confounding effects due to anthropogenic climate change. We consider only models that have pre-industrial control simulations spanning at least 500 years and that include the meridional overturning circulation as an output variable. Only 10 models meet these criteria. An AMOC index is defined as the annual mean of the maximum meridional overturning streamfunction at 40° N in the Atlantic. A third-order polynomial over the 500 years is removed to eliminate climate drift. The AMOC index from the 10 models is shown in Fig. 1. Based on visual comparisons, one might perceive differences in amplitude (e.g., the MPI models tend to have larger amplitudes than other models) and differences in the degree of persistence (e.g., high-frequency variability is more evident in INMCM4 than in other models), but whether these differences are statistically significant remains to be established.
To compare autoregressive processes, it is necessary to select the order of the autoregressive model. As discussed in Sect. 3, we use Eq. (22), which for ν = 1.0 gives p* = 5. One check on this choice is whether the residuals are white noise. We find that AR(5) is adequate for all models except MRI, in the sense that the residuals reveal no serious departures from white noise according to visual inspection and the Ljung-Box test (Ljung and Box, 1978). Another check is to compute Akaike's information criterion (AIC; Box et al., 2008) for the two halves of the data (i.e., 250 years each). Some representative examples are shown in Fig. 2. As can be seen, AIC is a relatively flat function of model order, hence small sampling variations can lead to large changes in the selected order. Indeed, the order selected by AIC is sensitive to which half of the data is used. Nevertheless, because AIC is nearly flat, virtually any choice of order beyond AR(2) can be defended. The highest order selected by AIC is p = 11. Although we have performed our test with AR(11), illustrating the method at this order would be misleading because one might assume that 11 lags are necessary to identify model differences. Thus, in the results that follow, we choose AR(5) for all cases and bear in mind that comparisons with MRI may be affected by model inadequacy.
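The Ljung-Box whiteness check on the residuals uses the standard portmanteau statistic Q; for residuals of an AR(p) fit, Q is referred to a chi-square distribution with h − p degrees of freedom. A minimal sketch (our own implementation of the standard formula):

```python
import numpy as np

def ljung_box(resid, h):
    """Ljung-Box portmanteau statistic for residual whiteness.

    Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n - k), where rho_k is the
    lag-k sample autocorrelation of the residuals.
    """
    r = np.asarray(resid, dtype=float)
    n = len(r)
    rc = r - r.mean()
    denom = np.dot(rc, rc)
    rho = np.array([np.dot(rc[: n - k], rc[k:]) / denom
                    for k in range(1, h + 1)])
    return n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))

# For genuinely white residuals, Q should be unexceptional relative to
# a chi-square distribution with h degrees of freedom
rng = np.random.default_rng(5)
q = ljung_box(rng.standard_normal(500), 10)
```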
The choice of AR(5) means that all information relevant to deciding differences in stationary processes is contained in the first six values of the sample auto-covariance function, ĉ(0), ..., ĉ(5). The sample auto-covariance function for each AMOC index is shown in Fig. 3. Recall that the zero-lag auto-covariance ĉ(0) is the sample variance. The figure suggests that the time series have different variances and different decay rates, but it is unclear whether these differences are significant. Note that a standard difference-in-variance test cannot be performed here because the time series are serially correlated. One might try to modify the standard F-test by adjusting the degrees of freedom, as is sometimes advocated for the t-test (Zwiers and von Storch, 1995), but the adjustment depends on the very autocorrelation function that we are trying to compare. An alternative approach is to pre-whiten the data based on the AR fit and then test for differences in variance, but this is exactly equivalent to our test based on F_σ. Another approach to comparing stationary processes is to compare power spectra. Power spectra of the AMOC time series are shown in Fig. 4. The spectra have been offset by a multiplicative constant to reduce overlap (otherwise the different curves would obscure each other). While many differences can be seen, the question is whether those differences are significant after accounting for sampling uncertainties. The two spectral tests discussed in Sect. 2 find relatively few differences between CMIP5 models, although they do indicate that time series from INMCM4 and MPI differ from those of other CMIP5 models (not shown). Presumably, these differences arise from the fact that the INMCM4 time series is closer to white noise and that the MPI time series have larger total variance than those from other CMIP5 models.

Figure 5. A measure of the "distance" between AMOC time series between the first and second halves of CMIP5 pre-industrial control simulations. The distance is measured by the bias-corrected deviance statistic D_{φ,σ} using a fifth-order AR model. The light and dark gray shadings show values exceeding the 5 % and 1 % significance levels, respectively (the threshold values are 12.7 and 17.0, respectively).
To illustrate our proposed method, we perform the following analysis. First, the AMOC index from each CMIP5 model is split into equal halves, each 250 years long. Then, each time series from the first half is compared to each time series in the second half. Some of these comparisons will involve time series from the same CMIP5 model. Our expectation is that no difference should be detected when time series from two different halves of the same CMIP5 model are compared. To summarize the comparisons, we show in Fig. 5 a matrix of the bias-corrected deviance statistic D φ,σ for every possible model comparison. This statistic is a measure of the "distance" between stationary processes. The two shades of gray indicate a significant difference in AR process at the 5 % or 1 % level. Values along the diagonal correspond to comparisons of AMOC time series from the same CMIP5 model. No significant difference is detected when the time series come from the same model. In contrast, significant differences are detected in most cases when the time series come from different CMIP5 models. Interestingly, models from the same institution tend to be indistinguishable from each other. For instance, the MPI models are mostly indistinguishable from each other, and the CCSM4 and CESM1-BGC models (from the National Center for Atmospheric Research, NCAR) are mostly indistinguishable from each other. We say "mostly" because the difference depends on which half of the simulation is used in the comparison. For instance, a difference is detected when comparing the first half of the CESM1-BGC time series to the second half of the CCSM4 time series, but not vice versa. The models have been ordered to make natural clusters easy to identify in this figure. Aside from a few exceptions, the pattern of shading is the same for AR(2) and AR(10) models (not shown), demonstrating that our conclusion regarding significant differences in AR process is not sensitive to model order. Also, the pattern of shading is similar if the comparison is based only on the first half of the simulations or only on the second half (not shown), indicating that our results are not sensitive to sampling errors.

Figure 6. A set of points in a two-dimensional Cartesian plane whose pair-wise Euclidean distances most closely approximate the pair-wise distances between autoregressive processes of AMOC time series. The difference between AR processes is measured by the bias-corrected deviance statistic D φ,σ, and the points are identified using multidimensional scaling techniques. There are 20 points corresponding to 10 CMIP5 models, each model time series being split in half. Time series from the same model have the same color and are joined by a line segment. Circles around selected points enclose models whose time series are statistically indistinguishable from that of the center model at the 5 % significance level.
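The text does not specify how the model ordering was chosen; one common way to obtain such an ordering from a distance matrix is hierarchical clustering on the pairwise distances. The following is a minimal sketch, assuming `scipy` is available; `order_for_display` is a hypothetical helper, not part of the paper's method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

def order_for_display(D):
    """Reorder a symmetric distance matrix so that similar series are adjacent.

    D : (n, n) symmetric matrix of pairwise distances (zero diagonal).
    Returns the reordered matrix and the permutation of indices.
    """
    # Convert the square matrix to condensed form and apply average linkage.
    Z = linkage(squareform(D, checks=False), method="average")
    idx = leaves_list(Z)                      # leaf order groups clusters together
    return D[np.ix_(idx, idx)], idx
```

With this ordering, blocks of small distances (e.g., models from the same institution) appear as contiguous blocks along the diagonal of the displayed matrix.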
To visualize natural clusters more readily, we identify a set of points in a Cartesian plane whose pair-wise distances match the above distances as closely as possible. These points can be identified using a procedure called multidimensional scaling (Izenman, 2013). The procedure is to compute the deviance statistic between every possible pair of time series. Because there are 10 CMIP5 models and the time series from each model is split in half, there are 20 time series being compared. Thus, the deviance statistic for every possible pair can be summarized in a 20 × 20 distance matrix. From this matrix, multidimensional scaling can find the set of points in a 20-dimensional space that has this distance matrix. Moreover, it can find the points in a two-dimensional space whose distance matrix most closely approximates the original distance matrix in some objective sense. In our case, 87 % of the original distance matrix can be represented by two-dimensional points. These points are shown in Fig. 6. Although 87 % of the distance matrix is represented in this figure, isolated points may have relatively large discrepancies. The average discrepancy is about 2 units, with the largest discrepancy being about 3.6 units, between MRI and MPI-LR. An attractive feature of this representation is that the decision rule for statistical significance, D φ,σ > D crit, is approximately equivalent to drawing a circle of radius √D crit around a point and identifying all the points that lie outside that circle. This geometric procedure would be exact in the 20-dimensional space but is only approximate in the 2-dimensional space shown in Fig. 6. The figure suggests that the MPI models and INMCM4 form their own natural clusters. For other models, a particular choice of clusters is shown in the figure, but this is merely an example, and different clusters with different groupings could be justified.
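The embedding step above can be sketched with classical (Torgerson) multidimensional scaling, one standard implementation of the technique; the paper cites Izenman (2013) and may use a different variant. `classical_mds` is a hypothetical name, and the "explained" fraction here is the eigenvalue ratio, one common way to quantify how much of the distance structure survives the projection:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) multidimensional scaling.

    D : (n, n) symmetric matrix of pairwise distances.
    Returns k-dimensional coordinates and the fraction of the
    squared-distance structure captured by the k leading axes.
    """
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1]           # largest eigenvalues first
    eigval, eigvec = eigval[order], eigvec[:, order]
    pos = np.clip(eigval[:k], 0.0, None)       # guard against tiny negatives
    coords = eigvec[:, :k] * np.sqrt(pos)
    explained = pos.sum() / np.clip(eigval, 0.0, None).sum()
    return coords, explained
```

When the distances are exactly Euclidean distances of a k-dimensional configuration, the procedure recovers that configuration (up to rotation and reflection) and the explained fraction is 1; for the deviance-based distances the 2-dimensional embedding is only approximate, as the 87 % figure above indicates.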
Note that all line segments connecting results from the same CMIP model are shorter than the circle radius, indicating no significant difference between time series from the same CMIP model.
It is interesting to relate the above differences to the auto-covariance functions shown in Fig. 3. INMCM4 likely differs from the other models because its auto-covariance function decays most quickly to zero. The MPI models are distinguished from the others by their large variances. Interestingly, note that the auto-covariance function for NorESM1-M is intermediate between those of the two MPI models, yet the test indicates that NorESM1-M differs from the MPI models. Presumably, the MPI models are indistinguishable from each other because their auto-covariance functions have the same shape, including a kink at lag 1, whereas the NorESM1-M model has no kink at lag 1. This example illustrates that the test does not behave like a mean square difference between auto-covariance functions, which would cluster NorESM1-M with the MPI models.
It is worth clarifying how the above clustering technique differs from those in previous studies. Piccolo (1990) proposed a distance measure based on the Euclidean norm of the difference in AR parameters, namely

d(X, Y) = [ (φ X,1 − φ Y,1)² + · · · + (φ X,p − φ Y,p)² ]^{1/2}.    (45)

Figure 7. Noise variance versus the R-square of AR(5) models estimated from 250-year segments of AMOC time series from 10 CMIP5 models. Estimates from the same CMIP5 model have the same color and are joined by a line segment. The error bar in the bottom left corner shows the critical distance for a significant difference in noise variance at the 5 % level. The x axis is on a log scale so that equal variance ratios correspond to equal linear distances. The error bar in the upper left shows a 95 % confidence interval for R-square using the method of Lee (1971) and computed from the MBESS package in R. The dashed line is the 5 % significance threshold for the R-square.
In contrast, our hypothesis test uses a Mahalanobis norm for measuring differences in AR parameters, where the covariance matrix is based on the sample covariance matrices of the time-lagged data (see Eq. A44). While the Euclidean norm Eq. (45) does have some attractive properties, as discussed in Piccolo (1990), it is inconsistent with the corresponding hypothesis test for differences in AR parameters. As a result, the resulting clusters may emphasize differences with large Euclidean norms that are insignificant, or may downplay differences with small Euclidean norms that are significant. In contrast, the distance measure used in our study is consistent with a rigorous hypothesis test. Maharaj (2000) proposes a classification procedure that is consistent with hypothesis testing, but that hypothesis test does not account for differences in noise variances.

Having identified differences between stationary processes, it is of interest to relate those differences to standard properties of the time series. Recall that our measure of the difference between stationary processes, D φ,σ, is the sum of two other measures, namely D σ and D φ|σ. Measure D σ depends only on the noise variances; that is, it depends on the mean square error of a 1-year prediction. In contrast, D φ|σ measures the difference in predictable responses of the process. As discussed in Sect. 4, we suggest examining differences in R-square. A graph of the noise variance plotted against the R-square of each autoregressive model is shown in Fig. 7.

Figure 8. Predictions from each estimated AR(5) model using the optimal initial condition derived from Eq. (39). The optimal initial condition is the five black dots joined by lines, and the resulting predictions are the colored lines. Each CMIP5 model has two predictions corresponding to the two AR(5) models estimated from two non-overlapping 250-year time series from the same CMIP5 model.
The error bars show the critical distance for a significant difference in noise variance (lower left) and for a difference in R-square (upper right). First, note that the projections of the line segments onto the x axis or y axis are shorter than the respective error bars, indicating that differences in noise and differences in predictability are insignificant when estimated from the same CMIP5 model. Second, note that the relative positions of the dots are similar to those in Fig. 6. Thus, the noise variance and R-square appear to be approximate "principal coordinates" for distinguishing univariate autoregressive processes. Third, using noise variances alone, the MPI models would be grouped together, then INMCM4 would be grouped by itself, while at the bottom end CanESM2 and CCSM4 would be grouped together. These groupings are consistent with the clusters identified above using the full distance measure D φ,σ, suggesting that differences in noise variances explain the major differences between stationary processes. Fourth, the AR models have R-square values mostly between 0.25 and 0.5. Time series from INMCM4 have the smallest R-square values, while time series from NorESM1 have the largest R-square values.
An alternative approach to describing differences in AR parameters is to show differences in response to the same initial condition. We use the optimization method discussed in Sect. 4 to find the initial condition that maximizes the sum square difference in responses. The result is shown in Fig. 8. Essentially, the models separate by predictability: the order of the models from top to bottom closely tracks the order of the models based on R-square seen in Fig. 7. For this initial condition, INMCM4 damps nearly to zero in one time step, whereas NorESM1-M decays the slowest among the models, consistent with expectations from R-square. This initial condition explains 82 % of the differences in response. Because optimal initial conditions form a complete set, an arbitrary initial condition can be represented as a linear combination of optimal initial conditions. If the AR models were used to predict an observational time series with covariance matrix M, then most differences between model predictions could be explained by the differences in response to a single optimal initial condition.
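The optimization just described can be sketched as follows, under the simplifying assumption of a unit-norm constraint on the initial condition (the paper's Eq. (39), not reproduced here, may instead weight by the covariance matrix M). The deterministic responses of each AR(p) model are propagated with companion matrices, and the optimal initial condition is then the leading right singular vector of the stacked response differences; `companion` and `optimal_initial_condition` are hypothetical names:

```python
import numpy as np

def companion(phi):
    """Companion matrix of an AR(p) model x_t = phi_1 x_{t-1} + ... + phi_p x_{t-p}."""
    p = len(phi)
    A = np.zeros((p, p))
    A[0, :] = phi
    A[1:, :-1] = np.eye(p - 1)
    return A

def optimal_initial_condition(phi_x, phi_y, horizon=20):
    """Unit-norm initial condition maximizing the sum square difference
    between the two models' deterministic responses over `horizon` steps."""
    Ax, Ay = companion(phi_x), companion(phi_y)
    p = len(phi_x)
    Px, Py = np.eye(p), np.eye(p)
    rows = []
    for _ in range(horizon):
        Px, Py = Ax @ Px, Ay @ Py
        rows.append(Px[0, :] - Py[0, :])   # response difference at this lead time
    G = np.vstack(rows)
    _, s, vt = np.linalg.svd(G)            # leading right singular vector solves
    return vt[0], s[0] ** 2                # max ||G x||^2 subject to ||x|| = 1
```

Subsequent singular vectors give the remaining optimal initial conditions, which together form the complete set mentioned above.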
We end with a brief summary of our exploration of the χ² statistic proposed by Lund et al. (2009). As mentioned earlier, the result is sensitive to the choice of K. However, instead of summing based on the cutoff parameter K, we summed over all possible lags and set all sample auto-covariances beyond lag p to zero. Using this approach, we find that the resulting χ² statistic gives results very similar to ours for p = 5, including clustering the MPI models with each other and separating them from NorESM1-M. We have not investigated this procedure sufficiently to propose it as a general rule but mention it to suggest the possibility that some alternative rule for computing the sum Eq. (9) may yield reasonable results.

Conclusions
This paper examined tests for differences in stationary processes and proposed a new approach to characterizing those differences. The basic idea is to fit an autoregressive model to each time series and then test for differences in parameters. The likelihood ratio test for this comparison was derived in Maharaj (2000) and Grant and Quinn (2017). We have modified the test to correct for certain biases and to include a test for differences in noise variance. The latter test is of major interest in climate applications and leads to considerable simplifications in estimation and interpretation. Furthermore, the proposed test is applicable to small samples. In addition, we propose new approaches to interpreting and visualizing differences in stationary processes. The procedure was illustrated on AMOC time series from pre-industrial control simulations of 10 models in the CMIP5 archive. Based on time series 250 years in length, the procedure was able to distinguish about four clusters of models, where time series from the same CMIP5 model are grouped together in the same cluster. These clusters are identified easily using a multidimensional scaling technique. The clusters obtained by this method were not sensitive to the order of the autoregressive model, although the number of significant differences decreases with AR order due to the larger number of parameters being estimated. Further analysis shows that these clusters can be explained largely by differences in 1-year prediction errors in the AR models and differences in R-square.
The proposed method can be used to compare any stationary time series that are well fit by an autoregressive model, which includes most climate time series. Thus, this method could be used to decide whether two climate models generate the same temporal variability. A natural question is whether this approach can be generalized to compare multivariate time series. This generalization will be developed in Part 2 of this paper. The method also could be used to compare model simulations to observations, provided that the stationarity assumption is satisfied. If non-stationarity is strong, then this method would need to be modified to account for such non-stationarity, such as adding exogenous terms to the AR model. The likelihood-ratio framework can accommodate such extensions and will be developed in Part 3 of this paper.
Substituting this into the likelihood function Eq. (A1) and then taking −2 times the logarithm of the likelihood function gives

−2 log L X = (N X − p) (log σ̂² X + log 2π + 1).
Similarly, suppose we have a time series of length N Y for Y t. The corresponding likelihood function has the same form as Eq. (A1), and −2 times its logarithm is

−2 log L Y = (N Y − p) (log σ̂² Y + log 2π + 1).
Because X t and Y t are independent, the likelihood function for all the data is the product L X L Y. Under hypothesis H 0, the likelihood function has the same form as L X L Y, except there is only a single set of autoregressive parameters φ 1, …, φ p and a single noise variance σ². The maximum likelihood estimates of φ 1, …, φ p, denoted φ̂ 1, …, φ̂ p, are obtained easily by applying the method of least squares to the pooled sample. The maximum likelihood estimate of the common variance σ² is the sum square error of this pooled least squares fit divided by N X + N Y − 2p, and the corresponding log-likelihood is

−2 log L σ,φ = (N X + N Y − 2p) (log σ̂² + log 2π + 1).

Finally, we compute the likelihood ratio or, equivalently, the difference in the log-likelihood functions. This difference (multiplied by −2) is called the deviance statistic and is

D σ,φ = 2 log L X + 2 log L Y − 2 log L σ,φ = (N X + N Y − 2p) log σ̂² − (N X − p) log σ̂² X − (N Y − p) log σ̂² Y.

It is well known that maximum likelihood estimates of variance are biased. Accordingly, we define bias-corrected deviance statistics before proceeding. This can be done by replacing sample sizes by degrees of freedom. Care must be exercised in such substitutions because replacing MLEs with unbiased versions will yield likelihoods that are no longer maximized and therefore may yield deviance statistics that are negative. Our goal is to define a bias-corrected deviance statistic that is non-negative. To compute the degrees of freedom, note that the effective sample size of X t is N X − p, because the first p time steps are excluded from the conditional likelihood, and p + 1 parameters are being estimated, so the degrees of freedom is the difference, N X − 2p − 1. Similarly, the degrees of freedom for Y t is N Y − 2p − 1. Let the degrees of freedom be denoted ν X = N X − 2p − 1 and ν Y = N Y − 2p − 1.
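The derivation above can be checked numerically. The following is a minimal sketch of the (uncorrected, MLE-based) deviance statistic, assuming conditional least-squares AR fitting with numpy; `fit_ar` and `deviance` are hypothetical helper names, and the bias correction via degrees of freedom is not applied here:

```python
import numpy as np

def _design(x, p):
    """Response vector and lagged-predictor matrix for a conditional AR(p) fit."""
    y = x[p:]
    X = np.column_stack([x[p - j: len(x) - j] for j in range(1, p + 1)])
    return y, X

def fit_ar(x, p):
    """Least-squares (conditional ML) AR(p) fit; returns coefficients, SSE, sample size."""
    y, X = _design(x, p)
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ phi) ** 2)
    return phi, sse, len(y)

def deviance(x1, x2, p):
    """Deviance D_{sigma,phi} for H0: common AR(p) parameters and noise variance."""
    _, sse1, m1 = fit_ar(x1, p)
    _, sse2, m2 = fit_ar(x2, p)
    # Pooled fit: one set of AR parameters for both series.
    y1, X1 = _design(x1, p)
    y2, X2 = _design(x2, p)
    y0, X0 = np.concatenate([y1, y2]), np.vstack([X1, X2])
    phi0, *_ = np.linalg.lstsq(X0, y0, rcond=None)
    sse0 = np.sum((y0 - X0 @ phi0) ** 2)
    m = m1 + m2
    return m * np.log(sse0 / m) - m1 * np.log(sse1 / m1) - m2 * np.log(sse2 / m2)
```

Because the separate fits maximize each likelihood while the pooled fit is constrained, the statistic is non-negative, and it is zero when the two series are identical.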
Accordingly, an unbiased estimate of σ² X is SSE X / ν X, where SSE X is the sum square error of the least squares fit to X t; similarly for σ² Y. Standard results (Sect. 8.9) suggest that the sum square errors have the following approximate chi-squared distributions:

SSE X / σ² X ∼ χ² ν X  and  SSE Y / σ² Y ∼ χ² ν Y.

Because X t and Y t are independent, the associated sum square errors are independent; hence, under H 0, the statistic

F σ = (SSE X / ν X) / (SSE Y / ν Y)

has an F-distribution with (ν X, ν Y) degrees of freedom, as stated in Eq. (20). Furthermore, if H 0 is true, then SSE σ = SSE X + SSE Y has a (scaled) chi-squared distribution with ν X + ν Y degrees of freedom; i.e., SSE σ / σ² ∼ χ² ν X +ν Y.
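The variance-ratio step can be sketched as follows, assuming `scipy` is available; `variance_ratio_test` is a hypothetical name, and the two-sided p value convention used here (doubling the smaller tail) is one common choice, not necessarily the paper's:

```python
from scipy import stats

def variance_ratio_test(sse_x, nu_x, sse_y, nu_y):
    """F test for equal noise variances: F = (SSE_X/nu_X) / (SSE_Y/nu_Y).

    Returns the F statistic and a two-sided p value under
    F ~ F(nu_x, nu_y) when H0 (equal variances) is true.
    """
    f = (sse_x / nu_x) / (sse_y / nu_y)
    p_one = stats.f.sf(f, nu_x, nu_y)        # upper-tail probability
    return f, 2.0 * min(p_one, 1.0 - p_one)  # double the smaller tail
```

When the two unbiased variance estimates are equal, the statistic is 1 and the two-sided p value is 1, so no difference in noise variance is declared.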
Straightforward algebra shows that D φ in Eq. (A16) can be written as a function of F σ as