This paper develops a method for determining whether two vector time series originate from a common stochastic process. The stochastic process considered incorporates both serial correlations and multivariate annual cycles. Specifically, the process is modeled as a vector autoregressive model with periodic forcing, referred to as a VARX model (where X stands for exogenous variables). The hypothesis that two VARX models share the same parameters is tested using the likelihood ratio method. The resulting test can be further decomposed into a series of tests to assess whether disparities in the VARX models stem from differences in noise parameters, autoregressive parameters, or annual cycle parameters. A comprehensive procedure for compressing discrepancies between VARX models into a minimal number of components is developed based on discriminant analysis. Using this method, the realism of climate model simulations of monthly mean North Atlantic sea surface temperatures is assessed. As expected, different simulations from the same climate model cannot be distinguished stochastically. Similarly, observations from different periods cannot be distinguished. However, every climate model differs stochastically from observations. Furthermore, each climate model differs stochastically from every other model, except when they originate from the same center. In essence, each climate model possesses a distinct fingerprint that sets it apart stochastically from both observations and models developed by other research centers. The primary factor contributing to these differences is the difference in annual cycles. The difference in annual cycles is often dominated by a single component, which can be extracted and illustrated using discriminant analysis.

Two fundamental questions arise repeatedly in climate science. (1) Has climate variability changed over time? (2) Do climate models accurately reflect reality? Answering these questions requires an objective procedure for deciding whether two time series possess identical statistical properties. Unfortunately, many procedures for deciding this have crucial limitations. Specifically, many of these procedures lack a significance test, do not account for serial correlation, or do not generalize naturally to multivariate quantities. For instance, the recent report from the Intergovernmental Panel on Climate Change frequently employed a quantity called RMSD

The above limitations of RMSDs are also present in correlation skill and probabilistic verification measures. Although recent advancements in machine learning, as highlighted in works by

We have pursued an approach to comparing time series that avoids the above limitations. Specifically, we assume that each time series is generated by an autoregressive model. Under this assumption, two time series are said to originate from the same process if they share identical autoregressive model parameters. This paper represents the fifth installment in a series that develops this methodology. Part 1 laid the foundation for our approach to comparing univariate time series. Part 2 generalized this framework to multivariate processes, incorporating bias corrections to obtain statistics with chi-squared distributions. Part 3 developed methods to diagnose dissimilarities between two time series. This included a hierarchical testing procedure to attribute differences to specific components of the autoregressive model and a discriminant analysis technique to derive optimal diagnostics that contain all relevant information about differences between data sets. Part 4 incorporated annual cycles in the framework, which required generalizing the hierarchical testing procedure to include nonstationary forcing. However, this extension was limited to univariate time series.

In this paper, we generalize our test to account for multivariate periodic signals in the time series. To do this, we add periodic forcing to a vector autoregressive model to construct a VARX model, where X stands for exogenous variables such as periodic forcing. Then, the hypothesis that two VARX models share the same parameters is tested using the likelihood ratio method. If differences are detected, a stepwise procedure is employed to assess whether disparities in the VARX models stem from differences in noise parameters, autoregressive parameters, or periodic forcing parameters. To implement the test, we introduce three practical advances. First, we adjust maximum likelihood quantities to eliminate the biases mentioned earlier, which are particularly serious in multivariate problems. Second, we develop a Monte Carlo technique to determine significance thresholds. Third, diagnostics that maximally compress differences in VARX models into the fewest number of components are developed. Some of these diagnostics were developed in

Our problem is to decide whether two multivariate time series originated from the same stochastic process. Let the two time series be denoted as

A class of stochastic models that can capture multivariate serial correlations is a vector autoregressive (VAR) model. VAR models include linear inverse models (LIMs) as a special case, where LIMs are used extensively in seasonal and decadal prediction studies

To capture variations in the mean, such as annual cycles, we add deterministic forcing to the VAR model. The resulting model is called a VAR model with exogenous variables and is denoted as VARX

Mathematically, we assume each vector is generated by a VARX model of order

In this paper, the deterministic term

To decide whether two multivariate time series came from the same stochastic process, we test the null hypothesis that the parameters of the two VARX(

If

Hypothesis

If

Summary of hypotheses in the stepwise test procedure for VARX(

The statistic for testing

The deviance for testing the equality of all VARX parameters is the sum of the sub-deviances:

The above procedure might attribute differences to a single part of the VARX model, but that part still involves many parameters, which hinders interpretation. To further isolate VARX model differences, we seek the linear combination of variables that maximizes the appropriate sub-deviance

In the case of

To diagnose differences in annual cycle parameters, the above decomposition is not very meaningful because the predictor is a fixed function of time (e.g., sinusoidal functions of time) rather than a random vector. In this case, we simply propagate parameter differences into the time domain by multiplying by the associated predictors

In this section, we apply the above method to compare the variability of monthly mean North Atlantic sea surface temperature (SST) between dynamical models and observations. We analyze the Atlantic basin over 0–60

For model simulations, we use data from Phase 5 of the Coupled Model Intercomparison Project

In general, the above time series contain secular trends: observations contain a global warming signal and control runs contain model drift. We remove these secular trends by regressing out a polynomial in time. It is important to remove the same-order polynomial from both observations and models; otherwise, a difference could arise simply as an artifact of processing the two time series differently. It is known that greenhouse gas concentrations rise exponentially over our analysis period (1969–2018), so if we were to remove only a linear trend and then find a difference, that difference could be attributed to quadratic growth in the observations that is missing from control runs. In general, leaving any kind of forced variability in the time series leads to problems of interpretation since our VARX model does not account for secular forcing. On the other hand, over-removal of low-frequency internal variability poses a lesser issue. While the resulting analysis would not relate to low-frequency variability (since it was removed), the conclusions regarding higher-frequency variability would still retain their validity. Hence, it is generally preferable to err on the side of removing a higher-order polynomial than a lower-order one. For the results presented in the figures below, a second-order polynomial in time was removed over the period 1969–2018. Removing third-, fourth-, or higher-order polynomials removes additional low-frequency variability from the time series but does not alter any of our main conclusions regarding significant differences between observations and CMIP5 models.
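As a minimal sketch of this detrending step, the following Python code regresses out a polynomial in time by ordinary least squares and returns the residuals. The function names, the rescaling of time to [-1, 1], and the synthetic input series are assumptions made for illustration; they are not taken from the authors' R codes.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def remove_polynomial(series, order=2):
    """Return the residuals after regressing out a polynomial in time."""
    n = len(series)
    # Rescale time to [-1, 1] so the normal equations stay well conditioned.
    t = [2.0 * i / (n - 1) - 1.0 for i in range(n)]
    design = [[ti ** p for p in range(order + 1)] for ti in t]
    k = order + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, series)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    return [y - sum(c * x for c, x in zip(coef, row)) for row, y in zip(design, series)]


# Synthetic (made-up) monthly series: 50 years with a quadratic trend plus an annual cycle.
months = 600
with_trend = [0.5 + 0.01 * i + 1e-5 * i * i + math.sin(2.0 * math.pi * i / 12.0)
              for i in range(months)]

# The same polynomial order is applied to every series being compared.
detrended = remove_polynomial(with_trend, order=2)
```

Calling `remove_polynomial` with the same `order` for observations and for each model segment mirrors the requirement that both be processed identically. The sketch avoids third-party libraries only to stay self-contained; a linear algebra library would normally handle the least-squares fit.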

For the annual cycle term, we include five annual harmonics encompassing all harmonics up to the Nyquist frequency. This choice is motivated by the same reasoning as discussed above for secular trends, i.e., that it is preferable to include more harmonics rather than fewer. If we were to select an insufficient number of harmonics, then unaccounted-for periodic signals would be misattributed as internal variability in the VARX model, leading to erroneous conclusions. On the other hand, if an excessive number of harmonics is chosen, the VARX model may become overfitted, but this potential overfitting is accounted for in the sampling distribution. By definition, overfitting implies the inclusion of predictors with vanishing regression coefficients, but the sampling distribution is independent of the specific values of the regression coefficients and therefore encompasses cases where the coefficients are zero. The primary drawback of overfitting is a reduction in statistical power. In our application, however, low statistical power is not a concern: our method is highly effective at detecting differences when they truly exist.
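The periodic predictors described above can be sketched as follows, assuming monthly data (a period of 12 time steps) and one cosine and one sine column per harmonic. The function name and the column ordering are illustrative assumptions, not the layout used in the authors' codes.

```python
import math


def annual_harmonics(n_months, n_harmonics=5, period=12):
    """Build rows of [cos, sin] pairs for harmonics 1..n_harmonics of the annual cycle."""
    rows = []
    for t in range(n_months):
        row = []
        for k in range(1, n_harmonics + 1):
            angle = 2.0 * math.pi * k * t / period
            row.extend([math.cos(angle), math.sin(angle)])
        rows.append(row)
    return rows


# Predictors for a 25-year monthly segment: 300 rows and 10 columns.
predictors = annual_harmonics(300)
```

Each row repeats exactly every 12 months, so these columns act as fixed functions of time in the exogenous part of the VARX model rather than as random predictors.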

Laplacian eigenvectors 1, 2, 3, and 4 over the North Atlantic between the Equator and 60

Monthly time series of Laplacians 1–4 in observations (black curve at the bottom labeled “37”) and in CMIP5 models over a 10-year period. Each time series is offset by the same constant. No re-scaling is performed. Each time series is computed by projecting data onto the appropriate Laplacian over the Atlantic domain. The year on the

The total deviance between ERSSTv5 1994–2018 and each CMIP5 model. The horizontal gray line shows the 1 % significance threshold. Also shown is the deviance between ERSSTv5 for the two periods 1969–1993 and 1994–2018 (first x-tick mark on the left).

Same as Fig.

The total deviance between each 25-year segment (from CMIP5 models and observations) and an independent 25-year segment (from CMIP5 models and observations). The deviance is normalized by the 1 % significance threshold. Values that are insignificant, significant at the 1 % level, and significant at the 0.2 % level are indicated by no shading, light gray shading, and dark gray shading, respectively.

A dendrogram based on the total deviance of monthly mean North Atlantic SST. The

Decomposition of each deviance in Fig.

Decomposition of annual cycle deviance by discriminant analysis. The horizontal gray line is the 1 % significance threshold for the maximum deviance, computed as described at the end of Sect.

The spatial pattern

Same as Fig.

Variance ratios from discriminant analysis of noise variance, where the ratio is CMIP5 model noise variance over observation noise variance. The two horizontal gray lines indicate the upper and lower 0.5 % significance levels.

Same as Fig.

Decomposed AR deviance from discriminant analysis. The horizontal gray line indicates the 1 % significance threshold. Models on the

The optimal initial condition of VARX(2) models

As in many climate applications, the spatial dimension in our data set exceeds the time dimension, leading to an underdetermined estimation problem. Although several regularization approaches are available, many of these lack a rigorous hypothesis-testing framework. Here, we regularize the problem by reducing the spatial dimension, which retains the regression model framework. The question arises as to which low-dimensional space should be selected. Our choice is guided by the fact that numerical solutions become less reliable as the spatial scale decreases, with the least reliable results at the grid-point scale. These considerations suggest that models are most reliable on the largest spatial scales, and therefore a feature space should be chosen to emphasize large spatial scales. A common approach is to use empirical orthogonal functions (EOFs), but EOFs depend on data and therefore raise the question as to which data should be used to derive them. Also, there is no guarantee that the EOFs will be strictly large scale. Furthermore, because EOFs depend on data, their use leads to biases and random fluctuations that are not straightforward to take into account in the final statistical estimate. An attractive alternative basis set that avoids these issues and satisfies the above requirement consists of the leading eigenvectors of Laplace's equation. These vectors form an orthogonal set of spatial patterns ordered by decreasing spatial scale. Familiar examples of Laplacian eigenvectors include Fourier series and spherical harmonics. The algorithm of

How many Laplacian eigenvectors should be chosen? In this study, our goal is to compare variability, not to make predictions. Accordingly, our primary concern is to ensure that the VARX(

For reference, 10-year segments of time series for the first four Laplacian eigenvectors are shown in Fig.

The total deviance between ERSSTv5 1994–2018 and each CMIP5 model is shown in Fig.

The above examples are based on using 25 years of data for both climate models and observations. Adding more data merely makes the differences detected here even more significant. An interesting question is whether differences can be detected using shorter samples from climate models. Recomputing deviances using only 3 years of data (but still using 25 years of observational data) yields the results shown in Fig.

Two questions arise naturally here. First, do CMIP5 models differ from observations in a common way? This question can be addressed by comparing one model to another – a small deviance would indicate that the two models are similar and therefore differ from observations in a common way. Second, would the test correctly indicate that two time series from the same CMIP5 model are stochastically similar? Intuitively, data generated by the same model ought to be stochastically similar. However, this outcome is not assured. For instance, CMIP5 models are nonlinear and high-dimensional. There is no guarantee that variability from such models can be captured by a low-dimensional linear model. Also, our test assumes that sample sizes are sufficiently large to invoke a linear regression framework for testing hypotheses. Twenty-five years of data might not satisfy this requirement. These latter questions can be addressed by confirming that independent segments from the same CMIP5 model are stochastically indistinguishable. Both questions can be addressed by comparing one 25-year segment with a separate 25-year segment for all possible pairs of CMIP5 models and observations. The result of comparing all possible pairs is summarized in the matrix shown in Fig.

As can be seen, values along the diagonal of this matrix are insignificant. Diagonal elements correspond to comparing time series from the same model or from the same observational data set. Thus, this test indicates that time series from the same source are stochastically indistinguishable, confirming that the test performs as expected. Additional insignificant deviances are found in diagonal blocks. For instance, the

Although all the models are different, some models are more similar to each other than to others. To identify clusters, we compute a dendrogram from these deviances. The dendrogram is computed in the following way. First, each element is assigned to its own cluster. Next, the pair with the smallest deviance is clustered together using a “leaf” whose edge aligns with the deviance indicated on the

Remarkably, the majority of model names on the left-hand side of the dendrogram are grouped into consecutive pairs, indicating a consistent pairing of time series originating from the same source. This suggests that each CMIP5 model simulation possesses a distinct “fingerprint” that can be quantified using the deviance measure. However, there are a few exceptions to this pattern, namely HadGEM2-CC, FASTCHEM, and CCSM4. These exceptions are paired with other models from the same modeling center, supporting the earlier conclusion that models from the same center may exhibit similarities that make them indistinguishable. Consequently, CMIP5 models from the same modeling center tend to share comparable fingerprints.
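A dendrogram of this kind can be built by agglomerative clustering of the pairwise deviance matrix. The sketch below assumes complete linkage (the distance between two clusters is the largest deviance between their members); that linkage rule, the function name, the series labels, and the toy deviance values are all assumptions for illustration and do not reproduce the paper's results.

```python
def agglomerate(names, dev):
    """Complete-linkage clustering; returns a list of (members_a, members_b, height)."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: use the largest pairwise deviance between clusters.
                height = max(dev[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append(([names[i] for i in clusters[a]],
                       [names[i] for i in clusters[b]],
                       height))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges


# Toy symmetric deviance matrix (made-up values) for four hypothetical series.
names = ["obs_a", "obs_b", "model_x_a", "model_x_b"]
dev = [
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 7.0, 9.0],
    [9.0, 7.0, 0.0, 2.0],
    [8.0, 9.0, 2.0, 0.0],
]
merges = agglomerate(names, dev)
```

In this toy case the two segments from each source merge first at small heights, and the two sources merge last at a large height, which is the pairing pattern described above.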

Which models are closest to the observations? According to the dendrogram, after clustering the observational time series together, they are subsequently grouped with HadGEM2 models and then with NCAR models (CESM, CCSM, and FASTCHEM). Importantly, the leaves connecting to observations surpass the significance threshold, consistent with the findings shown in Fig. 4. This implies that the HadGEM2 and NCAR models are the closest to the observations, although they still clearly differ from observations.

A natural question is whether the difference in VARX models is due to differences in noise parameters, AR parameters, or annual cycle parameters. This question can be addressed by computing the decomposition in Eq. (

The above results show that the annual cycle in each CMIP5 model differs from observations and from other CMIP5 models, but they do not tell us

Sub-deviances that are insignificant according to the stepwise procedure are indicated by a dot, cross, and triangle for differences in noise, AR parameters, and annual cycle parameters, respectively. Only two models have insignificant differences in noise and AR parameters: HadGEM2-CC and GFDL-ESM2M. This means that the internal variability of these models is consistent with observations. To be fair, the noise and AR parameters of some other CMIP5 models are only marginally inconsistent with observations (e.g., the HadGEM2, MPI, and CMCC models). Also, if the observational reference is changed to the 1969–1993 period, then HadGEM2 is no longer consistent (not shown). In all cases, the annual cycle of each CMIP5 model differs significantly from observations.

For some CMIP5 models, differences in noise parameters explain a large fraction of the total deviance. To assess whether these models exhibit common discrepancies in whitened variance, we calculate the noise deviance between all possible combinations of models and observations. The corresponding results are shown in Fig.

We applied covariance discriminant analysis to determine whether a few spatial structures can explain the noise deviance. The resulting discriminant ratios are shown in Fig.

According to Fig.

This paper presents a methodology for determining whether two vector time series originate from the same stochastic process. Such a procedure can be used to address various climate-related questions, including assessing the realism of climate simulations and quantifying changes in climate variability over time. The stochastic process under consideration is assumed to be a vector autoregressive model with exogenous variables, referred to as VARX. In this study, the exogenous variable represents annual cycles in the mean. However, in other applications, it could capture nonstationary signals such as diurnal cycles, secular changes due to solar variability, volcanic eruptions, or human-induced climate change. This paper derives a likelihood ratio test for determining the equality of VARX parameters. Additionally, an associated stepwise procedure is developed to determine the equality of noise parameters, autoregressive parameters, and annual cycle parameters. The resulting procedure is not limited to the specific stochastic models employed in this study. Rather, the procedure is general and can be applied to a broader class of models, including those with non-periodic exogenous variables. Thus, these procedures provide a comprehensive framework for analyzing and comparing different aspects of climate time series.

Derivation of the above procedure follows an approach that is similar to the univariate case, but it is extended here to encompass multivariate applications. This extension necessitates the incorporation of bias corrections and the utilization of a Monte Carlo technique to estimate significance thresholds accurately. The Monte Carlo algorithm developed here is particularly efficient in that it uses eigenvalue methods to evaluate the ratio of determinants and avoids solving regression problems by sampling directly from the Wishart distribution. Discriminant techniques are employed to compress differences between VARX models into the minimal number of components, facilitating a more concise description. While similar techniques were introduced in previous parts of this paper series, this paper generalizes them to multivariate situations and to an arbitrary number of steps in the stepwise procedure. Consequently, the procedure and associated codes for this test supersede those discussed in earlier parts of the series, offering an improved and more comprehensive approach.
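To illustrate the idea of obtaining a significance threshold by sampling Wishart matrices directly, the sketch below estimates the upper quantile of a generic log-determinant statistic for the equality of two covariance matrices. It is an assumption-laden illustration: the statistic is the textbook likelihood ratio form rather than the paper's bias-corrected deviance, the log-determinants are computed from a Cholesky factor rather than by the eigenvalue method described above, and the dimension and degrees of freedom are hypothetical values.

```python
import math
import random


def wishart_identity(dim, dof, rng):
    """Sample W ~ Wishart(I, dof) with the Bartlett decomposition W = A A^T."""
    a = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        # Squared diagonal entry is chi-square with (dof - i) degrees of freedom.
        a[i][i] = math.sqrt(rng.gammavariate((dof - i) / 2.0, 2.0))
        for j in range(i):
            a[i][j] = rng.gauss(0.0, 1.0)
    return [[sum(a[i][k] * a[j][k] for k in range(dim)) for j in range(dim)]
            for i in range(dim)]


def log_det(m):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(m)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def covariance_deviance(w1, w2, dof1, dof2):
    """Generic log-determinant statistic comparing two covariance estimates."""
    dim = len(w1)
    pooled = [[(w1[i][j] + w2[i][j]) / (dof1 + dof2) for j in range(dim)]
              for i in range(dim)]
    s1 = [[v / dof1 for v in row] for row in w1]
    s2 = [[v / dof2 for v in row] for row in w2]
    return (dof1 + dof2) * log_det(pooled) - dof1 * log_det(s1) - dof2 * log_det(s2)


def monte_carlo_threshold(dim, dof1, dof2, alpha=0.01, trials=2000, seed=0):
    """Estimate the (1 - alpha) quantile of the statistic under the null hypothesis."""
    rng = random.Random(seed)
    stats = sorted(
        covariance_deviance(wishart_identity(dim, dof1, rng),
                            wishart_identity(dim, dof2, rng), dof1, dof2)
        for _ in range(trials)
    )
    return stats[min(trials, int(math.ceil((1.0 - alpha) * trials))) - 1]


# Hypothetical setting: 4 variables and 280 degrees of freedom per sample.
threshold = monte_carlo_threshold(dim=4, dof1=280, dof2=280, trials=500)
```

Because the null distribution of this statistic does not depend on the true covariance matrix, the sketch samples with an identity scale matrix, and no regression needs to be solved inside the Monte Carlo loop.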

The above procedure was implemented to compare monthly mean North Atlantic sea surface temperatures between CMIP5 models and observational data, taking into account their respective annual cycles. The analysis focused on the variability projected onto the first four Laplacian eigenvectors over the North Atlantic basin, which highlight the largest spatial scales within the region. To ensure that the residuals exhibited properties akin to white noise, a VARX model of at least second order was required for most CMIP5 models. The test results indicated that CMIP5 models not only differ stochastically from the observational data but also differ among themselves, except when models originate from the same modeling center. Differences among CMIP5 models are distinctive enough to serve as a fingerprint that differentiates a given model from any other model and from observational data.

The primary source of deviance from observations is disparities in annual cycles. To gain insight into the characteristics of these disparities, covariance discriminant analysis was employed to decompose deviance associated with annual cycles into uncorrelated components, ordered such that the first explains the largest portion of annual cycle deviance, the second explains the most deviance after the first has been removed, and so on. For certain CMIP5 models, the leading discriminant accounts for several times more annual cycle deviance than subsequent components. Specific examples of these leading discriminants were presented.

Although differences in annual cycles dominated the total deviance, differences in whitened variance were also significant across the majority of the models. Discriminant analysis revealed that most CMIP5 models underestimate whitened variance, with some models falling short by a factor of 5. A few models were found to overestimate whitened variance by a factor of 2 or more. The collective differences in whitened variances and covariances between Laplacian eigenvectors were sufficiently unique to serve as a secondary fingerprint. It is remarkable that such distinctive identifying information is encapsulated in time series even after removing serial correlations, annual cycles, and all other nonstationary signals such as trends.
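The leading variance ratio from a discriminant analysis of two noise covariance matrices can be sketched with a simple power iteration, as below. This is a minimal illustration under stated assumptions: the function names and the example covariance matrices are hypothetical, and power iteration recovers only the leading component, whereas a full analysis would use a generalized eigenvalue solver to obtain all components.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def quadratic_form(m, q):
    """Return q^T m q."""
    return sum(q[i] * m[i][j] * q[j] for i in range(len(q)) for j in range(len(q)))


def leading_variance_ratio(cov_num, cov_den, iterations=500):
    """Maximize (q^T cov_num q) / (q^T cov_den q) by power iteration."""
    dim = len(cov_num)
    q = [1.0 / (i + 1) for i in range(dim)]  # arbitrary nonzero starting vector
    for _ in range(iterations):
        aq = [sum(cov_num[i][j] * q[j] for j in range(dim)) for i in range(dim)]
        q = solve_linear(cov_den, aq)  # applies inverse(cov_den) @ cov_num to q
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    return quadratic_form(cov_num, q) / quadratic_form(cov_den, q), q


# Hypothetical noise covariances: model in the numerator, observations in the denominator.
cov_model = [[1.0, 0.3], [0.3, 0.5]]
cov_obs = [[2.0, 0.1], [0.1, 1.0]]
ratio, weights = leading_variance_ratio(cov_model, cov_obs)
```

With the model covariance in the numerator, a leading ratio below one would indicate that the model underestimates whitened variance in every direction, while swapping the arguments gives the direction of greatest underestimation.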

Differences in autoregressive parameters accounted for only a minor portion of the overall deviance. Only two models (HadGEM2-CC and GFDL-ESM2M) displayed noise and AR parameters that were consistent with observations, suggesting that their internal variability was realistically represented. Nevertheless, all the models, including these two, exhibited unrealistic annual cycles.

The method discussed in this paper can analyze only relatively low-dimensional systems (for instance, the VARX model used to compare North Atlantic variability examined only four Laplacian eigenfunctions). It may be possible to combine some aspects of this approach with machine learning methods to greatly expand the number of variables that can be compared.

A standard method for testing hypotheses in VAR models is the maximum likelihood method

Although our goal is to describe the procedure for comparing the regression models (

Samples from VARX(

If

Summary of the first few hypotheses in the stepwise procedure. Parameters

A key quantity in the stepwise procedure is the number of parameters estimated under the

The initial hypothesis in the stepwise test procedure is

Procedures for testing hypotheses

Under hypotheses

As in Sect.

This section describes the computation of sub-deviances using eigenvalue methods. These methods effectively handle underflow and overflow issues and seamlessly integrate with the diagnostic procedures discussed in Sect.

To evaluate

For sufficiently large

The distributions of

Accordingly, we draw random matrices

Because the hypotheses in Table

Because sub-deviances are stochastically independent, the family-wise error rate associated with multiple testing can be constrained. In this paper, we fix the type-1 error rate of each sub-test to

The preferred order of testing the hypotheses listed in Table

If

We seek the linear combination of variables that maximizes the sub-deviance

Let the eigenvalues of Eq. (

The above results show that CDA decomposes sub-deviance into a sum of deviances between variates. The sampling distribution of the leading eigenvalue

Substituting Eq. (

Solving Eq. (

Summary of stochastic decomposition of

A particularly instructive decomposition that follows from Eq. (

The above decomposition is sensible when

The purpose of this Appendix is to prove that

We now show that

In this notation,

Hypothesis

Substituting Eq. (

R codes for performing the statistical test described in this paper are available at

The data used in this paper are publicly available from the CMIP5 archive at

Both authors participated in the writing and editing of the manuscript. TD performed the numerical calculations.

The contact author has declared that neither of the authors has any competing interests.

The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies. Some textual passages in this work have been modified for clarity using AI tools.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

This research was supported by the National Oceanic and Atmospheric Administration (grant no. NA20OAR4310401).

This paper was edited by Likun Zhang and reviewed by three anonymous referees.