This paper derives a test for deciding whether two time series come from the same stochastic model when the time series contain periodic and serially correlated components. Such a test is useful for comparing dynamical model simulations to observations. The framework for deriving the test is the same as in the previous three parts of this series: the time series are first fit to separate autoregressive models, and then the hypothesis that their parameters are equal is tested. This paper generalizes the previous tests to a limited class of nonstationary processes, namely, those represented by an autoregressive model with deterministic forcing terms. The statistic for testing differences in parameters can be decomposed into independent terms that quantify differences in noise variance, differences in autoregression parameters, and differences in forcing parameters (e.g., differences in annual cycle forcing). A hierarchical procedure for testing individual terms and quantifying the overall significance level is derived from standard methods. The test is applied to compare observations of the meridional overturning circulation from the RAPID array to Coupled Model Intercomparison Project Phase 5 (CMIP5) models. Most CMIP5 models are inconsistent with observations, with the strongest differences arising from having too little noise variance, though differences in annual cycle forcing also contribute significantly to the discrepancies. This appears to be the first use of a rigorous criterion to decide “equality of annual cycles” with regard to all their attributes (e.g., phases, amplitudes, frequencies) while accounting for serial correlation.

This is Part 4 of a series of papers on comparing climate time series that are serially correlated. In each of these papers, the basic idea is to fit time series to separate autoregressive (AR) models and then test whether the parameters of the two AR models are equal. A rigorous statistical test was derived for univariate time series (

Many climate time series exhibit nonstationary variability, including diurnal cycles, annual cycles, and long-term trends. An established technique for comparing nonstationary variability between models and observations is optimal fingerprinting

In the case of annual and diurnal cycles, no standard test exists for deciding whether such cycles are consistent between two data sets. To be sure, some studies test certain aspects of the diurnal or annual cycles. For instance, some studies have shown that the phase and amplitude of the annual cycle of temperature changed over the past half century

A standard approach to accounting for seasonality in a time series is to filter it out by subtracting the value from one year earlier and then modeling the resulting residuals by an ARMA model
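The seasonal-differencing filter described above can be sketched in a few lines. This is an illustrative Python fragment (not the authors' R code), with a generic function name of our choosing:

```python
import numpy as np

def seasonal_difference(x, period=12):
    """Filter out seasonality by subtracting the value one period earlier.

    The result (length len(x) - period) would then be fit with an ARMA
    model under the standard approach described in the text.
    """
    x = np.asarray(x, dtype=float)
    return x[period:] - x[:-period]

# A pure 12-month cycle is removed exactly by the filter.
t = np.arange(48)
resid = seasonal_difference(np.cos(2 * np.pi * t / 12))
print(np.allclose(resid, 0.0))  # True: the annual cycle differences to zero
```

Note that the filter also shortens the series and colors the noise, which is one reason the present paper models the cycle explicitly instead.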

Our starting point is to assume that a climate time series

The parameters in Eq. (

If the covariance matrix is known, then MLE leads to generalized least squares (GLS). In practice, however, the covariance matrix of

A model of form (

Model (

While our goal is to derive a hypothesis test, our interest is not limited to the mere decision to accept or reject a null hypothesis. After all, we know before looking at any data that the statistical model is too simple to be a complete model of reality. Rather, our goal is to quantify differences in variability between two data sets. Numerous choices exist for measuring differences in variability, but if one is not careful, one might choose a measure with such poor statistical properties as to be useless. For instance, the measure may have a large sampling variance, in which case differences in the measure may be dominated by sampling variability rather than by real differences in the underlying process. This is a real danger for serially correlated processes, as the variance of a statistic tends to increase with the degree of autocorrelation. The virtue of deriving a hypothesis test from a rigorous statistical framework (i.e., the maximum likelihood method) is that it yields a measure with attractive statistical properties, such as having minimum variance in some sense and a well-defined significance test.

Of course, model (

The test derived here ought to be useful for the development of dynamical models. In dynamical model development, limited computational resources mean that only short runs are possible, where annual and diurnal cycles might be the only meaningful differences. In other situations, the observational record is the limiting factor. For instance, sub-annual measurements of the meridional overturning circulation (MOC) have been available only recently

We first show that models (

Incidentally, it is also possible to prove that all solutions of Eq. (

Thus, Eqs. (

If we detect a difference in parameters, we do not know whether the difference is dominated by a difference in AR parameters or in cycle parameters. To isolate the source, tests on subcomponents of the model need to be performed. At this step, a difference arises between testing equality of

The periodic response of a noise-free AR(1) model to sinusoidal forcing for AR parameters
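The periodic response referenced here can be computed in closed form from the AR(1) transfer function. The following Python sketch (our own notation `phi`, `omega`, not the paper's) checks the closed-form amplitude against a direct simulation of the recursion:

```python
import numpy as np

def ar1_sinusoid_response(phi, omega):
    """Amplitude gain and phase lag of the steady periodic response of a
    noise-free AR(1) model x_t = phi * x_{t-1} + cos(omega * t).
    Generic illustration, not the paper's notation."""
    H = 1.0 / (1.0 - phi * np.exp(-1j * omega))
    return np.abs(H), -np.angle(H)

phi, omega = 0.7, 2 * np.pi / 12          # annual frequency for monthly data
x = np.zeros(2000)
for n in range(1, 2000):                   # simulate the noise-free recursion
    x[n] = phi * x[n - 1] + np.cos(omega * n)
gain, lag = ar1_sinusoid_response(phi, omega)
# RMS-based amplitude over 40 complete cycles of the steady response
amp = np.sqrt(2 * np.mean(x[-480:] ** 2))
print(abs(gain - amp) < 1e-6)  # True: formula matches simulation
```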

The standard method for estimating ARX models is the maximum likelihood method
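For Gaussian noise, conditional maximum likelihood for an ARX model reduces to least squares on lagged values plus harmonic regressors. A self-contained Python sketch (function and variable names are ours; the authors provide R code, see the code availability statement):

```python
import numpy as np

def fit_arx(x, p, nharm, period=12):
    """Conditional least-squares fit of an ARX(p, nharm) model: an
    intercept, lags 1..p of x, and nharm annual harmonics. For Gaussian
    noise this coincides with conditional maximum likelihood."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(p, n)
    cols = [np.ones(n - p)]
    cols += [x[p - k : n - k] for k in range(1, p + 1)]   # AR lags
    for j in range(1, nharm + 1):                          # annual harmonics
        cols.append(np.cos(2 * np.pi * j * t / period))
        cols.append(np.sin(2 * np.pi * j * t / period))
    X = np.column_stack(cols)
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)                        # MLE of noise variance
    return beta, sigma2

# Demo: recover the AR coefficient of a simulated AR(1) + annual cycle.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for i in range(1, n):
    x[i] = 0.6 * x[i - 1] + np.cos(2 * np.pi * i / 12) + rng.standard_normal()
beta, sigma2 = fit_arx(x, p=1, nharm=1)
print(abs(beta[1] - 0.6) < 0.05)  # True: AR coefficient recovered
```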

Tests for equality of regression parameters have appeared previously

The general problem is to compare models of the form

Model (

For reasons discussed in Sect.

Note that

Table summarizing the hypotheses considered in the hierarchical test procedure.

The maximized likelihoods for hypotheses

The tests associated with

The deviances satisfy the identity

Because the hypotheses are nested and
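Although the paper's equations are elided in this excerpt, the additivity that makes the hierarchical procedure work can be illustrated with the generic Gaussian deviance, n log(RSS_restricted / RSS_full): for nested hypotheses the component deviances sum to the total. A minimal sketch with invented numbers (our notation, not the paper's exact statistic):

```python
import numpy as np

def gaussian_deviance(rss_restricted, rss_full, n):
    """Deviance n*log(RSS_restricted/RSS_full), i.e., twice the log-likelihood
    ratio for Gaussian models with the noise variance profiled out."""
    return n * np.log(rss_restricted / rss_full)

# For a nested chain H0 in H1 in H2, component deviances add up to the
# total; each component is compared to its own chi-squared threshold.
n, rss0, rss1, rss2 = 190.0, 12.0, 10.0, 9.0   # illustrative numbers
d01 = gaussian_deviance(rss0, rss1, n)
d12 = gaussian_deviance(rss1, rss2, n)
d02 = gaussian_deviance(rss0, rss2, n)
print(np.isclose(d01 + d12, d02))  # True: deviances are additive
```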

Once a significant difference in a component of the ARX model is identified, there still remains the question of precisely how that component differs between ARX models. In Appendix

Although we have explored various diagnostics for optimally decomposing AR deviances and cycle deviances, we found it much more instructive to simply plot the corresponding autocorrelation function or the annual cycle response of the ARX model, and indicate the ones that differ significantly from that derived from observations. The autocorrelation of an AR(
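The autocorrelation function implied by a set of AR coefficients can be computed directly from the MA(∞) weights. A small Python sketch (our notation; the truncation length is an arbitrary choice):

```python
import numpy as np

def ar_acf(phi, nlags, nterms=5000):
    """Theoretical autocorrelation of a stationary AR(p) model with
    coefficients phi = (phi_1, ..., phi_p), computed from the MA(inf)
    weights psi_j, truncated at nterms."""
    p = len(phi)
    psi = np.zeros(nterms)
    psi[0] = 1.0
    for j in range(1, nterms):
        psi[j] = sum(phi[k] * psi[j - 1 - k] for k in range(min(p, j)))
    # autocovariance at lag k is sum_j psi_j * psi_{j+k}
    gamma = np.array([psi[: nterms - k] @ psi[k:] for k in range(nlags + 1)])
    return gamma / gamma[0]

acf = ar_acf([0.6], nlags=5)
print(np.allclose(acf, 0.6 ** np.arange(6)))  # True: AR(1) ACF is phi**k
```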

To apply our procedure, the order of the AR process and the number of harmonics to include in the model must be specified. These must be the same in the two models being compared; otherwise, we know the processes differ and there is no need to perform the test. Whatever criterion is used, it inevitably chooses different orders and a different number of harmonics for different data sets. In such cases, we choose the highest order and the largest number of harmonics among the results. Our rationale is that underfitting is more serious than overfitting because underfitting leads to residuals with serial correlations that invalidate the distributional assumptions. In contrast, overfitting is taken into account by the test because the test makes no assumption about the value of the regression coefficients, and therefore it includes the case of overfitting in which some coefficients vanish. The main detrimental effect of overfitting is to reduce statistical power: i.e., for a given difference in ARX models, the difference becomes harder to detect as the degree of overfitting increases. This loss of power is not a serious concern in this study because, for our data, differences grow rapidly with the number of predictors.

We choose the order and number of harmonics using a criterion based on a corrected version of Akaike's information criterion
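The paper's AICm criterion is not reproduced in this excerpt; as a stand-in, the following Python sketch selects an AR order with the standard corrected AIC (AICc), which conveys the same minimize-a-penalized-fit idea:

```python
import numpy as np

def aicc_ar(x, p):
    """AICc for an AR(p) fit by conditional least squares. The standard
    corrected AIC is used here as a stand-in for the paper's AICm; note
    the effective sample size shrinks slightly with p in this sketch."""
    x = np.asarray(x, dtype=float)
    n = len(x) - p
    X = np.column_stack([np.ones(n)]
                        + [x[p - k : len(x) - k] for k in range(1, p + 1)])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = p + 2                    # AR coefficients + intercept + noise variance
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)

# Choose the order minimizing AICc on a simulated AR(2) series.
rng = np.random.default_rng(1)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.5 * x[t - 1] + 0.3 * x[t - 2] + rng.standard_normal()
best = min(range(1, 7), key=lambda p: aicc_ar(x, p))
print(best, aicc_ar(x, 2) < aicc_ar(x, 1))
```

As the text notes, mild overfitting (choosing `best` slightly above 2) is less harmful here than underfitting, which leaves serially correlated residuals.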

We analyze the net transport of the Atlantic meridional overturning circulation (AMOC) at 26

The first observation from RAPID occurs in April. We sample a 190-month sequence (called the “first half”) from each CMIP5 control simulation starting from the first April in the last 500 years of simulation. Then, the subsequent 190-month sequence (called the “second half”) is used as an independent sample. Note that the second sample does not begin in April because 190 is not divisible by 12. This difference in phase

Monthly time series of the maximum transport at 26

The time series under investigation are shown in Fig.

As discussed earlier, the order of the AR model and the number of annual harmonics are selected by minimizing a criterion called AICm

If the ARX(3,5) model is adequate, then the residuals should resemble Gaussian white noise. To check this, we show the autocorrelation function of the residuals of the ARX(3,5) models in Fig.

The AICm for fitting time series from the CanESM2 model to ARX models of the form (

The AICm for fitting time series from each CMIP5 model to ARX models of the form (

The autocorrelation function of the residuals of the ARX models. The horizontal dashed lines show the upper and lower 5 % confidence limits for zero correlation.

Having chosen the ARX(3,5) model, we next fit time series from observations and from a CMIP5 model and evaluate the total deviance

Total deviance between CMIP5 simulations and observational time series of the AMOC. Each time series is 190 months long and modeled by an ARX(3,5) model. The horizontal grey line shows the 5 % significance threshold.

The total deviance between the time series shown in Fig.

Although our test is rigorous, it makes asymptotic approximations whose validity may be questioned for our particular sample. One exercise for building further confidence is to compare each time series not to observations, but to time series from other models. In such a comparison, we expect deviances to be small when time series come from the same CMIP5 model. To check this, we compare each 190-month time series to an independent 190-month time series from the CMIP5 models. The resulting deviances are shown in Fig.

In addition to the similarities along the diagonal, different models from the same center also have insignificant deviances. For instance, the two NCAR models (CCSM4 and CESM1-BGC) and the three Max Planck models (MPI-LR, MPI-MR, MPI-P) have insignificant deviances. Beyond this, no other similarities are found. This example suggests that the deviance between 16-year AMOC indices could be used to decide whether two given time series came from dynamical models developed at the same center.

An alternative approach to summarizing these results is a dendrogram. A dendrogram visualizes the distance matrix in a way that makes multiple clusters easy to identify. It is constructed by linking elements together based on similarity. Here, distance is measured by total deviance
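As a sketch of the construction, SciPy's hierarchical clustering can turn a distance matrix into a dendrogram; the matrix below is invented for illustration and is not the paper's deviance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical symmetric "total deviance" matrix among four models.
D = np.array([[0.0, 1.0, 6.0, 6.5],
              [1.0, 0.0, 6.2, 6.8],
              [6.0, 6.2, 0.0, 1.2],
              [6.5, 6.8, 1.2, 0.0]])
# Average-linkage clustering on the condensed distance matrix;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree.
Z = linkage(squareform(D), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With these numbers, the first two and last two "models" merge early and form two well-separated clusters, mirroring how same-center models cluster in the paper.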

A dendrogram showing clusters based on total deviance

The resulting dendrogram is shown in Fig.

Although significant differences from observations have been detected, the test does not tell us the nature of those differences. To gain insight into the source of the differences, we decompose the deviance as in Eq. (

The AR deviance

Although we have detected a significant difference in noise variance, we do not know the direction of this difference. As discussed in Sect.
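The direction of a variance difference can be read off the ratio of estimated noise variances together with two-sided F critical values. A hedged Python sketch (degrees of freedom and data are illustrative; the paper's exact test statistic may differ):

```python
import numpy as np
from scipy import stats

def variance_ratio_test(resid1, resid2, df1, df2, alpha=0.05):
    """Ratio of estimated noise variances with two-sided F critical values.
    A ratio below the lower limit means series 1 has significantly *less*
    noise variance than series 2."""
    s1 = resid1 @ resid1 / df1
    s2 = resid2 @ resid2 / df2
    lo = stats.f.ppf(alpha / 2, df1, df2)
    hi = stats.f.ppf(1 - alpha / 2, df1, df2)
    return s1 / s2, lo, hi

rng = np.random.default_rng(2)
r_model = 0.5 * rng.standard_normal(180)   # model-like residuals: too little noise
r_obs = rng.standard_normal(180)           # observation-like residuals
ratio, lo, hi = variance_ratio_test(r_model, r_obs, 180, 180)
print(ratio < lo)  # True: significantly less variance in the "model" series
```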

Ratio of noise variances between time series on the

The autocorrelation function from each ARX(3,5) model. Dots indicate ACFs that differ significantly from that of observations at the 5 % level for FWER. The same color scheme as used in previous figures is used, including black for observations.

Two-hundred realizations of the sample autocorrelation function from the ARX(3,5) estimated from observations (black curves), and the autocorrelation function for CNRM-CM5 (red), which was found to differ significantly from observations.

Next, we consider differences in AR parameters. According to Fig.

Recall that the testing procedure stops when different noise variances are detected. Despite this, if we proceed to test differences in annual cycles, it should be recognized that the sampling distribution of the cycle deviance depends on the ratio of noise variances. Monte Carlo experiments discussed in Appendix

The annual cycle response of each ARX(3,5) model estimated from 190-month time series from CMIP5 models (colored curves) and from observations (black curve). Dots indicate annual cycles that differ significantly from that of observations at the FWER of 5 %. Note that the dots have the opposite meaning to those in Fig. 11.

According to Fig.

In this paper, we presented a test for comparing a limited class of nonstationary stochastic processes, namely, processes with deterministic signals, such as annual or diurnal cycles. The strategy was to introduce periodic deterministic terms in an autoregressive model, yielding an ARX model, and then to test for differences in the parameters. A test for equality of noise variances must precede other tests, otherwise the subsequent tests will depend on the ratio of variances, which is an unknown population parameter. This situation is similar to the

If a difference in parameters is detected, then it is of interest to diagnose the nature of the difference. The statistic for testing differences in parameters can be decomposed into independent terms that quantify differences in noise, differences in AR parameters, and differences in deterministic forcing. Furthermore, each of these terms can be diagnosed fairly easily in a univariate setting. For instance, differences in noise variances can be characterized by the ratio of noise variances, and differences in AR parameters can be characterized by differences in autocorrelation functions associated with the ARX models.

We applied the above procedure to compare observations of the MOC from the RAPID array to CMIP5 models, treating the annual cycle as the response to deterministic forcing. The observational record is about 16 years (more precisely, 190 months) and is considered sufficiently short to ignore anthropogenic climate change. To apply the procedure, the order of the AR process and the number of annual harmonics need to be chosen. We selected these parameters using a criterion called AICm, which is a generalization of Akaike's information criterion to a mixture of deterministic and random predictors. This criterion suggested choosing five annual harmonics and a third-order AR process, hence an ARX(3,5) model.

The total deviance between observations and CMIP5 models was evaluated and indicated that only three models (all from MPI) generated simulations consistent with observations. As a check on the statistical test, we compared the 190-month time series from each CMIP5 model to another independent set of time series from the CMIP5 models. We confirmed that time series from the same CMIP5 model had small deviances (the CanESM2 model had only marginally significant deviances). Interestingly, this analysis revealed that each CMIP5 model differed from every other CMIP5 model, unless the model came from the same modeling center (e.g., Max Planck or NCAR). It seems remarkable that 16 years of AMOC observations, at one latitude, is enough to distinguish CMIP5 models.

The total deviance is dominated by differences in noise variance and cycle parameters, although the relative contribution depends on the CMIP5 model. Differences in AR parameters were small for our data, though in other situations they may play a larger role. For models with the most extreme deviance, the noise deviance is the dominant contributor. In all cases, the noise deviance arises because models have too little noise variance compared to observations. The cycle deviance can be diagnosed by plotting the annual cycle response from each ARX model and indicating the cycles that differ significantly from observations. Although such plots have been presented in the past, this appears to be the first use of an objective criterion to identify annual cycles that differ significantly from observed annual cycles with regard to all their attributes (e.g., phases, amplitudes, frequencies) while accounting for serial correlation.

Although we have framed our procedure in terms of annual cycles, it should be recognized that the procedure applies to

Here we show how Eq. (

In this Appendix, we describe the likelihood ratio tests for

Under

The model selection criterion AICm under

Under

Each estimation problem requires estimation of

If

Testing the hypotheses in Table

As a result of Eq. (

The test derived from Eq. (

A further point is that comparing three hypotheses

To prove the independence of the

Note that if equality of noise variance

We end by defining the associated deviance statistics and their distributions. The deviance is defined as
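For orientation, the generic likelihood-ratio form that a deviance statistic takes (our notation, with $\Theta_0 \subset \Theta_1$ the restricted and full parameter sets) is:

```latex
D \;=\; 2\left(\max_{\theta \in \Theta_1} \ell(\theta) \;-\; \max_{\theta \in \Theta_0} \ell(\theta)\right),
\qquad
D \xrightarrow{\;d\;} \chi^2_{q},
\qquad
q = \dim \Theta_1 - \dim \Theta_0,
```

where $\ell$ denotes the log-likelihood and the chi-squared limit holds asymptotically under the restricted hypothesis (Wilks' theorem).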

The sampling distribution of

Because the factors

In this Appendix, we quantify the sensitivity of the significance threshold of

Upper 5th percentiles of the cycle deviance

The results of these experiments are shown in Fig.

R codes for performing the statistical test described in this paper are available at

The data used in this paper are publicly available from the CMIP5 archive at

Both authors participated in the writing and editing of the manuscript. TD performed the numerical calculations.

The contact author has declared that neither of the authors has any competing interests.

The views expressed in this work are those of the authors and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research has been supported by the National Oceanic and Atmospheric Administration (grant no. NA16OAR4310175).

This paper was edited by Seung-Ki Min and reviewed by three anonymous referees.