Given uncertainties in physical theory and numerical climate simulations, the historical temperature record is often used as a source of empirical information about climate change. Many historical trend analyses appear to de-emphasize physical and statistical assumptions: examples include regression models that treat time rather than radiative forcing as the relevant covariate, and time series methods that account for internal variability in nonparametric rather than parametric ways. However, given a limited data record and the presence of internal variability, estimating radiatively forced temperature trends in the historical record necessarily requires some assumptions. Ostensibly empirical methods can also involve an inherent conflict in assumptions: they require data records that are short enough for naive trend models to be applicable, but long enough for long-timescale internal variability to be accounted for. In the context of global mean temperatures, empirical methods that appear to de-emphasize assumptions can therefore produce misleading inferences, because the trend over the twentieth century is complex and the scale of temporal correlation is long relative to the length of the data record. We illustrate here how a simple but physically motivated trend model can provide better-fitting and more broadly applicable trend estimates and can allow for a wider array of questions to be addressed. In particular, the model allows one to distinguish, within a single statistical framework, between uncertainties in the shorter-term vs. longer-term response to radiative forcing, with implications not only for historical trends but also for uncertainties in future projections. We also investigate how the choice of statistical description of internal variability affects inferred uncertainties.
While nonparametric methods may seem to avoid making explicit assumptions, we demonstrate how even misspecified parametric statistical methods, if attuned to the important characteristics of internal variability, can result in more accurate uncertainty statements about trends.

The physical basis of climate change is understood through a combination of
theory, numerical simulations and analyses of historical data. Climate change
is driven by radiative forcing, a change in net radiation (downwelling minus
upwelling, often specified at the top of atmosphere) resulting from an
imposed perturbation of a climate in equilibrium, for
example by increasing the
atmospheric concentration of a greenhouse gas. The Earth's response to
forcing is complex and not fully understood, in part due to physical
uncertainties in important feedbacks such as cloud responses (see the
assessment reports of the Intergovernmental Panel on Climate Change (IPCC),
e.g.,

Given the physical uncertainties inherent in all climate simulations, the
observed temperature record since the late nineteenth century is often used
as a source of empirical information about the Earth's systematic response to
forcing. (Figure

It can be helpful to divide approaches to using the observed temperature
record to understand aspects of mean climate change into two categories. One
common approach involves assuming a physical model of the system. (Here the
term “model” encompasses anything from a simple energy balance model to a
very complicated atmosphere–ocean general circulation model (GCM).) Analysis
then may involve estimating statistical parameters in order to best fit the
observed record. Estimated parameters have statistical uncertainty because of
the finite observational record and the internal variability inherent in the
climate system, even were the model a perfect representation of reality. Analyses of observed
temperatures using simple or moderately complex physical models
include

Another category of approaches involves analyses that are more empirical and
appear to de-emphasize assumptions about the underlying physics generating the
observed temperatures. Many studies use regression models that treat time
rather than radiative forcing as the covariate. This practice is often used,
for example, to test for significant warming (e.g.,

In both categories, analyses require characterizing internal variability for
the purpose of quantifying uncertainty, a task that also involves
assumptions. A typical approach is to assume a statistical model for the
dependence structure of the noise, such as assuming an autoregressive moving
average (ARMA) noise model with a small number of parameters, which is fit to
the residuals of whatever trend model is being used. Some authors, however,
argue for nonparametric (resampling or subsampling) methods for time series
rather than parametric approaches (e.g.,

In this work we consider a parametric method to be any that makes explicit assumptions about the functional form of the probability distribution that the data come from, described with a finite number of statistical parameters. In particular, for the purposes of this work, we consider the class of low-order ARMA models to be a parametric class. We consider a nonparametric method to be one that attempts to make fewer distributional assumptions about the data and does not involve a parametrized statistical model.
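To make the parametric side of this distinction concrete, the following sketch fits an AR(1) noise model to a series via a Yule-Walker-type estimate based on the lag-1 sample autocovariance. This is a generic Python illustration on synthetic data, not the estimation procedure used later in the paper.

```python
import numpy as np

def fit_ar1(residuals):
    """Estimate an AR(1) noise model x_t = phi * x_{t-1} + eps_t from
    trend-model residuals via the lag-1 sample autocovariance
    (a Yule-Walker-type estimate)."""
    x = np.asarray(residuals, dtype=float) - np.mean(residuals)
    phi = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    sigma2 = np.var(x) * (1.0 - phi ** 2)  # implied innovation variance
    return phi, sigma2

# Check on synthetic AR(1) noise with known phi = 0.6
rng = np.random.default_rng(0)
n, phi_true = 2000, 0.6
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()
phi_hat, sigma2_hat = fit_ar1(x)
```

In practice the order of the ARMA model would be chosen in consultation with diagnostics such as the periodogram, as discussed later in the paper.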

The argument is again that these approaches are advantageous because they are ostensibly objective and require fewer assumptions. However, methods that de-emphasize assumptions, be they physical or statistical, can be problematic in the climate setting. While regressions in time are simple to apply and do not appear to make explicit assumptions about how temperatures should respond to forcing, these models both limit what can be learned from the data and can result in misleading inferences. Regressions in time are sensitive to arbitrary choices (such as the start and end dates of the data analyzed), cannot be expected to apply over even modestly long time frames, and cannot in general reliably separate forced trends from internal variability. Furthermore, in accounting for internal variability, nonparametric methods for time series often require long data records to work well and can be seriously uncalibrated in data-limited settings with strong temporal correlation, such as the setting we are discussing.

In the following, we illustrate two primary points. First, we show that
targeted parametric mean models that incorporate even limited physical
information can provide better-fitting, more interpretable, and more
illuminating descriptions of the systematic response of interest compared to
approaches that de-emphasize assumptions. Second, we show that parametric
models for residual (i.e., internal) variation can provide for safer and more
accurate uncertainty quantifications in this setting than do approaches that
de-emphasize assumptions, even if the parametric model is misspecified, as
long as the parametric modeling is done with particular attention towards the
representation of low-frequency internal variability. We believe that the
analysis that we present is informative, even if not maximally so, and we
attempt to highlight both complications with our analysis as well as
important sources of information about global warming that are ignored in our
approach. Parts of our analysis share similarities with others listed above,
especially with

This article is organized as follows. In Sect.

This analysis requires estimates of historical global mean temperatures and radiative forcings. To the extent that we are interested in how temperatures may evolve in the future (and how uncertainty in the response to radiative forcing evolves as more data are observed), we also need radiative forcings associated with a plausible future scenario.

We use the Land–Ocean Temperature Index from the NASA Goddard Institute for
Space Studies (GISS)

To evaluate the effects of uncertainty in the temperature record, we repeat a
small portion of our analysis using the HadCRUT4 global annual temperature
ensemble

The primary driver of climate change during
the historical period is changing atmospheric CO

While historical concentrations of CO

For a plausible future radiative forcing scenario, we use the extended
Representative Concentration Pathway scenario 8.5
(RCP8.5)

The analysis here focuses on information
provided solely by the global mean temperature record and assumed known
forcings. We do not use additional potential sources of empirical
information, including estimates of ocean heat uptake (discussed in, e.g.,

Evaluating the systematic response of global mean surface temperatures to
forcing is complicated by the long timescales for warming of the Earth
system. Because the Earth's climate takes time to equilibrate, the near-term
(transient or centennial-scale

Since the term

A common framework is to decompose observed temperatures into two components:
a systematic component changing in response to past forcings and a residual
component representing sources of internal variability. That is, for global
mean temperatures

Estimating a model like Eq. (

The linear time trend model is widely used, and the general sense is that
such a model offers a way of testing for statistically significant changes in
mean temperature without having to make physical assumptions and without
having to believe that the true forced response is linear in time (e.g.,

While the time trend model may be routine to apply, appear objective, and
provide a good fit to the data, its use can be precarious. A proper
accounting of uncertainty in mean temperature changes relies on
distinguishing internal variability from systematic responses. The time trend
model is problematic in this respect. If the chosen time interval is short,
it can be difficult to distinguish between trends and sources of internal
variability that are correlated over longer timescales than the chosen
interval (implicitly recognized in, e.g.,

Because the time trend model cannot be applied over long time intervals for
arbitrary forcing scenarios, it also does not have a property that may be
considered important for making inferences: that we can learn more about the
systematic trend of interest by collecting more observations. There will be
only a finite amount of information about the systematic response within the
interval

Some argue that many of these problems may be overcome by using a model that is nonlinear in time, such as a spline or other nonparametric regression method. (The IPCC, for example, appears to view nonparametric extensions as more generically appropriate than the linear model.) Nonparametric regressions in time will appear to provide an even better fit to the data than the linear trend model, but many of the above arguments carry over to this setting. Such models have limited interpretational value or ability to capture systematic (non-internal) trends, since they cannot generically be expected to distinguish between the systematic trends of interest and other, internal sources of long-timescale variation in the data. Collectively, these arguments suggest that it is advisable to seek better motivated models if one is interested in understanding the systematic response of global temperatures to forcing.

A typical approach is to use more complex models, including full GCMs, to
explain the systematic response of interest. (Model output is also used in
concert with observations in the context of “detection and attribution”
studies; see, e.g., chap. 10 of

A commonly used, very simplified physical model for the response to an
instantaneous change in radiative forcing is that temperatures approach their
new equilibrium in exponential decay. That is, writing
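The exponential-adjustment idea can be sketched numerically as follows. The discretization, and the sensitivity and timescale values, are illustrative assumptions for this sketch, not estimates from our analysis.

```python
import numpy as np

def exponential_response(forcing, sensitivity, tau):
    """Discretized solution of tau * dT/dt = sensitivity * F(t) - T:
    mean temperature relaxes exponentially, with timescale tau, toward
    the equilibrium value sensitivity * F."""
    T = np.zeros(len(forcing))
    decay = np.exp(-1.0 / tau)  # one time step per entry of forcing
    for t in range(1, len(forcing)):
        # exact update when the forcing is constant over the step
        T[t] = decay * T[t - 1] + (1.0 - decay) * sensitivity * forcing[t]
    return T

# Step forcing: the response rises toward sensitivity * F and, after
# many multiples of tau, effectively reaches equilibrium.
F = np.ones(500)
T = exponential_response(F, sensitivity=0.8, tau=4.0)
```

Under a step change in forcing, the response approaches its new equilibrium but never overshoots it, which is the qualitative behavior assumed in the trend model.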

We also use a model based on Eq. (

The change in mean
temperature associated with a doubling of CO

Our approach differs from those in

In building model (

To illustrate the use of model (

Comparison of the fitted values for model (

In this section, we illustrate what can be learned by applying the simple
model (

To diagnose features of internal variability, spectral analysis is an
intuitive framework, since the frequency properties of internal variability
are tied to uncertainties in trends: uncertainty in smooth trends is more
strongly affected by low-frequency than high-frequency internal variability.
Figure
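As an illustration of this diagnostic, the sketch below computes a raw periodogram and shows the concentration of power at low frequencies for a strongly positively correlated AR(1) series; the AR(1) parameter is hypothetical, not one estimated from the temperature record.

```python
import numpy as np

def raw_periodogram(x):
    """Raw periodogram I(f_j) = |DFT(x)_j|^2 / n at the Fourier
    frequencies f_j = j/n, j = 1, ..., floor(n/2)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    I = np.abs(np.fft.rfft(x)) ** 2 / n
    freqs = np.fft.rfftfreq(n)
    return freqs[1:], I[1:]  # drop the zero frequency

# Strongly positively correlated AR(1) noise concentrates power at low
# frequencies, the part of the spectrum that controls trend uncertainty.
rng = np.random.default_rng(1)
n, phi = 4096, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
freqs, I = raw_periodogram(x)
```

Comparing such a periodogram with the theoretical spectral density of a candidate noise model is one way to check whether the model adequately represents low-frequency variability.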

Using the fully parametric model, combining Eq. (

A parametric bootstrap involves generating repeated, synthetic simulations under the fitted statistical model and then refitting the model to each simulated time series to obtain new estimates of model parameters. The distribution of those estimates then gives a measure of the uncertainty in the original estimates.
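The procedure can be sketched as follows for a deliberately simplified model: a linear trend with AR(1) noise, standing in for the trend model used in the text. All parameter values below are hypothetical illustrations.

```python
import numpy as np

def gen_ar1(n, phi, sigma, rng):
    """Simulate an AR(1) series (started at zero for simplicity)."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

def fit_trend_ar1(y, t):
    """OLS line plus an AR(1) fit to its residuals -- a simple
    stand-in for the trend-plus-noise models discussed in the text."""
    X = np.column_stack([np.ones_like(t), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    phi = np.sum(r[1:] * r[:-1]) / np.sum(r[:-1] ** 2)
    sigma = np.sqrt(np.var(r) * (1.0 - phi ** 2))
    return beta, phi, sigma

def parametric_bootstrap_slope(y, t, n_boot=500, seed=0):
    """Refit the model to simulations from the fitted model; the
    spread of the refitted slopes measures uncertainty in the slope."""
    rng = np.random.default_rng(seed)
    beta, phi, sigma = fit_trend_ar1(y, t)
    X = np.column_stack([np.ones_like(t), t])
    slopes = [fit_trend_ar1(X @ beta + gen_ar1(len(t), phi, sigma, rng), t)[0][1]
              for _ in range(n_boot)]
    return beta[1], np.percentile(slopes, [2.5, 97.5])

# Synthetic record of 136 "years" with a known linear trend
rng = np.random.default_rng(2)
t = np.arange(136, dtype=float)
y = 0.01 * t + gen_ar1(136, 0.5, 0.1, rng)
slope_hat, ci = parametric_bootstrap_slope(y, t)
```

The final line produces a bootstrap percentile interval for the slope of the kind reported in the text (there for the parameters of the physically motivated trend model rather than a linear time trend).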

When applied to our model of the historical temperature record, the parametric bootstrap distribution shows, unsurprisingly, that in a relatively short time series and given a smooth past trajectory of forcings, it is difficult to distinguish between a climate with both a high sensitivity (large value of

Distribution of the parametric bootstrap estimates of

In the following, we will represent uncertainties using the simple bootstrap
percentile method. The percentile method is subject to criticism (e.g.,

When using the full 1880–2015 global mean surface temperature record, the
point estimate for the centennial-scale sensitivity to anthropogenic forcing
is

Using our statistical model, the historical data appear to provide a lower
bound for

Parametric bootstrap percentile intervals for the sensitivities in
model (

The IPCC's own 66 % “likely” interval for equilibrium sensitivity is

The main source of uncertainty in the upper bound for

The uncertainties in the sensitivity and rate of response parameters imply
greater uncertainties in projected longer-term future trends in global mean
temperature than in the historical and near-term projected trends. To
illustrate this, we examine the implied future trends under the hypothetical
(extended) RCP8.5 scenario, in which radiative forcing increases and then
stabilizes in the year 2150. We simulate new time series using our estimates
of model (

Projected mean temperature anomalies, and their uncertainties, under
the RCP8.5 scenario, based on estimates from model (

Projected mean temperatures, and especially their associated uncertainties,
continue to increase even after stabilization of forcing. This is a
consequence of the joint uncertainty in

On the other hand, trends in the historical and near-term response are much
more certain. The observations strongly suggest that mean temperatures
increased in the 20th century; for example, the (2.5–97.5)th percentile
interval for the mean response in the year 2000 (expressed compared to the
1951–1980 average) is well above zero at (0.4,0.6)

We have shown that the short historical temperature record alone produces
fairly uncertain estimates of the sensitivity parameter,

Evaluation of how uncertainty in the sensitivity parameter,

These estimates could be more strongly constrained by using additional
physical information. As discussed previously, the very high sensitivity
estimates in the bootstrap distribution are cases where the estimated
response time is unphysically long. Without external information about this
timescale, however, long data records are required to rule out the large
values of

One of the complicating factors in estimating trends in climate time series
is the question of whether global mean temperatures exhibit

The evidence for long memory, however, strongly depends on the assumed trend
model. Many of the aforementioned authors draw their conclusions by assuming
a linear time trend model and applying that model to the temperature record
on durations of decades to over a century. (One notable exception is

Raw periodograms of residuals from models (

The question of long memory cannot be definitively settled using a dataset of
only 136 observations, and other analyses make use of longer climate model
runs or the paleoclimate record (e.g.,

The analysis thus far has assumed that both radiative forcings and temperatures are known exactly, but uncertainty in the sensitivity and in trends also propagates from uncertainty in these quantities. We therefore discuss at least roughly the potential implications of imperfect knowledge of these inputs.

Of the two factors, uncertainty in radiative forcings, particularly from
aerosols, is more consequential, especially for the inferred lower bound of
the sensitivity parameter from model (

Uncertainties in the global mean surface temperature record are comparatively
less important. To partially address this issue, we re-estimate
model (

Some of the uncertainties discussed so far could be addressed in a Bayesian
framework (as in, e.g.,

In Sects.

For the purposes of this illustration, we will use not the actual temperature
record but rather some simple synthetic examples. We consider several
artificial, trendless time series (the true mean of the process is constant)
with temporally correlated noise, and evaluate the results of testing for a
linear time trend (i.e., fitting model Eq.
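The flavor of this experiment can be reproduced with a minimal sketch: generate trendless AR(1) series of the same length as the observed record and test for a linear time trend as if the errors were independent. The AR(1) parameter below is an arbitrary illustration, not one of the values used in our experiments.

```python
import numpy as np

def naive_trend_t(x):
    """OLS slope t statistic computed as if the errors were iid."""
    n = len(x)
    t = np.arange(n, dtype=float)
    t -= t.mean()
    slope = np.sum(t * x) / np.sum(t ** 2)
    resid = x - x.mean() - slope * t
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / np.sum(t ** 2))
    return slope / se

# Trendless AR(1) series of the same length as the 1880-2015 record:
# the naive iid test rejects "no trend" far more often than the
# nominal 5 % level.
rng = np.random.default_rng(3)
n, phi, n_rep = 136, 0.6, 1000
rejections = 0
for _ in range(n_rep):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    if abs(naive_trend_t(x)) > 1.98:  # two-sided ~5 % iid critical value
        rejections += 1
rate = rejections / n_rep
```

The inflated rejection rate illustrates why some accounting for temporal correlation, parametric or otherwise, is unavoidable when testing for trends.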

We consider a few parametric approaches common in time series analysis. The
typical practice is to assume that the noise follows a low-order model, such
as an ARMA model. Ideally, the noise model would be chosen (as in
Sect.

We also evaluate the perhaps most typical nonparametric method for accounting
for dependence in a time series, the block bootstrap

While the block bootstrap works very well in some settings, the procedure is not free of assumptions. Like other variants of the nonparametric bootstrap, its justification rests on an asymptotic argument: for the block bootstrap to work well, the block length must be small compared to the overall length of the data but large compared to the scale of temporal correlation in the data. When the overall data record is short and internal variability is substantially positively correlated in time, as for the historical global mean temperature record, these two requirements may not both be met, and we should not expect the block bootstrap to perform well.
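For reference, a minimal moving-block bootstrap for the mean of a series looks like the following; the block length and synthetic data are arbitrary illustrations rather than choices from our experiments.

```python
import numpy as np

def block_bootstrap_mean_ci(x, block_len, n_boot=2000, seed=0):
    """Moving-block bootstrap percentile interval for the mean:
    resample overlapping blocks of length block_len, concatenate
    them to the original length, and record each pseudo-series mean."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    n_blocks = -(-n // block_len)  # ceiling division
    means = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block_len + 1, size=n_blocks)
        pseudo = np.concatenate([x[s:s + block_len] for s in starts])[:n]
        means[b] = pseudo.mean()
    return np.percentile(means, [2.5, 97.5])

# Weak dependence and a generous record length are the friendly case;
# the difficulty arises when the correlation scale is not small
# relative to the record, as for the historical temperature series.
rng = np.random.default_rng(4)
x = rng.standard_normal(500)
lo, hi = block_bootstrap_mean_ci(x, block_len=10)
```

Resampling blocks rather than individual observations preserves dependence within blocks, but dependence across block boundaries is broken, which is why the block length must dominate the correlation scale.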

In the following, we compare five methods (four parametric and one
nonparametric) for generating nominal

After generating nominal

We first compare the performance of the five methods for generating nominal

Quantile–quantile plots comparing the distribution of nominal

It may, on the other hand, be surprising that automatically chosen parametric
methods and nonparametric methods (blind selection via AICc and the block
bootstrap; Fig.

In actual practice, it can be advantageous, as we already discussed, to
choose a noise model not automatically but in consultation with diagnostics
(such as by comparing theoretical spectral densities with the empirical
periodogram). In Sect.

The comparisons in the previous section were too favorable to the
pre-specified parametric methods because the order of the specified noise
model (an AR(1) model) was known to be correct. Now we compare these methods
when the assumed noise model is misspecified. The performance of misspecified
methods will depend in particular on how the misspecified model represents
low- vs. high-frequency variations in the noise process. Models that
underestimate low-frequency variability will tend to be anti-conservative for
estimating uncertainties in smooth trends, whereas those that overestimate
low-frequency variability will tend to be conservative. We therefore repeat
the illustrations in the previous section generating the synthetic time
series from two different noise models (but still using the pre-specified
AR(1) model to generate nominal

The power spectra associated with the two models from which we
generate synthetic time series in Sect.

Same as Fig.

First, we consider an ARMA(1,1) process, whose best AR(1) approximation
over-represents low-frequency variability (Fig.
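One simple way to see this effect is to compare the theoretical spectra directly. The sketch below forms an AR(1) approximation by matching the variance and lag-1 autocorrelation of an ARMA(1,1) process; the parameter values are hypothetical, not those estimated in our study, but are chosen so that the approximation over-represents power at frequency zero, the conservative direction for trend uncertainty.

```python
import numpy as np

def arma11_spectrum(omega, phi, theta, sigma2=1.0):
    """Spectral density of the ARMA(1,1) process
    x_t = phi * x_{t-1} + eps_t + theta * eps_{t-1}."""
    num = 1.0 + 2.0 * theta * np.cos(omega) + theta ** 2
    den = 1.0 - 2.0 * phi * np.cos(omega) + phi ** 2
    return sigma2 / (2.0 * np.pi) * num / den

def best_ar1_approx(phi, theta, sigma2=1.0):
    """AR(1) approximation matching the variance and lag-1
    autocorrelation of the ARMA(1,1) process."""
    rho1 = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta ** 2)
    gamma0 = sigma2 * (1 + 2 * phi * theta + theta ** 2) / (1 - phi ** 2)
    phi_a = rho1
    sigma2_a = gamma0 * (1 - phi_a ** 2)  # matched innovation variance
    return phi_a, sigma2_a

def ar1_spectrum(omega, phi, sigma2):
    """Spectral density of an AR(1) process."""
    return sigma2 / (2.0 * np.pi) / (1.0 - 2.0 * phi * np.cos(omega) + phi ** 2)

# Hypothetical parameters: the matched AR(1) has more power at
# frequency zero than the ARMA(1,1) it approximates.
phi, theta = 0.9, 0.5
phi_a, sigma2_a = best_ar1_approx(phi, theta)
f_arma0 = arma11_spectrum(0.0, phi, theta)
f_ar0 = ar1_spectrum(0.0, phi_a, sigma2_a)
```

Since uncertainty in smooth trends is governed by the spectrum near frequency zero, an approximation that over-represents low-frequency power yields conservative (wider) trend intervals.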

Second, we consider a fractionally integrated AR(1) process; because this is
a long-memory process, the best AR(1) approximation (and indeed any ARMA
model) will severely under-represent low-frequency variability
(Fig.

Same as Fig.

These results confirm that approaches to representing noise that appear to weaken assumptions are not guaranteed to outperform even misspecified parametric models. Misspecified parametric models are most dangerous when low-frequency variability is under-represented, but methods like the block bootstrap will also have the most trouble when low-frequency variability is strong because very long blocks will be required to adequately capture the scale of dependence in the data. While it is crucial for the data analyst to scrutinize any assumed parametric model, we believe that in many settings when the time series is not very long relative to the scale of correlation, one will be better served by carefully choosing a low-order parametric model rather than resorting to nonparametric methods.

Note that this illustration uses synthetic simulations that are relatively
strongly correlated in time, a feature of the global mean temperature record.
Nonparametric methods can work better than illustrated here in settings where
correlations are weaker. For example,

We have sought to show here that targeted parametric modeling of global mean temperature trends and internal variability can provide more informative and accurate analyses of the global mean temperature record than can more empirical methods. Since all analyses involve assumptions, it is important to consider the role that assumptions play in resulting conclusions. In the setting of analyzing historical global mean temperatures, where the data record is relatively short and temporal correlation is relatively strong, ostensibly more empirical methods can fail to distinguish between systematic trends and internal variability, and can give seriously uncalibrated estimates of uncertainty. While linear-in-time models can be used for some purposes when applied to moderately narrow time frames (and with careful uncertainty quantification), the demonstrations shown here suggest that they do not have an intrinsic advantage over more targeted analyses. Targeted analyses can be used over longer time frames – allowing for better estimates of both trends and noise characteristics – and can address a broader range of questions within a single framework.

The model we use in our analysis provides insights about the information contained in the historical temperature record relevant to both shorter-term and longer-term trend projections. The limited historical record of global mean temperature can provide information about shorter-term trends but unsurprisingly cannot constrain long-term projections very well. The past 136 years of temperatures simply do not alone contain the relevant information about equilibration timescales that would be required to constrain long-term projections, especially when aggregated to a single global value. (Use of spatially disaggregated data may provide additional information.) The distinction between uncertainties in shorter-term and longer-term projections, itself not easily made using a time trend model, serves to further illustrate that while the historical data record is an important source of information, it alone cannot be expected to answer the most important questions about climate change without bringing more scientific information to bear on the problem.

We believe that our discussion is illustrative of broader issues that arise
in applied statistical practice, and will have particular relevance to
problems involving trend estimation in the presence of temporally correlated
data and in relatively data-limited settings, common in climate applications.
We suspect that many applied statisticians have personally felt the tension
between targeted modeling on the one hand and more empirical analyses on the
other. One lucid discussion of the broader issues surrounding this tension
can be found in the discussion of model formulation in

The NASA GISS Land–Ocean Temperature
index is updated periodically; the data we analyze were accessed on the date
2016-02-03. The current version is available at

Historical radiative forcings until 2011 are available in

In this paper, we use the historical temperature record to estimate the trend
model (

Model (

We can compare the results of our model to reported results from GCMs by
estimating the

We can also compare our estimate of

A distinguishing feature of this other, common approach is that it includes data about historical heat uptake, in addition to temperature and radiative forcing data. Ocean heat content is an additional, albeit uncertain, source of information that may improve estimates. On the other hand, since these methods do not involve an explicit trend model and require averaging the inputs over decadal or longer time spans, they cannot use the historical temperature record to estimate internal temperature variability. Most studies therefore estimate internal variability using climate model output, but climate models represent even present-day variability in global annual mean temperature only imperfectly.

By contrast, an advantage of our approach is that it allows one to use the
historical data to understand internal variability. Additionally, our
approach allows one to answer questions about both historical trends and
longer-term projections in the framework of one statistical model, whereas
the approaches discussed above do not allow one to infer trends in
increasing-in-time forcing scenarios. A disadvantage of our approach is that,
as discussed above, we rely on the historical global mean temperature record
to estimate the “equilibration” timescales (

Regardless of the different advantages and disadvantages just discussed, both
approaches to using the historical temperature record give similar results
concerning the sensitivity parameter, and uncertainties in this parameter
remain high. This demonstrates the limitations of the information content of
the historical global mean temperature record alone for estimating
longer-term projections of mean temperature changes. As noted in
Sect.

Comparison of estimates of a sensitivity
parameter from studies that use observational data and a simple energy
balance approach. The large best (median) estimate from

The authors declare that they have no conflict of interest.

The authors thank Jonah Bloch-Johnson, Malte Jansen, Cristian Proistosescu, and Kate Marvel for helpful conversations and comments related to parts of this work. We additionally thank the reviewers of this paper, whose suggestions led to a number of improvements. This work was supported in part by STATMOS, the Research Network for Statistical Methods for Atmospheric and Oceanic Sciences (NSF-DMS awards 1106862, 1106974 and 1107046), and RDCEP, the University of Chicago Center for Robust Decision-making in Climate and Energy Policy (NSF grant SES-0951576). We thank NASA GISS, NOAA, the Hadley Centre, the IPCC, and IIASA for the use of their publicly available data. We acknowledge the University of Chicago Research Computing Center, whose resources were used in the completion of this work.

Edited by: C. Forest
Reviewed by: three anonymous referees