Estimates of climate system properties incorporating recent climate change

. Historical time series of surface temperature and ocean heat content changes are commonly used metrics to diagnose climate change and estimate properties of the climate system. We show that recent trends, namely the slowing of surface temperature rise at the beginning of the 21st century and the acceleration of heat stored in the deep ocean, have a substantial impact on these estimates. Using the Massachusetts Institute of Technology Earth System Model (MESM), we vary three model parameters that inﬂuence the behavior of the climate system: effective climate sensitivity (ECS), the effective ocean diffusivity of heat anomalies by all mixing processes ( K v ), and the net anthropogenic aerosol forcing scaling factor. Each model run is compared to observed changes in decadal mean surface temperature anomalies and the trend in global mean ocean heat content change to derive a joint probability distribution function for the model parameters. Marginal distributions for individual parameters are found by integrating over the other two parameters. To investigate how the inclusion of recent temperature changes affects our estimates, we systematically include additional data by choosing periods that end in 1990, 2000, and 2010. We ﬁnd that estimates of ECS increase in response to rising global surface temperatures when data beyond 1990 are included, but due to the slowdown of surface temperature rise in the early 21st century, estimates when using data up to 2000 are greater than when data up to 2010 are used. We also show that estimates of K v increase in response to the acceleration of heat stored in the ocean as data beyond 1990 are included. Further, we highlight how including spatial patterns of surface temperature change modiﬁes the estimates. We show that including latitudinal structure in the climate change signal impacts properties with spatial dependence, namely the aerosol forcing pattern, more than properties deﬁned for the global mean, climate sensitivity, and ocean diffusivity.


Introduction
Scientists, policy makers, and the general public are concerned with how surface temperature will change in the coming decades and further into the future. These changes depend on many aspects of the climate system. Among them are climate sensitivity and the rate at which heat is mixed into the deep ocean. Equilibrium climate sensitivity (ECS) represents the global mean surface temperature change that would be realized due to a doubling of CO 2 concentrations after equilibrium is reached. A shorter-term measure of climate sensitivity to greenhouse gas forcing is transient climate response (TCR), defined as the global mean surface temperature change at the time of CO 2 doubling in response to CO 2 concentrations increasing at a rate of 1 % per year (Bindoff et al., 2013). Due to the climate system not being in equilibrium, interactions between the surface and the ocean lead to an exchange of energy. In such a scenario, TCR is a function of both the climate sensitivity and ocean circulation and mixing (Sokolov et al., 2003;Andrews and Allen, 2008).
The value of climate sensitivity is uncertain but the processes and feedbacks which set it must be accurately modeled to reliably predict the future. To this end, a number of studies have used Earth System Models of Intermediate Complexity (EMICs) to estimate probability distribution functions (PDFs) for the values of these climate system properties, in particular ECS, ocean diffusivity, and an estimate of the anthropogenic aerosol forcing (Forest et al., 2002(Forest et al., , 2008Knutti et al., 2003;Tomassini et al., 2007;Olson et al., 2012;Aldrin et al., 2012;Libardoni and Forest, 2013, and others). In these studies, EMICs are run for many combinations of the model parameters that set the climate system properties. Model output is then compared to historical temperature change to determine which model states best match the past.
Time series of surface temperature and ocean heat content are commonly used temperature diagnostics in the evaluation of model performance because they rule out different combinations of the parameters for being inconsistent with the observed climate record (Urban and Keller, 2009). This helps to narrow the estimates of the parameters because only certain combinations lead to accurate representations of the past. Observations in the early 21st century showed that the rate of increase in global mean surface temperature slowed despite the continued rise of global CO 2 concentrations (Trenberth and Fasullo, 2013). This slowdown was the source of debate as to whether climate change was a significant threat and led scientists to search for the reasons why temperatures did not rise as much as expected. Cowtan and Way (2014) and Karl et al. (2015) argue that the slowdown was merely an artifact of the global observing system and the result of incomplete coverage in the polar regions where temperatures increase most rapidly. The slowdown was also attributed to changes in the radiative forcing. In particular, it is argued that the forcing due to the Sun, anthropogenic aerosols, and volcanoes all contributed to reduce global mean temperature in the 2000s (Huber and Knutti, 2014;Schmidt et al., 2014). Natural variability in the ocean has also been noted as a potential cause of the slowdown (Meehl et al., 2011;Huber and Knutti, 2014;Schmidt et al., 2014). In particular, Meehl et al. (2011) show that in a fully coupled, three-dimensional climate model, periods of little to no rise in surface temperatures are associated with enhanced mixing of heat below 300 m in the ocean. This finding is supported by recent observations showing that heat is accumulating more rapidly in the deep ocean (Levitus et al., 2012;Gleckler et al., 2016). Any good model simulation should be able to capture these features of the past.
In this study, we first seek to improve the methods used in previous work (Forest et al., 2008;Libardoni and Forest, 2013;Libardoni et al., 2018a). Until now, ensembles from different versions of the MIT Integrated Global Systems Model (IGSM, Sokolov et al., 2005) have been used to vary model parameters for ECS, ocean diffusivity, and the net anthropogenic aerosol scaling factor using a gridded sampling strategy. To derive PDFs for the model parameters, metrics of model performance at parameter settings in between those where the model was run are estimated using two-dimensional interpolation algorithms. These algorithms are restricted to gridded samples and at times have led to PDFs that are not smooth. We propose and implement a new method where spline interpolations are replaced with a radial basis function interpolation algorithm. We show that the new method leads to PDFs that are both true to the data and smooth by using the 1800-member ensemble of the MIT Earth System Model (MESM, Sokolov et al., 2018) described in Libardoni et al. (2018a) to derive PDFs for the three model parameters.
Using the updated methodology and the 1800 MESM runs, we answer the following questions: (1) how does the inclusion of more recent data change the PDFs of model parameters? And (2) what do we learn by including spatial information in the surface diagnostic? The inclusion of recent temperature trends can have a significant impact on the estimates of climate system properties (Urban et al., 2014;Johansson et al., 2015). The temperature pattern that the model output is compared against becomes more detailed as data are added and leads to the rejection of more model runs as being inconsistent with the observed records. This generally leads to both a shift in the estimation of a given property and a reduction in the uncertainty in the estimate. Urban et al. (2014) also showed that the ability to distinguish between different states of the climate increases as the length of the model diagnostic increases. Similar to Johansson et al. (2015), we identify the influence of including more recent data by systematically adding data to the time series.
Second, we show how including spatial variability in the surface temperature diagnostic can influence the parameter distributions. In almost all parameter estimation studies, global mean ocean heat content is used as one metric to evaluate model performance and is paired with a surface temperature diagnostic to further test the model runs. Typically, groups use time series of either global mean surface temperature (Knutti et al., 2002;Tomassini et al., 2007;Knutti and Tomassini, 2008;Urban and Keller, 2009;Olson et al., 2012) or hemispheric mean surface temperatures (Andronova and Schlesinger, 2001;Meinshausen et al., 2009;Aldrin et al., 2012;Skeie et al., 2014) as the surface diagnostic. Given the latitudinal resolution of MESM, we can estimate zonal temperature patterns beyond global and hemispheric means. In particular, we use a surface temperature diagnostic that consists of four equal-area zonal bands, allowing the observed amplification of polar warming to be included in the evaluation of model performance. We show the impact of the spatial structure of the surface diagnostic by deriving PDFs using global mean, hemispheric mean, and four zonal mean temperature diagnostics.
In Sect. 2, we introduce the general method for estimating the probability distributions for the model parameters, describe the temperature diagnostics, and introduce an interpolation method for the likelihood function using radial basis functions. We present our main findings in Sect. 3 and finish with a summary and conclusions in Sect. 4.

Methods
As outlined in Sect. 1, we propose and implement a number of methodological changes designed to improve our estimates of the probability distributions of the model parameters. Here, we first provide a general overview of our method for deriving the distributions, including a description of the model diagnostics and their derivation. We follow with a discussion of the new methods used in this study and how they are applied to deriving the new distributions.
Following a standard methodology (Forest et al., 2006(Forest et al., , 2008Libardoni and Forest, 2011;Olson et al., 2012), we derive probability distributions for the model parameters. In this method, EMICs are used to run simulations of historical climate change. By comparing model output to observations, the likelihood that a run with a given set of parameters represents the climate system is determined by how well it simulates the past climate. In this study, we use the MESM, which includes three adjustable parameters that set properties that strongly influence the behavior of the climate system. These model parameters are the cloud feedback parameter, which sets the effective climate sensitivity (ECS), the effective ocean diffusivity of heat anomalies by all mixing processes (K v ), and the net anthropogenic aerosol forcing scaling factor (F aer ). We identify each run by a unique combination of the model parameters, θ , where θ = (ECS, K v , F aer ). In this study, we take the 1800-member ensemble described in Libardoni et al. (2018a), spanning a wide range of θ s, as our model output.
We evaluate model performance by comparing each model run to two temperature diagnostics. The first diagnostic is the time series of decadal mean surface temperature anomalies in four equal-area zonal bands spanning 0-30 and 30-90 • latitude in each hemisphere. Temperature anomalies are calculated with respect to a chosen base period. The second diagnostic is the linear trend in global mean ocean heat content in the 0-2000 m layer. For each diagnostic, we now describe the data used for observations and the methods to derive the diagnostics from the observations.
For surface observations, we use datasets from four different research centers. The datasets we use include the median of the 100-member HadCRUT4 ensemble from the Hadley Centre Climatic Research Unit (Morice et al., 2012), the Merged Land-Ocean Temperature (MLOST) dataset from NOAA (Vose et al., 2012), the Berkeley Earth Surface Temperature (BEST) dataset (Rohde et al., 2013), and the GIS-TEMP dataset with 250 km smoothing (GISTEMP250) from the NASA Goddard Institute for Space Studies (Hansen et al., 2010). All datasets are given as monthly temperature anomalies on a 5×5 • latitude-longitude grid. The datasets use similar station data over land but differ on which sea surface temperature (SST) dataset is used for the ocean. In particular, the HadCRUT4 and BEST datasets use the Hadley Centre SST (HadSST) dataset (Kennedy et al., 2011a, b) and the MLOST and GISTEMP250 datasets use the Extended Reconstruction Sea Surface Temperature (ERSST) dataset . Furthermore, the base period used to calculate temperature anomalies differs among the datasets. A 1951-1980 base period is used for BEST andGISTEMP250, a 1961-1990 base period is used for HadCRUT4, and a 1971-2000 base period is used for MLOST. Lastly, the research centers differ in how they fill in sparse data regions.
We derive the surface temperature diagnostic by temporally and spatially averaging the gridded data. In the following calculation, we assume uncertainty in the observations is zero, relying on using multiple datasets to account for uncertainty in the observed record. Due to data scarcity and missing values in some regions, we set threshold criteria for each spatial and temporal average in the derivation. First, the annual mean for each 5 × 5 • grid box is calculated, provided that at least 8 months of the year have non-missing data. From these annual averages, decadal mean time series are calculated for both the period being used in the diagnostic and the chosen climatological base period. For these calculations, we require at least 8 years of defined data for a decadal mean to be defined. We also extract from the annual mean time series a data mask indicating where observations are present or missing. We use this mask on the model output to match the coverage of the observations.
Once the data mask and decadal mean time series are calculated, each time series is zonally averaged on the 5 • grid. The zonal mean is marked as undefined if there is less than 20 % longitudinal coverage in a given latitude band. We calculate temperature anomalies for each zonal band by subtracting the mean of the climatological time series for the given band from each decade of the comparison period time series. The resulting time series of decadal mean, 5 • reso-22 A. G. Libardoni et al.: Estimates incorporating recent change lution temperature anomalies are then averaged into the four equal-area zones. When aggregating to larger areas, the mean is calculated as the area-weighted average of the zonal bands contained within the larger zone.
For ocean heat content observations, we use the estimated global mean ocean heat content in the 0-2000 m layer from Levitus et al. (2012). This dataset replaces the Levitus et al. (2005) 0-3000 m global mean dataset because the latter ends in 1998 and we aim to extend the diagnostic into the 21st century. Data are presented as heat content anomalies in 5year running means, starting with the 1955-1959 pentad and ending in the 2011-2015 pentad. Also included in the Levitus et al. (2012) data is a time series of the standard error of the pentadal mean estimate for the global mean heat content. The procedure for deriving the standard error estimates is described in the study's Supplement and is based on the observational error estimates of the 1 • gridded data.
For a given diagnostic period, we calculate the linear trend in the global mean ocean heat content as the slope of the bestfit linear regression line. In the calculation of the regression line, all deviations from the mean are assigned a weight inversely proportional to the square of the standard error from the Levitus et al. (2012) data at that point in the time series. For example, the standard deviation of y from the mean, is modified by multiplying each term in the summation by its weight, giving the weighted standard deviation of y from the mean of where w i is the weight assigned to each point y i based off of the observational error estimate. All summation terms in the regression are replaced by the corresponding weighted version. By doing so, the regression is weighed more towards portions of the time series for which the standard error of the observations is small. Because observational errors decrease in latter years, more recent observations have a stronger influence on the trend estimate. Each model run is compared to the model diagnostics and evaluated through the use of a goodness-of-fit statistic, where x(θ ) and y are vectors of model output for a given set of model parameters and observed data, respectively, and C −1 N is the inverse of the noise-covariance matrix. The noisecovariance matrix is an estimate of the internal variability of the climate system and represents the temperature patterns we would expect in the absence of external forcings. We estimate the noise-covariance matrix by drawing samples of the temperature diagnostics from the control run of fully coupled general circulation climate models and calculating the covariance across the samples. Prior to this study, separate models were used for the surface and ocean diagnostics, potentially yielding inconsistent variability estimates. We eliminate that issue by using the Community Climate System Model, version 4 (CCSM4, Gent et al., 2011) to estimate the natural variability for both the surface and ocean diagnostics. In its simplest form, the r 2 statistic is the weighted sum of squares residual between the model simulation and the observed pattern. Multiplying x(θ )-y by the noise-covariance matrix rotates the patterns into the coordinate space of the natural variability and scales the differences such that r 2 is the sum of independent normals. The noise-covariance matrix is thus a pre-whitener of the residuals.
From the r 2 field, we calculate the difference between r 2 at an arbitrary point and the minimum r 2 value in the domain. The run with minimum r 2 represents the model run with parameters θ that best matches the observed record. r 2 gives a measure of how much an arbitrary run differs from the model run that produces the best fit to the observations. Whereas regions with large r 2 indicate θ s that do not simulate the particular diagnostic well, regions with small r 2 indicate θ s that simulate the particular diagnostic comparably to the minimum. Regions of high (low) r 2 can (cannot) be rejected for being inconsistent with the observed climate record.
Because of the pre-whitening by the noise-covariance matrix, r 2 is known to follow an F distribution (see Forest et al., 2001, for a complete derivation and discussion). Knowing the distribution of r 2 provides the link between the goodness-of-fit statistics and the final PDFs. Through this connection, we convert r 2 to probability distribution functions for the model parameters using the likelihood function based on an F distribution described in Libardoni and Forest (2011) and modified by Lewis (2013). Through an application of Bayes' theorem (Bayes, 1763), we combine the likelihoods from each diagnostic and a prior on the model parameters to estimate the joint PDF. We apply the expert prior derived in Webster and Sokolov (2000) to ECS and uniform priors to K v and F aer . Probability distributions for individual parameters are calculated by integrating the joint PDF over the other two parameter dimensions.
Prior to calculating the likelihood function, we interpolate the goodness-of-fit statistics onto a finer grid in the parameter space. This interpolation fills in the gaps between θ s where the model was run and increases the density of points within the domain. Forest et al. (2006) presented an interpolation method that was implemented in Libardoni and Forest (2011). The interpolation is first carried out on ECS-

√
K v planes via a spline interpolation on all F aer levels to a finer mesh of points. A second set of spline interpolations at every

√
K v point on the fine mesh then fills in the fine grid in the F aer dimension.
In this study, we implement an alternate interpolation method based off of radial basis functions (RBFs, Powell, 1977). The RBF method approximates the value of a function based off of a set of node points where the functional value is known and is a variation of kriging that does not allow the data to inform the internal parameters of the algorithm. The function value at any point in the domain is calculated as the weighted sum of the value at all nearby node points. The weight assigned to each node is related to the radial distance between the location that is being interpolated to and the node. We view this method as an improvement because it is a three-dimensional method and does not require multiple steps. We will also show in Sect. 3.1 that this leads to a smoother interpolation surface.
For our implementation, we use the 1800 r 2 values at the points θ where the model has been run as nodes. For node points, we have sampled ECS from 0.5 to 10.0 • C in increments of 0.5 • C,

√
K v from 0 to 8 cm s −1/2 in increments of 1 cm s −1/2 ), and F aer from −1.75 to 0.5 W m −2 in increments of 0.25 W m −2 . We interpolate the r 2 values from the θ s of the node points to the fine grid used in the spline interpolation method. In particular, we interpolate r 2 values for ECS between 0.5 and 10.5 • C in increments of 0.1 • C,

√
K v between 0 and 8 cm s −1/2 in increments of 0.1 cm s −1/2 , and F aer between −1.75 and 0.5 W m −2 in increments of 0.05 W m −2 . For weights, we choose Gaussian basis functions, with the weight assigned to each node given by where φ is the weight, d is the radial distance between the two points, and is a scaling parameter that determines how quickly the weight decreases with distance. Typically, RBFs are calculated in physical space, where the distance between points, d, is well defined. However, in this application, we need to apply the concept of distance in model parameter space. Because the spacing between nodes in each dimension of the parameter space is different, we normalize all distances by the range in a given parameter dimension. We recognize that this choice of normalization constant is arbitrary and in the future should be determined by a physical metric. Once normalized, we treat each parameter dimension as isometric, so that the distance between two points is represented by where subscript i refers to the interpolated point, subscript n refers to the node points, and the normalization constants are 9.5 F aer . Because the distance between any two points in the parameter space is always the same, the choice of plays a critical role in determining the behavior of the algorithm. We demonstrate this by showing the weights for six different values as a function of normalized distance (Fig. 1). Small values of lead to a slow decay and large values of lead to a rapid decay of the weighting function. The choices of are described in Appendices A and B. The weighting function is applied to each node point within the parameter space. One can imagine a sphere surrounding each of these points, with the weight assigned to that point decaying as a function of the distance from the center. All points within the parameter space are in regions where the spheres from multiple node points overlap. The interpolated value at any point is the weighted sum of the node values associated with the overlapping spheres. Thus, we calculate the r 2 value at any point in the domain as where the sum is over all N = 1800 node values. When calculating the sum, all 1800 node values are considered, but the weights from those far away in parameter space are close to zero and do not contribute to the sum. In summary, we have made a number of changes and updates to the methodology. (i) To account for a change in observational dataset, we have modified the ocean diagnostic to be estimated from the 0-2000 m layer, as opposed to the 0-3000 m layer. (ii) We now estimate the natural variability from a common model, as opposed to using different models for the surface and ocean diagnostics. (iii) We implement a new interpolation scheme where radial basis functions are used to interpolate goodness-of-fit statistics from the coarse A. G. Libardoni et al.: Estimates incorporating recent change grid of model runs to the fine grid used to derive the joint probability distribution functions.
Using the updated methodology, we show how temporal and spatial information impacts the PDFs of the model parameters. We address the temporal component by adding more recent data to the model diagnostics in one of two ways. First, we extend the diagnostics by fixing the starting date while shifting the end date forward in time. To maximize the amount of data that we use in the surface diagnostic while also ensuring good observational data coverage, we take decadal mean temperature anomalies with respect to the 1906-1995 base period starting in 1941. We then shift the end date from 1990 to 2000 to 2010 to change the diagnostics from 5 to 6 to 7 decades, respectively. For the ocean diagnostic, we choose 1955 as the starting date of the first pentad to correspond to the beginning of the observational dataset. Similar to the surface diagnostic, we increase the length of the ocean diagnostic by changing the end date of the last pentad from 1990 to 2000 to 2010.
In a second test, we fix the length of the diagnostics while shifting the end date forward in time. This maintains a 5decade diagnostic for the surface diagnostic by shifting the 50-year window from 1941-1990 to 1951-2000 to 1961-2010 and a 35-year ocean diagnostic by shifting the period we use to estimate the linear trend from 1955-1990 to 1965-2000 to 1975-2010. By deriving PDFs with each pair of diagnostics corresponding to a given end date, we determine the impact of recent temperature trends on the parameter distributions in both the extension and sliding window cases.
In a third test, we derive PDFs with different structures for the surface diagnostic. In these new diagnostics, we maintain the decadal mean temporal structure but reduce the dimensionality of the spatial structure by replacing the four zonal bands with global mean or hemispheric mean temperatures. In the former case, we have a one-dimensional spatial structure, and in the latter a two-dimensional structure.

Results
We present our findings as follows. In Sect. 3.1 we (i) show the difference in the ocean diagnostic due to changing to the 0-2000 m data, (ii) provide justification for using the RBF interpolation method, and (iii) present the impact of the methodological changes described in Sect. 2 on the parameter distributions. In Sect. 3.2, we (i) analyze how the model diagnostics change due to the inclusion of more recent data and (ii) assess how those changes impact the distributions. In Sect. 3.3, we show how including spatial patterns of surface temperature change impact the distributions.

Methodological changes
We first identify the difference in the ocean diagnostic derived from the 0-3000 and 0-2000 m layers for the common period of 1955-1996 (Fig. 2). This period is chosen to coin-1960 1970 1980 1990 Year cide with the ocean diagnostic in Libardoni and Forest (2013) and allows for a direct comparison of distributions presented later in this section. We observe a stronger warming trend of 3.6 ± 0.50 ZJ yr −1 in the 0-2000 m layer compared to the estimate of 2.7±0.39 ZJ yr −1 in the 0-3000 m layer, suggesting that the rate of heat penetration into the deep ocean decreases with depth. Second, we demonstrate the impact of switching to the RBF algorithm. For one of our surface temperature diagnostics, we interpolate the r 2 values using each of the six values presented in Sect. 2. We show the resulting r 2 patterns and compare them against the surface derived using the Forest et al. (2006) spline interpolation method and the original pattern (Fig. 3). We observe that the old method is very successful at matching the r 2 values at points where they were run (Fig. 3b). However, the surfaces are not always smooth and in some instances the location of the minimum value of r 2 shifts to a new, nearby location in the interpolated space.
We aim to improve upon the shortcomings of the old interpolation method by identifying so that not only is the spatial pattern of r 2 maintained, but the resulting response surface is also smooth. We observe smoother interpolated surfaces for lower values of because of the relationship between and the radius of influence of each node point ( Fig. 3c-h). Because we do not require the interpolated values to pass exactly through the node points, the smoothness comes at the expense of increasing the interpolation error at the node points. Unlike the old interpolation method, the errors at node points do not lead to a change in the rank order of r 2 values at the node points, however. The location of the minimum remains the same, as well as all subsequent comparisons.
We also observe a reduction in the range of r 2 values within the domain. The reduction occurs because regions where r 2 is originally low are now influenced by areas further away in the parameter space where r 2 is high, and vice versa. This is true of the algorithm in general, with the errors at each node point and the reduction of the range diminishing as increases and the radius of influence of each node point decreases. However, as increases and the radius of influence for a given node decreases, the response surface becomes less smooth. Thus, there is a tradeoff, in that decreasing the interpolation error at node points leads to a decrease in the smoothness of the surface. Small s provide the desired smoothness, while large s provide the truest fit to the actual values at the node points. This indicates that intermediate values of (e.g., 10.8 or 14.4) are appropriate.
Thus far, we have only investigated the impact of on the fit of the interpolated r 2 values to the raw values. As outlined in Sect. 2, inference on the model parameters is based  Fig. 3, except for r 2 , the difference between r 2 at a given point and the minimum r 2 value in the domain. This represents the difference between r 2 at an arbitrary point and that of the best fit of the model to the observations. on r 2 , the difference between r 2 at an arbitrary point in the parameter space and the minimum within the domain. Plotting the r 2 field as a function of confirms our assessment that intermediate values of lead to the best fit to the raw values (Fig. 4). Both = 10.8 and = 14.4 fit the raw r 2 values quite well as the inflation of low r 2 values is normalized by the subtraction of the minimum value (which is also interpolated to a greater value). However, for = 14.4, the region of best fit ( r 2 less than 10) is larger than the raw values and there are regions where the interpolated surface is not as smooth as when = 10.8. In some situations, this lack of smoothness leads to PDFs that are also not smooth and display bumps at values for the parameter settings of the node points (not shown). For these reasons, we choose = 10.8 for our analysis.
To further test our choice of , we perform an out-ofsample test on 300 runs of the MESM that were not included in the 1800 member ensemble used in this study. The parameter settings for the out-of-sample runs were the result of two separate 150-member Latin hypercube samples (McKay et al., 1979) and did not correspond to the settings of any of the node points. For each run, we calculate r 2 for the surface diagnostic matching the one used in Figs RBF interpolation method with = 10.8 and the 1800 runs as nodes (Fig. 5).
With a few exceptions, we see good agreement between r 2 calculated from the model output and r 2 estimated from the RBF algorithm. The biggest discrepancies are typically found for r 2 values greater than 50, where the likelihood function for the diagnostic approaches 0. We also note that the differences are small in regions of the parameter space where the likelihood function approaches its maximum, namely for small r 2 . Lastly, we find an almost equal number of runs where the difference between the value calculated from the model output and the value estimated from the RBF method is greater than zero and where the difference is less than zero, indicating no substantial bias in the RBF algorithm. Because we see good agreement of the RBF interpolated surface with the out-of-sample test runs and observe a smooth response surface with a good fit to the data (Figs. 3 and 4), we argue that choosing = 10.8 is appropriate.
To test the impact of the methodological changes, we start from a previously published probability distribution and apply the changes one at a time. For a reference point, we start with the PDF from Libardoni et al. (2018a) derived using the HadCRUT3 surface temperature dataset (Brohan et al., 2006) and the likelihood function presented earlier in Sect. 2. The changes we implement are to (i) change the ocean diagnostic from the 0-3000 m layer to the 0-2000 m layer, (ii) replace the interpolation method of Forest et al. (2006) with the RBF interpolation method, and (iii) change from using natural variability estimates from different control run models for the surface and ocean diagnostics to a common model for both estimates. To better illuminate the changes, we derive an additional PDF changing both the control run model and the interpolation method simultaneously. We summarize the resulting distributions in Fig. 6.
When changing the ocean diagnostic from the 0-3000 m layer to the 0-2000 m layer, we observe the largest change as a shift towards higher K v . As measured by the 90 % credible interval for the marginal distribution of √ K v , our estimate increases from 0.29-1.90 to 0.81-3.22 cm s −1/2 . We also note that the wider interval indicates a weaker constraint on the estimate of K v . In the MESM, K v controls how fast heat is mixed into the deep ocean. Thus, we trace the shift towards higher K v to the stronger heating rate in the ocean diagnostic due to estimating the trend from the 0-2000 m data (Fig. 2). We observe a small shift towards higher ECS and almost no change in estimates of F aer .
For the second change, we explore the implementation of the RBF interpolation algorithm. In Fig. 6, we observe that the parameter distributions are indeed smoother when the RBF method is used. This is particularly evident in the climate sensitivity distributions. We also note changes to the constraints on model parameters. In general, we see a flattening of the center of the distributions, as measured by the interquartile range (IQR). In particular, the IQR for √ K v increases from 0.59 to 0.71 cm s −1/2 (ranges of 0.71-1.3 to 0.86-1.57 cm s −1/2 ) and for F aer from 0.07 to 0.11 W m −2 (−0.25-−0.18 to −0.32-−0.21 W m −2 ) when comparing the reference PDF using the old interpolation method to the PDF estimated using the RBF method. This increase is consistent with our previous discussion that the RBF method tends to adjust low r 2 values upwards and high r 2 values downwards. In this situation, the maximum likelihood region of the joint PDF, where r 2 is a minimum, impacts all points within its radius of influence.
In general, we observe tighter constraints on all of the distributions when a common control run model is used for the surface and ocean diagnostics. For all three parameters, the width of the 90 % credible interval decreases. One potential reason for these tighter constraints is an undersampling of the internal variability resulting from using only CCSM4's variability and not across multiple models. Due to structural differences, the internal variability is not the same across all models and a single model does not span the full range of variability. We investigate the sensitivity of the distributions to the internal variability estimate in a separate study (Libardoni et al., 2018b).
Despite the tighter constraints, we observe multiple minima and maxima in the climate sensitivity distribution. All of the local extrema occur at values of ECS where the model has been run. We attribute these oscillations to the spline interpo-lation method attempting to pass through r 2 exactly at all of the points and observe them in plots similar to Fig. 3 for different aerosol levels (not shown). In addition to the method developed in this study, using a smoothing spline is another interpolation method that can eliminate these multiple extrema. Because the assumed impact of the old interpolation method leads to the spurious ECS marginal distribution, we also show the case where both the control run and interpolation method are changed together (purple curve in Fig. 6). This test also separates the impacts of changing datasets and diagnostics (ocean dataset) from the technical details of the derivation (interpolation method and variability estimate).
We summarize the net impact of the changes by implementing all three simultaneously (red curve in Fig. 6). When comparing the ECS and F aer distributions, we observe very little change in the estimates of central tendency and stronger constraints on the parameters. Here, we measure central tendency by the median of the distribution and the constraint by the width of the 90 % credible interval. Before implementing the changes, we estimate the median ECS to be 3.44 • C with a 90 % credible interval of 2.24-5.48 • C. After the changes, we estimate a median of 3.45 • C and a 90 % credible interval of 2.54-4.96 • C. Similarly, for F aer we estimate a median of −0.22 W m −2 and a 90 % credible interval of −0.38-−0.11 W m −2 before and a median of −0.23 W m −2 and a 90 % credible interval of −0.38-−0.11 W m −2 after the changes. This pattern does not hold for the K v distribution.

√
K v , we estimate the median to increase from 1.00 to 1.77 cm s −1/2 and the 90 % credible interval to change from 0.29-1.90 to 1.03-3.32 cm s −1/2 when implementing the new methodology. We previously showed that the change in ocean dataset led to higher K v estimates without changing the central estimates of the other two parameters. Combining this with the findings from the ECS and F aer distributions leads us to conclude that the central estimates of the distributions change with the diagnostics, and that the technical changes, namely the unforced variability estimate and the interpolation method, impact the uncertainty estimates.

Temporal changes to model diagnostics
Before presenting new PDFs using the methods discussed in the previous section, we present the model diagnostics used to derive them. We show the time series of decadal mean temperature anomalies with respect to the 1906-1995 climatology in the four equal-area zonal bands of the surface temperature diagnostic (Fig. 7). We plot the time series from 1941 to 2010 with the decadal mean plotted at the midpoint of the decade it represents. In tests where we extend the model diagnostics by holding the start date fixed and add additional data, we add an additional data point to the end of each time series. In tests where we hold the length of the diagnostics fixed while adding recent data, we change which five data points are used.

28
A. G. Libardoni et al.: Estimates incorporating recent change 30°-90° N 1950 1960 1970 1980 1990 2000 2010 Year From the time series, we see that while general similarities exist, the model diagnostic depends on which surface observations are used. Across all datasets, we observe the largest signal in the 30-90 • N zonal band, consistent with the polar amplification of warming. We also note that the highest agreement across the datasets is observed in this band. We find that there is a separation between the time series in the 0-30 • N and 0-30 • S zonal bands based on which SST dataset a group used for the temperatures over the ocean. When considering this split, we see similar patterns in the tropical bands between datasets using HadSST (HadCRUT4 and BEST) and datasets using ERSST (MLOST and GIS-TEMP250). Although not shown, we observe similar patterns in the hemispheric and global mean time series, with a stronger warming signal in the Northern Hemisphere and the time series showing sensitivity to the dataset.
We illustrate how additional data impact the estimate of the linear increase in ocean heat content (Figs. 8 and 9). In both figures, we plot the time series from Levitus et al. (2012) with the pentadal mean plotted at the midpoint of the 5-year period defining the pentad. In Fig. 8, we fix the starting date in 1955 and shift the end date further ahead. In Fig. 9, we fix the length of time over which the linear trend is calculated and shift the entire range forward.
The recent acceleration of heat stored in the deep ocean is well documented (Levitus et al., 2012;Gleckler et al., 2016), and as expected, we find that the trend estimate depends on both the end points of the period used for estimation and the length of the period used for estimation. As previously stated, more recent observations have a stronger influence on Global mean ocean heat content (10 J) 22 Figure 9. As in Fig. 9, except the diagnostic length is held fixed. Linear trend estimates are for the 1955-1990 (black), 1965-2000 (red), and 1975-2010 periods. the trend estimate because the standard error of the observations decreases with time. We calculate higher trend estimates when holding the period length fixed while including more recent data compared to when the period is extended to include more recent data. We estimate a trend of 3.4 ± 0.28 ZJ yr −1 when considering the period from 1955 to 1990. For diagnostics ending in 2000, we estimate a trend of 4.0 ± 0.19 ZJ yr −1 if the starting date is shifted to 1965 and a trend of 3.7 ± 0.15 ZJ yr −1 if the starting date is held at 1955. Trends of 6.0 ± 0.18 and 5.2 ± 0.12 ZJ yr −1 are es-timated when using data up to 2010 and holding the diagnostic length fixed and extending the diagnostic length, respectively. By shifting the diagnostic rather than extending it, the accelerated warming signal is stronger because periods of slower warming earlier in the time series are replaced by periods of more rapid warming later in the time series. For each surface and ocean diagnostic set, we derive joint probability distributions according the experiments discussed in Sect. 2. To account for the different surface temperature datasets, we derive a PDF using each of the four datasets as observations in the surface temperature diagnostic. We combine the four PDFs into a single estimate by taking the average likelihood at each point in the joint PDF. In offline calculations, we confirmed that the marginal PDFs for each parameter using the average joint PDF were nearly identical to the marginal PDFs resulting from the merging method used to submit the distributions from Libardoni and Forest (2013) for inclusion in the Intergovernmental Panel on Climate Change Fifth Assessment Report (IPCC AR5, Collins et al., 2013). For the IPCC AR5 estimates, we drew a 1000member Latin hypercube sample from each distribution and calculated marginal distributions for each parameter from the histogram of the drawn values. By including an equal number of samples from each distribution, we assign equal weight to each surface temperature dataset and make no assumption or judgement about whether any dataset is better or worse than the others. Taking the average of the four PDFs is the limit of this method as the number of draws approaches infinity. We justify using the average of the four PDFs by noting that the same general conclusions are drawn from the combined PDF as would be drawn from the PDFs derived from individual datasets.
We first investigate the PDFs by looking for correlations between the model parameters. For each pair of model parameters and for each configuration of the model diagnostics, we calculate the two-dimensional marginal distribution by integrating over the third parameter (Fig. 10). From these distributions, correlations between the pairs of parameters are evident, independent of the diagnostic length and end date. We find ECS and K v to be positively correlated, ECS and F aer to be negatively correlated, and K v and F aer to be positively correlated. These correlations make sense when related to the model diagnostics. If we take a fixed surface temperature pattern and conduct a thought experiment for each pair of parameters, the correlations emerge when considering the energy budget at the atmosphere-ocean interface. For a fixed forcing, if climate sensitivity increases, surface temperatures would increase in response to the more efficient heating of the surface. Because these higher temperatures no longer agree with the fixed temperature pattern, a mechanism for removing excess heat from the surface is needed to reestablish balance in the system. In the MESM framework, this mechanism is more efficient mixing of heat into the deep ocean, and thus higher values of K v . If we fix K v and again increase ECS so that surface temperatures would increase in  response, the mechanism for reducing the energy budget at the surface is the aerosol forcing. To maintain the necessary balance at the surface, F aer needs to be more negative, and is thus negatively correlated with ECS. Lastly, if ECS is fixed, an increase in K v would remove energy from the surface and tend to cool temperatures. A weaker (less negative) aerosol forcing is needed to maintain the energy balance, indicating that K v and F aer are positively correlated. Similar arguments follow when considering the ocean heat content diagnostic and the energy budget of the ocean.
Second, we show that incorporating more recent data into the temperature diagnostics has a significant impact on the individual parameter estimates by investigating the marginal PDF of each parameter (Fig. 11). Unless otherwise noted, we again approximate the central estimate of the distributions as the median and use the 90 % credible intervals to estimate the uncertainty. Across all three parameters, we generally observe sharper PDFs as more recent data are added. Furthermore, the constraints are stronger when the data are used to extend the diagnostics as opposed to when the diagnostic lengths are fixed. We attribute the general tightening of the distributions with recent data to the strong climate signals that have emerged in the observations. Further, we argue that the uncertainty bounds tend to be tighter when the diagnostic lengths are increased because the model output is being compared against more detailed temperature patterns with additional data points to match. Runs that do not match the added points are rejected for being inconsistent with the observations. For climate sensitivity, we find that extending the data beyond 1990 leads to higher climate sensitivity estimates when compared to the estimate shown in Fig. 6 that incorporates all of the methodological changes. However, we find that the inclusion of more recent data does not always lead to an increase in the estimate of ECS. Our estimate of ECS for diagnostics ending in 2000 is greater than the estimate for the diagnostics ending in 2010, regardless of whether the diagnostic length is extended or fixed. For the case where the diagnostics are extended, we estimate a median climate sensitivity of 4.04 • C with data ending in 2000 and 3.73 • C with data ending in 2010. When the diagnostic length is fixed, we estimate median climate sensitivities of 4.08 and 3.72 • C for diagnostics ending in 2000 and 2010, respectively. We hypothesize that the lowering of the estimate for ECS with diagnostics ending in 2010 can be attributed to the slowing of global mean temperature rise in the 2000s as more heat was stored in the deep ocean. We also note the uncertainty in the estimate of ECS decreases as more recent data are added and the tighter uncertainty bounds come predominantly from a reduction in the upper tail of the distribution. There is also a slight increase in the estimate of the lower bound, however.
Our estimates of K v show large shifts in response to changes in the diagnostics. When the diagnostics end in 1990, we find a very weak constraint on K v , with a non-zero tail throughout the domain. As more recent data are included, we see a large reduction in the upper tail of the distributions. We also see shifts towards higher K v with the inclusion of data from 2001-2010. When including these data, we estimate √ K v to increase from 1.45 to 2.08 cm s −1/2 when the diagnostic lengths increase and from 1.16 to 1.62 cm s −1/2 when the diagnostic lengths are fixed. Because K v sets how fast heat is mixed into the deep ocean in the model, we attribute the higher estimates to the recent acceleration of heat storage in the 0-2000 m layer (see Figs. 8 and 9).
We also see shifts in the F aer distribution in response to the changes in model diagnostics. We reiterate that in the MESM, F aer sets the amplitude of the net anthropogenic aerosol forcing and represents the sum of all unmodeled forcings. We observe shifts towards stronger cooling (more negative values of Although not shown, we observe these shifts in the F aer distributions for each of the PDFs derived using the different datasets individually, but note that we see smaller changes with the merged PDF. Also, from the individual PDFs, we see a grouping of the F aer distributions based on the SST dataset used by the research center. We find the HadCRUT4 and MLOST distributions (HadSST) and the BEST and GIS-TEMP250 distributions (ERSST) to be similar.
We attribute the shift towards stronger cooling for the 1991-2000 decade to the cut-off of the high K v tail. When K v decreases, excess heat in the Earth system is stored in the ocean less efficiently. In response to this excess heating, surface and atmospheric temperatures would rise unless an external factor is active and opposes the heating. In the MESM, negative values of F aer reduce the net forcing and contribute to balancing the global energy budget. The spatial pattern of the net aerosol forcing in the MESM leads to the forcing being stronger in the Northern Hemisphere than in the Southern Hemisphere. With this pattern, we observe stronger temperature responses in the Northern Hemisphere when we adjust F aer than we do in the Southern Hemisphere. We attribute the shift back towards weaker aerosol cooling when adding the 2001-2010 trends to the northern hemispheric polar amplification signal noted earlier in this section.
Finally, we derive estimates of transient climate response from the PDFs discussed above (Fig. 11d). From each PDF, we draw a 1000-member Latin hypercube sample and calculate TCR for each of the ECS-

√
K v pairs using the model response surface derived in Libardoni et al. (2018a). The PDFs of TCR are estimated from the histogram of TCR values with bin size = 0.1 • C. We show that the TCR estimates reflect changes in the parameter distributions. In particular, TCR and climate sensitivity are positively correlated and TCR and K v are negatively correlated. Furthermore, the uncertainty in the TCR distribution is correlated with the uncertainty in ECS and K v . Thus, we find that TCR estimates are greater when more recent data are added due the higher climate sensitivity estimates, but are smaller in 2010 than in 2000 due to the shift towards higher K v . Furthermore, TCR estimates are higher when the diagnostic lengths are fixed compared to when they are extended.

Spatial changes to model diagnostics
Until now, we have only considered how the temporal component of the diagnostics impacts the parameter estimates. As a final case study, we reduce the spatial dimension of the surface temperature diagnostic by replacing the four zonal band diagnostic with either global mean surface temperature or hemispheric mean temperatures using the 1941-2010 diagnostic period (Fig. 12). Similar to the PDFs shown when changing the temporal structure of the diagnostic, we present the distributions calculated from the average of the four individual PDFs derived using the different surface temperature datasets.
We find little sensitivity in the central estimates of the ECS and K v distributions to the spatial structure of the surface diagnostic using data up to 2010. For ECS, the median estimate for when global mean temperatures, hemispheric means, and four zonal bands are used are 3.81, 3.75, and 3.72 • C, respectively. Similarly, median estimates for √ K v are 2.06, 1.94, and 2.08 cm s −1/2 when global mean, hemispheric mean, and four zonal mean temperatures are used. However, we observe a tightening of the distributions as the spatial resolution of the surface diagnostic increases. The narrowest distributions are derived using the four zonal band diagnostic and the widest distributions are derived using global mean temperatures. We note that the TCR distributions follow the shifts in ECS and K v . Thus, the central estimates do not change significantly, but the width of the distribution shrinks as spatial information is added to the surface diagnostic.
Unlike with the ECS and K v distributions, we observe a sensitivity to the surface diagnostic structure in the F aer distributions. In particular, we observe that the estimate derived using global mean temperature leads to the strongest (most negative) aerosol forcing and the estimate derived using the four zonal bands leads to the weakest aerosol forcing. When considering only global mean temperature, we remove the polar amplification signal from the temperature diagnostic. Removing this signal means that we ignore the spatial dependence of the aerosol distribution and only consider the net effect on the global energy budget. However, as we include variations of temperature with latitude, the spatial pattern of the aerosol forcing pattern matters. As a result, the median estimate of F aer shifts from −0.31 to −0.28 to −0.23 W m −2 when global mean, hemispheric mean, and four zonal bands are used. Thus, while the spatial structure has only a small influence on ECS and K v , it has a strong influence on F aer .

Conclusions
We implement a number of methodological changes to improve probability estimates of climate system properties. Changes include switching to an interpolation based on radial basis functions, estimating natural variability from a common model across diagnostics, using new observational datasets, and incorporating recent temperature changes in model diagnostics. We show that the parameter estimates follow signals in the data and depend on the model diagnostics. Furthermore, we show that the technical changes, namely the interpolation method and the natural variability estimate, do not considerably change the central estimate of the parameters, but do impact the uncertainty estimates of the distributions.
We have shown that the RBF interpolation method is successful in smoothing the distributions while not changing the central estimate. The success of the RBF method is an encouraging sign for future research. Due to the twodimensional interpolation method previously used, our work until now has been restricted to running ensembles on a uniform grid of points in the parameter space. The RBF method is three-dimensional and can be applied to any collection of node points. We can thus run the full model at any set of nongridded nodes and interpolate the goodness-of-fit statistics to estimate the values at intermediate points. Other studies (Sansó and Forest, 2009;Olson et al., 2012) have built statistical emulators to approximate model output at non-node parameter settings for each point in the diagnostic time series and then calculate the likelihood function by comparing the emulator output to observations. We argue that by interpolat-ing the metrics, rather than model output at individual points in the time series, we approximate the impact of all feedbacks on the diagnostic together, rather than individually at different spatial and temporal scales.
Our results suggest that the spatial structure of model diagnostics plays a key role in the estimation of parameters with spatial variation. When adding spatial structure to the diagnostics, we observed little change in parameters representing global mean quantities (ECS and K v ), but the distributions of F aer differed depending on whether global mean temperature, hemispheric mean temperatures, or temperatures in four equal-area zonal bands were used. When global diagnostics are used, we ignore the spatial variation of forcing patterns and fail to account for regional influences on climate change. Our estimates provide an assessment of the importance of these spatial patterns when estimating probability distributions for model parameters.
Overall, our work highlights that recent temperature trends have a strong influence on the parameter distributions. In particular, we observe a shift in the distributions towards higher climate sensitivity due to the addition of recent surface temperature warming trends relative to 1990, but with a reduction in the estimate when using data up to 2010 as opposed to 2000. We also observe that the distributions of K v shift towards higher values. The uncertainty in our estimates decreases as more recent data are used in the temperature diagnostics. Our estimates of transient climate response reflect the changes in ECS and K v and are correlated with ECS and anticorrelated with K v . By incorporating more recent data, which are of higher quality, and using improved methodology, we are more confident in our estimates of the model parameters and transient climate response.
Code and data availability. The source code of MESM will become publicly available for non-commercial research and educational purposes as soon as a software license that is being prepared by the MIT Technology Licensing Office is complete. For further information contact mesm-request@mit.edu. All data required to reproduce the figures in the main text and scripts to replicate the figures are available. Model output is available upon request.

Appendix A: Grid spacing in normalized model parameter space
As discussed in Sect. 2, when estimating r 2 at intermediate points, the weight assigned node point values in the radial basis function interpolation is a function of the distance between the two points. We have normalized the parameter space for each parameter by the range sampled in the 1800member ensemble of MESM runs so that each dimension is isometric in the distance calculation. In this normalized space, the grid spacing for each model parameter is K v = 1 cm s −1/2 8 cm s −1/2 = 0.125, The weight of any node point in the calculation of r 2 at an interpolated point is given in Eq. (5) and is a function of the distance between the points and the scaling parameter . When first developing the algorithm, we hypothesized that having each node point influence the r 2 value at an interpolated point within three grid points in model parameter space would achieve the fit and smoothness we sought from the interpolation. Because the grid spacing in normalized space is not equal for the three parameters, we chose an average of the three individual spacings and used 0.1 as the approximate distance of one grid space. Setting d = 0.3 and φ = 0.01 to account for the distance between three nodes and the weight approaching zero at that distance, respectively, we solve for = 7.2. To test other values, we scaled the original choice by factors of 0.5, 1.5, and 2. For = 3.6, we calculate an e-folding distance of 0.27. This implies a large sphere of influence, as the weight decays to 0.37 at a distance of approximately three grid points away in normalized parameter space. Thus, rather than decay to zero as for the original estimate, there is still significant influence from the node point at d = 0.3. This leads to the over-smoothing of the r 2 pattern observed in Fig. 3. In similar calculations, we determine e-folding distances in normalized parameter space of 0.09 and 0.07 for = 10.8 and = 14.4, respectively. For = 10.8, this implies an e-folding distance of approximately one grid space in the √ K v and F aer dimensions, while for = 14.4, the weight has decayed to 0.13 at a distance of one grid space in those dimensions. Using larger values of leads to further decay of the weighting function one normalized grid point away from the nodes. We chose s of 17.2 and 21.2 to demonstrate this feature.