To study climate change on multi-millennial timescales or to explore a
model's parameter space, efficient models with simplified and parameterised
processes are required. However, the reduction in explicitly modelled
processes can lead to underestimation of some atmospheric responses that are
essential to the understanding of the climate system. While more complex
general circulation models are available and capable of simulating a more realistic
climate, they are too computationally intensive for these purposes. In this
work, we propose a multi-level Gaussian emulation technique to efficiently
estimate the outputs of steady-state simulations of an expensive atmospheric
model in response to changes in boundary forcing. The link between a
computationally expensive atmospheric model, PLASIM (Planet Simulator), and a cheaper model,
EMBM (energy–moisture balance model), is established through the common boundary condition specified by an
ocean model, allowing for information to be propagated from one to the other.
This technique allows PLASIM emulators to be built at a low cost. The method
is first demonstrated by emulating a scalar summary quantity, the global mean
surface air temperature. It is then employed to emulate the dimensionally
reduced 2-D surface air temperature field. Even though the two atmospheric
models chosen are structurally unrelated, Gaussian process emulators of
PLASIM atmospheric variables are successfully constructed using EMBM as a
fast approximation. With the extra information gained from the cheap model,
the multi-level emulator of PLASIM's 2-D surface air temperature field is
built using only one-third the amount of expensive data required by the
normal single-level technique. The constructed emulator is shown to capture
93.2 % of the variance across the validation ensemble, with the averaged RMSE
of 1.33

Complex computer simulations are used in climate research to improve our understanding of the climate system. They are often used to project future changes in global temperature, corresponding to different emission scenarios. Our confidence in these projections is highly dependent on how reliable the simulations are. For example, the study of palaeoclimate offers an insight into the Earth's past climate system and also provides valuable out-of-sample data to validate our simulations. However, this requires running complex simulations on multi-millennial timescales, which is computationally demanding. For most coupled atmosphere–ocean general circulation models (AOGCMs), this is currently not feasible. Other studies, such as uncertainty and sensitivity analysis or history matching, require a thorough exploration of the input parameter space. The class of fast models known as Earth system models of intermediate complexity (EMICs) is suitable for these types of studies. Their efficiency is achieved by a combination of lower spatial and/or temporal resolution and the use of simplified parameterisations. However, depending on the nature of the questions asked, these lower-fidelity models might be insufficient.

To address this issue, an emulator is often employed to provide a statistical
estimation of the expensive model's response without the need to perform a
new simulation. Even then, this approach becomes impractical when the models
of interest are very computationally intensive. In order to build a reliable
emulator, a certain number of simulations is needed to provide the basis upon
which the emulator is built. This number can be large, especially when
multiple model parameters are varied or when the model's climate response
exhibits non-linear behaviours. For a computationally expensive GCM, a
sufficient number of simulations are often not affordable. This paper
describes an efficient emulation process that utilises the connection
between models of different complexities. The idea is to establish a
traceable hierarchy, using an emulator for the simple model to construct an
emulator of the more complex one

While the high-fidelity (complex) model is computationally expensive, the low-fidelity (simple) model is cheaper to evaluate and can be sampled more finely
across the input space, providing extra information where expensive data are
sparse. The models forming this hierarchy can be structurally related or
structurally unrelated. Models are referred to as structurally related when
they are from the same family of code but have different resolutions. These
models might have other differences resulting from the change in mesh
resolution. Examples of such models are the HadCM3 (Hadley Centre Coupled Model version 3)

The following work illustrates the use of a method that combines multi-level emulation with a dimensional reduction technique through an example study using GENIE-1, from the Grid ENabled Integrated Earth system modelling framework (GENIE), and PLASIM (Planet Simulator). GENIE-1 and PLASIM are chosen in this case since they are both suitable for Earth system modelling on long timescales, but are structurally different. PLASIM's atmosphere is also substantially more complex and thus computationally more expensive than GENIE-1's energy–moisture balance model (EMBM) of the atmosphere. EMBM incorporates the vertically integrated energy–moisture balance equations while PLASIM is based on the moist primitive equations representing the conservation of momentum, mass and energy. EMBM, therefore, is not capable of producing air temperature and pressure at different altitudes or an interactive cloud and wind field. The hierarchy formed by these two models is exploited using the multi-level technique, allowing us to construct an emulator of PLASIM atmospheric variables at a reduced cost. Specifically, Gaussian process emulators are used to obtain the statistical relationship between the response of the EMBM atmosphere and the PLASIM atmosphere to changes in their boundary conditions (sea surface temperature, long-wave and shortwave radiative forcing). The ability of this relationship to predict the behaviour of the PLASIM atmosphere, in the absence of feedbacks on other climate system components, is then assessed. The dimensional reduction technique is employed to extend the emulation method to the prediction of high-dimensional outputs in addition to scalar summary quantities.

Once constructed, the emulators provide estimates of simulation results, at
untried combinations of the inputs, as finely as needed, at a low cost. This
enables statistical methods such as history matching

In this study, we utilise the atmospheric component of GENIE-1
(version 2.7.8)

The configuration of GENIE-1 employed here couples a single layer EMBM
atmosphere to a 3-D frictional geostrophic ocean model with linear drag
(GOLDSTEIN) and a thermodynamic, advection–diffusion sea-ice model (GOLDSTEIN
sea ice). The ocean component is run at

The parameterisation of atmospheric transport of heat and moisture in EMBM is
done by diffusion. Moisture can also be advected by a prescribed monthly
climatological wind field. This wind field is fixed and is the same for all
simulations in EMBM. The effect of cloud cover on incoming shortwave
radiation is captured through a prescribed albedo field, diagnosed from
reanalysis data

The atmosphere of PLASIM–ENTS

PLASIM is run at T21 resolution, which corresponds to a triangular truncation
applied at wave number 21. It is almost an exact match of GENIE-1's

For our study, surface output fields of GENIE-1, namely, sea surface temperature (SST), fractional sea-ice coverage (SIC) and sea-ice thickness (SIH) are used to drive PLASIM. This means that the atmospheric circulation can change according to the underlying sea surface temperature and sea-ice condition but cannot influence the ocean or sea-ice physical state. This constrains PLASIM responses to a certain extent. The atmospheric responses of EMBM and PLASIM to the same set of physically plausible boundary conditions are compared and emulated. The surface air temperature (SAT) from EMBM atmosphere is treated as a fast approximation of PLASIM SAT when multi-level emulation is applied.

Ten of the chosen parameters, with the exception of ICF
and RFC, are taken from an ensemble design used in

To explore emulator performance in situations where the climate states are
very different from modern conditions, an ensemble is designed to fill a
large input space; 12 model parameters and one dummy variable are varied,
either linearly or logarithmically, over the ranges indicated in
Table

The first parameter (ICF) represents the boundary condition of the glacier
coverage as well as the corresponding orography at different snapshots in time
extending from the present (

Mixing and transport in the ocean are controlled by the isopycnal and
diapycnal diffusivity parameters (OHD and OVD, respectively), a momentum drag
coefficient (ODC) and a wind scaling factor (WSF)

APM is a flux correction responsible for transporting freshwater from the
Atlantic to Pacific, affecting deep water sinking in the North Atlantic and
hence the strength of the AMOC

In addition to these 12 model parameters, a dummy parameter is included for
statistical validation purposes, which will be discussed in more detail in
Sect.

First, all input parameters are normalised to

A summary of the simulated climate states from the 600-member ensembles of GENIE-1 with EMBM and PLASIM.

Each member simulation of this ensemble is run for 5000 years to reach a
steady state; 600 simulations were completed successfully, producing a large range
of climate responses, which are summarised in Table

A second MLH design is generated in the same parameter space, producing 214 successful simulations, for validation purposes. The emulator predictions at these points are compared against the simulated values to assess the performance of the emulators.
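The comparison of emulated against simulated values can be sketched with two summary statistics, the RMSE and the coefficient of determination. The snippet below uses the standard textbook definitions, which may differ in detail from those used in this study, and the numbers are hypothetical values chosen for illustration only.

```python
import math

def rmse(simulated, emulated):
    """Root-mean-square error between simulated and emulated values."""
    n = len(simulated)
    return math.sqrt(sum((s - e) ** 2 for s, e in zip(simulated, emulated)) / n)

def r_squared(simulated, emulated):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_sim = sum(simulated) / len(simulated)
    ss_res = sum((s - e) ** 2 for s, e in zip(simulated, emulated))
    ss_tot = sum((s - mean_sim) ** 2 for s in simulated)
    return 1.0 - ss_res / ss_tot

# Hypothetical validation values (not taken from the ensembles).
simulated = [10.0, 12.0, 14.0, 16.0]
emulated = [10.5, 11.5, 14.5, 15.5]
print(rmse(simulated, emulated), r_squared(simulated, emulated))
```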

The mean and standard deviation of SST and fractional ice coverage across the 600-member ensemble. The SST and sea-ice coverage are prognostic outputs of GENIE-1 while the land ice coverage is regridded from Peltier ICE-5G. These fields, among others, are applied as surface boundary conditions to drive the PLASIM atmosphere.

For each successful GENIE-1 simulation, surface output fields are extracted
and used to force PLASIM for another 35 years. Each sampling plan, therefore,
produces two equivalent ensembles of EMBM and PLASIM outputs. The fields used
to initiate PLASIM simulations are SST, SIC and SIH as mentioned in
Sect.

Both ensemble designs are larger than needed in this case. On average, 10 simulations are needed for each parameter being varied. Since 13 parameters
are perturbed, a 130-member ensemble would be sufficient. There are several
reasons why a 600-member ensemble was used. First, the number of simulations
required ultimately depends on the variations of the variable of interest
within the specified parameter space. If this variable behaves non-linearly
and exhibits a bifurcation, more simulations would be required to capture
such behaviour accurately. Second, the required numbers of simulations of
cheap and expensive models are unknown. Different combinations of subsets
with varying sizes are used and compared in the following section. It would be
ideal to generate a new design separately for each case, but this is highly
inefficient and would result in a large, incoherent ensemble with low
reusability. Therefore, it is preferable to start with a large design from
which different subsets can be chosen. These subsets are all subjected to the
same maximin criterion mentioned above. The algorithm used is covered in
Sect.
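As an illustration of choosing a well-spread subset from a larger design, the sketch below applies a greedy farthest-point heuristic, which at each step adds the candidate whose distance to the already chosen points is largest. This heuristic is an assumption made for illustration and is not necessarily the algorithm used in this study. The design here is a uniform random sample in the unit hypercube standing in for the actual 600-member design, and `greedy_maximin_subset` is a hypothetical helper name.

```python
import math
import random

def greedy_maximin_subset(design, k, start=0):
    """Pick k row indices from design, greedily maximising the minimum distance."""
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point
    min_dist = [math.dist(p, design[start]) for p in design]
    while len(chosen) < k:
        nxt = max(range(len(design)), key=lambda i: min_dist[i])
        chosen.append(nxt)
        for i, p in enumerate(design):
            min_dist[i] = min(min_dist[i], math.dist(p, design[nxt]))
    return chosen

n_params = 13                      # 12 model parameters plus one dummy variable
rule_of_thumb = 10 * n_params      # 10 simulations per varied parameter
print(rule_of_thumb)               # 130

# Stand-in design: uniform random points, NOT the maximin Latin hypercube.
rng = random.Random(0)
design = [[rng.random() for _ in range(n_params)] for _ in range(600)]
subset = greedy_maximin_subset(design, 50)
```

Because every subset is drawn from the same parent design, the corresponding simulations can be reused across experiments with different training-set sizes.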

In a computer experiment, the model outputs at some
combinations of input parameters are considered as observations. An emulator
is a statistical surrogate of a model, which is generally much cheaper to
evaluate and, once validated, can be used in place of the full model to
predict the observation at untried choices of inputs. Our interest focuses on
the Gaussian process (GP) emulator, also known as kriging

To emulate a single summary quantity of the simulation outputs, for example,
the global mean SAT, the assumptions made are as follows:

The model output is a smooth function of its inputs.

The model can be represented as a GP.

Each emulator is concerned with a single deterministic scalar output.

The climate model,

The covariance function is given by

The value of

The specified GP is used as a prior for Bayesian inference and is
parameterised in terms of the hyperparameters

Prior beliefs about the model behaviour are combined with observations from
training points to produce a posterior distribution for the model. Having
obtained estimates for

The exponential power form of covariance structure used here is a common choice due to its flexibility. Its assumption of stationarity might fail, for example, when there is a bifurcation in the system. The covariance specified, however, provides a weak prior and, as more training points are used, it contributes less to the final emulator.
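To make the role of the covariance function concrete, the following is a minimal sketch of a constant-mean GP emulator with an exponential power correlation, written here as exp(-Σ_j θ_j |x_j − x'_j|^{p_j}). This parameterisation, the fixed hyperparameter values, the small nugget and the toy function are all assumptions made for illustration; in practice the hyperparameters would be estimated from the training data, and the emulators in this study were built with existing software rather than with this code.

```python
import numpy as np

def power_exp_corr(A, B, theta, p):
    """Correlation exp(-sum_j theta_j * |a_j - b_j|**p_j) between rows of A and B."""
    diff = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(theta * diff ** p, axis=2))

def fit_kriging(X, y, theta, p, nugget=1e-10):
    """Condition a constant-mean GP on (X, y) with fixed hyperparameters."""
    n = len(y)
    R = power_exp_corr(X, X, theta, p) + nugget * np.eye(n)
    L = np.linalg.cholesky(R)
    solve = lambda b: np.linalg.solve(L.T, np.linalg.solve(L, b))
    ones = np.ones(n)
    mu = (ones @ solve(y)) / (ones @ solve(ones))   # generalised least-squares mean
    alpha = solve(y - mu)
    sigma2 = (y - mu) @ alpha / n                   # estimate of the process variance
    return dict(X=X, theta=theta, p=p, mu=mu, alpha=alpha,
                sigma2=sigma2, solve=solve)

def predict_kriging(model, Xnew):
    """Posterior mean and variance at the rows of Xnew."""
    r = power_exp_corr(Xnew, model["X"], model["theta"], model["p"])
    mean = model["mu"] + r @ model["alpha"]
    # Variance that ignores the uncertainty in the estimated mean
    var = model["sigma2"] * (1.0 - np.sum(r * model["solve"](r.T).T, axis=1))
    return mean, np.maximum(var, 0.0)

# Toy stand-in for a simulator with two normalised inputs.
f = lambda Z: np.sin(3 * Z[:, 0]) + Z[:, 1] ** 2
rng = np.random.default_rng(0)
X = rng.uniform(size=(40, 2))
model = fit_kriging(X, f(X), theta=np.array([5.0, 5.0]), p=1.9)
Xnew = rng.uniform(size=(100, 2))
mean, var = predict_kriging(model, Xnew)
```

The predictive variance shrinks towards zero near training points and grows away from them, which is what allows the emulator to report its own uncertainty at untried inputs.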

Co-kriging is an extension to the previously described technique, which is
applicable when a fast approximation of the primary simulator is available.
In order for this method to work, the primary simulator and its approximation
need to fulfil an additional assumption:

The different levels of code are correlated and contain information about one another.

When only a small number of expensive runs is available, it has been shown that by combining these with cheaper runs from a simplified code, an emulator of the expensive model can be built at a lower cost. We make a simplification that the expensive and cheap models,

Two sets of training points are required for the construction of a co-kriging
emulator, a cheap set

When the number of PLASIM training points is small, such that a kriging
emulator cannot be built with high accuracy, co-kriging employing an
additional large number of training points from GENIE-1's EMBM can be used
instead. The number of points required depends on the size of the problem as
well as the smoothness of the function being emulated. The inputs at which
the expensive training set is obtained,

The covariance matrix for co-kriging,

Both kriging and co-kriging emulators are constructed using readily available
software from
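The idea of borrowing information from a cheap model can be illustrated with a simplified two-step scheme: emulate the cheap model with many points, scale that emulator by a factor ρ, and emulate the remaining discrepancy with the few expensive points. This is a sketch under stated assumptions and not the joint co-kriging covariance described above: it assumes the expensive inputs are a subset of the cheap inputs, estimates ρ by ordinary least squares, fixes all hyperparameters by hand, and uses two toy 1-D functions in place of EMBM and PLASIM. All function names are hypothetical.

```python
import numpy as np

def corr(A, B, theta, p=1.5):
    """Exponential power correlation between rows of A and B."""
    d = np.abs(A[:, None, :] - B[None, :, :])
    return np.exp(-np.sum(theta * d ** p, axis=2))

def gp_fit(X, y, theta, nugget=1e-10):
    """Constant-mean GP interpolator (sample mean used for brevity)."""
    R = corr(X, X, theta) + nugget * np.eye(len(y))
    mu = y.mean()
    alpha = np.linalg.solve(R, y - mu)
    return X, theta, mu, alpha

def gp_mean(model, Xnew):
    X, theta, mu, alpha = model
    return mu + corr(Xnew, X, theta) @ alpha

def cokriging_fit(Xc, yc, Xe, ye, theta_c, theta_d):
    cheap = gp_fit(Xc, yc, theta_c)
    yc_at_e = gp_mean(cheap, Xe)              # cheap emulator at expensive inputs
    # Least-squares scaling factor rho (with an intercept that is discarded)
    A = np.column_stack([yc_at_e, np.ones(len(ye))])
    rho, _ = np.linalg.lstsq(A, ye, rcond=None)[0]
    delta = gp_fit(Xe, ye - rho * yc_at_e, theta_d)   # discrepancy emulator
    return cheap, rho, delta

def cokriging_mean(model, Xnew):
    cheap, rho, delta = model
    return rho * gp_mean(cheap, Xnew) + gp_mean(delta, Xnew)

# Toy cheap and expensive "models" (illustrative only).
f_cheap = lambda x: np.sin(12 * x[:, 0])
f_exp = lambda x: 1.5 * np.sin(12 * x[:, 0]) + 0.3 * x[:, 0] + 0.5

Xc = np.linspace(0, 1, 25)[:, None]           # 25 cheap runs
Xe = Xc[::6]                                  # 5 expensive runs, nested in Xc
model = cokriging_fit(Xc, f_cheap(Xc), Xe, f_exp(Xe),
                      theta_c=np.array([50.0]), theta_d=np.array([2.0]))

Xtest = np.linspace(0, 1, 101)[:, None]
err_co = np.sqrt(np.mean((cokriging_mean(model, Xtest) - f_exp(Xtest)) ** 2))
single = gp_fit(Xe, f_exp(Xe), np.array([2.0]))   # expensive points only
err_k = np.sqrt(np.mean((gp_mean(single, Xtest) - f_exp(Xtest)) ** 2))
```

In this toy case the five expensive points cannot resolve the oscillation on their own, so the single-level emulator has a much larger error than the two-level one.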

So far, we have only discussed the use of GP emulators for single outputs.
This can be a summary quantity such as the strength of the AMOC or the global
average SAT

Climate variables at different spatial or temporal locations can be emulated
independently

In this work, we use principal component analysis (PCA) via singular value
decomposition (SVD) to transform the high-dimensional data into a meaningful
representation with lower dimensionality. While there are several techniques
to accomplish this task, PCA is efficient and has the advantage that the
leading components explain the majority of the variance across the ensemble

For each ensemble member, our field of interest, SAT, with dimension

The EOFs and PCs of EMBM and PLASIM SAT can be obtained by decomposing each
set separately. However, we are interested in using EMBM's PCs as the cheap
approximation of PLASIM's values; therefore, the SAT fields from both models are
projected onto the same orthogonal basis vectors defined by PLASIM's EOFs.
This gives a new set of PCs for EMBM's SAT:

The top (or high order) EOFs explain most of the variance in the data such
that the dimension of

The prediction,
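The decomposition and projection steps described in this section can be sketched as follows. The code computes EOFs and PCs of an expensive-model ensemble by SVD, projects a cheap-model ensemble onto the same EOFs, and reconstructs fields from the retained PCs. Centring on the ensemble mean before the SVD, the choice of which mean to subtract from the cheap fields, and the synthetic data are assumptions made for illustration; the grid size and function names are hypothetical.

```python
import numpy as np

def eofs_and_pcs(fields, n_keep):
    """fields has shape (n_members, n_cells); returns mean, EOFs, PCs, variance fractions."""
    mean = fields.mean(axis=0)
    anomalies = fields - mean
    U, s, Vt = np.linalg.svd(anomalies, full_matrices=False)
    eofs = Vt[:n_keep]                       # orthonormal spatial patterns
    pcs = anomalies @ eofs.T                 # equals U[:, :n_keep] * s[:n_keep]
    explained = s[:n_keep] ** 2 / np.sum(s ** 2)
    return mean, eofs, pcs, explained

def project(fields, mean, eofs):
    """PCs of fields with respect to a given set of EOFs."""
    return (fields - mean) @ eofs.T

def reconstruct(pcs, mean, eofs):
    """Rebuild fields from (emulated or simulated) PCs."""
    return mean + pcs @ eofs

# Synthetic ensembles: three shared spatial patterns plus small noise.
rng = np.random.default_rng(0)
n_members, n_cells = 40, 200
patterns = rng.normal(size=(3, n_cells))
weights = rng.normal(size=(n_members, 3))
expensive = weights @ patterns + 0.01 * rng.normal(size=(n_members, n_cells))
cheap = 0.8 * weights @ patterns + 0.01 * rng.normal(size=(n_members, n_cells))

mean_e, eofs, pcs_e, explained = eofs_and_pcs(expensive, n_keep=3)
# Assumption: the cheap fields are centred on their own ensemble mean.
pcs_c = project(cheap, cheap.mean(axis=0), eofs)
rebuilt = reconstruct(pcs_e, mean_e, eofs)
```

Each column of `pcs_c` and `pcs_e` then provides a cheap and an expensive training set for one scalar emulator, and `reconstruct` maps emulated PCs back to a full field.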

The EMBM output SATs are averaged over the final year of the 5000-year
simulations while PLASIM output fields are averaged over the last 30 years.
The ranges of some output variables obtained from the 600-member ensembles of
GENIE-1 and PLASIM simulations are summarised in Table

The mean and standard deviation of SAT across the 600-member
ensembles of GENIE-1 and PLASIM. There are white cells on the PLASIM SD plot
where the outputs go beyond the plotted range. The largest standard deviation
in PLASIM is 17.5

Figure

The differences seen in Table

Taylor diagrams showing a comparison between model runs with
climatology: GOLDSTEIN SST (left), EMBM SAT (middle) and PLASIM SAT (right).
The magenta dots represent reanalysis taken from Locarnini climatology
(1900–2005)

The resulting SAT from both models are compared against climatology in
Fig.

The simulated pattern of SST correlates well with observations (average correlation coefficient of 0.95), while the majority of the ensemble exhibits smaller spatial variability than climatology (average normalised SD of 0.85). The spread in these modern GOLDSTEIN SST points is due to the large range of the varied GENIE-1 parameters. The standard deviations of SAT are also underestimated in EMBM (average normalised SD of 0.83). PLASIM SAT correlates well with the climatology (average correlation coefficient of 0.97). The spatial variation in PLASIM SAT has a similar mean to EMBM but has a larger range (both ensembles have average normalised SD of 0.83).

An emulator is first constructed for EMBM global mean SAT
with a starting number of 30 training points. The coefficients of
determination (

The number of training points required varies from one emulator to another since it depends strongly on the function being emulated. As the number of parameters increases, the dimension of the emulator also increases and hence more training points are required. Typically an average of 10 points per dimension is assumed. This, however, depends on how non-linear or how “active” the function is. A highly non-linear function might require many more points while a more linear function might not need as many as 10 points per dimension.

Kriging emulators using only expensive points are also constructed to provide
comparison between the two techniques. When the same amount of training data
is used, co-kriging outperforms kriging. More expensive points are then added
to improve the kriging emulator until a similar value of RMSE is obtained. In
this case, the kriging emulator using

Validation results for kriging and co-kriging emulators of PLASIM global mean SAT. The co-kriging emulator uses 50 expensive points and 200 cheap points while the kriging emulator here uses the same 50 expensive points.

Validation results for kriging and co-kriging emulators of PLASIM global mean SAT – SST. The co-kriging emulator uses 70 expensive points and 250 cheap points while the kriging emulator here uses the same 70 expensive points.

A second pair of emulators is produced for the global SAT anomaly from SST
(global annual mean SAT minus SST). In this case, the component of the SAT
response that is a trivial function of the boundary conditions is removed.
Following the procedure described above, a co-kriging emulator using 70 expensive points and 250 cheap points was constructed and compared to a
kriging emulator using only 70 expensive points. The RMSE and

The upper panels show PLASIM simulated global mean SAT at the 214 validation points plotted against their emulated values from both kriging (left) and co-kriging (right) emulators. The error bars indicate a 2 standard deviation interval at each point. The lower panels show the results of the global mean SAT–SST emulators.

For both kriging and co-kriging emulators using the same expensive training
points, the emulated global mean SATs at the 214 validation points are
plotted against their simulated values (Fig.

While co-kriging outperformed kriging in both cases, multi-level emulation
does a much better job at predicting SAT than SAT minus SST. Nevertheless,
the

The uncertainty in the emulator predictions, arising from not having
evaluated the model at untried input configurations, is called the “code
uncertainty”

The following analysis attempts to explain the processes and parameters that
determine the spatial distributions of SAT in GENIE-1 and PLASIM using PCA.
SVD was applied to two (

Percentage of variance in SAT, explained by the first 10 EOFs for GENIE-1 with EMBM and with PLASIM. The 150-member ensembles are used to obtain these values.

The high percentage of variance explained by the retained EOFs means that by successfully emulating them, the SAT field of PLASIM can be accurately estimated. For EMBM data to be useful, its EOFs and PCs need to carry meaningful information about PLASIM's modes. To verify this, an analysis of the EOFs and PCs of the two models is carried out.

The first EOFs of EMBM and PLASIM SAT (upper) and the universal
kriging emulator coefficients of their corresponding PCs (lower). All 600
data points are used to train each of these emulators. The black cells in
PLASIM EOF1 indicate values lower than the plotted range. Contours are drawn
over both plots at a 2

The first EOFs of SAT in both models are illustrated in Fig.

The first EOF for both models is of the same sign globally, suggesting a
change in the radiation budget due to the greenhouse gas and the albedo
effects. The effects due to changing glacier condition and atmospheric

The second EOFs of EMBM and PLASIM SAT (upper) and the universal
kriging emulator
coefficients of their corresponding PCs (lower). All 600 data points
are used to train each of these emulators. The white cells in PLASIM EOF2 indicate
values higher than the plotted range. Contours are drawn over both plots at a 2

The second EOFs in EMBM and PLASIM exhibit changes of opposite sign at
the Equator and in the polar regions, reflecting a redistribution of the heat budget
(Fig.

With emulator coefficients of approximately 0, the dummy variable is
correctly identified as an inactive parameter in all cases
(Figs.

These EOFs indicate similar modes of variability in GENIE and PLASIM,
fulfilling the assumption made for co-kriging. The extra training points from
EMBM, therefore, are expected to provide inference on PLASIM's behaviour.
Each pair of PCs from EMBM and PLASIM forms a set of cheap and expensive
training data for the corresponding emulator. Even though this is applied to
all 10 PCs, according to Table

Although all 600 data points are used to train each of these emulators, results obtained from smaller subsets show no systematic differences.

The assumptions made for Eq. (

We retained the first 10 EOFs of EMBM and PLASIM SAT, which describe 99.93
and 99.35 % of the simulated ensemble variance, respectively
(Table

Using the same procedure as described in Sect.

Kriging emulators using only the expensive data from PLASIM are also
constructed for comparison. Again, co-kriging outperforms kriging when the
same 50 expensive training points are used. More expensive points are then
added to the kriging emulators and for approximately 150 points, similar RMSE
and

Validation of each PC emulator using the 59-member ensemble. The correlation coefficients show how well matched the emulated PCs are compared with the simulated values. The co-kriging emulator uses 50 expensive points and 150 cheap points while the kriging emulator here uses the same 50 expensive points.

The co-kriging (trained with 50 expensive and 150 cheap points) and kriging
(trained with 50 expensive points) are validated using the 214-member
validation set. Both the individual PCs and the final reconstructed SAT are
validated against true values. First, to test the emulator's ability to
reproduce PC values, each emulated PC is validated against those decomposed
from the simulated ensemble (Table

The 10 co-kriging emulators of PLASIM PCs are then used to reconstruct the
SAT fields at each validation point. To validate the emulated SAT fields,
the quality of the individual emulations and the spatial pattern of the
emulated field are tested. In order to test the proportion of the total
ensemble variance captured by the emulator:

Figure

Comparison between kriging (dashed line) and co-kriging (solid line)
emulators. The variance explained (blue) when each PC is added is shown
together with the RMSE (red) of the corresponding reconstructed validation
SAT fields. The dot-dashed lines represent the same values obtained if
the emulator were perfect. The deviations of these lines from
RMSE

Mean and standard deviation of the emulated (upper and middle left) and simulated (upper and middle right) validating ensembles. The emulated–simulated differences in mean (lower left) and standard deviation (lower right) are also shown.

Figure

In the work presented here, only annually averaged fields are considered. The
generalisation to emulate monthly average fields or seasonal cycles is
straightforward. We simply have to replace the current

Mean (upper panel) and standard deviation (lower panel) of the SAT
anomaly corresponding to a doubling of atmospheric

We have demonstrated that information from a cheap atmospheric model (EMBM)
can be used to improve predictions of the steady-state behaviour of an
expensive atmospheric model (PLASIM) in unsampled parts of
parameter-/boundary-forcing space. This behaviour is a function of the
boundary conditions on the atmospheric model (SST, long-wave and shortwave
radiative forcing), as represented in this statistical study by the 13 parameters. This technique has advantages when attempting to understand or
project the decoupled response of individual climate system components to
their boundary conditions. For example, in the context of impact assessment
models, the spatial pattern of changes in SAT and precipitation is often needed
to study the impact of climate change on areas such as health, land use and
energy production. These spatial temperature and precipitation response
patterns are obtained from climate models forced by arbitrary

In reality, changes to the climate system components considered here will feed back on other climate system components; i.e., if the present study were extended to the fully coupled system, differences in SAT, wind stress and the hydrological cycle between PLASIM and the EMBM would feed back on SST and sea-ice distribution.

Within this context, we now explore the relationship between the “climate
sensitivities” of the EMBM and PLASIM atmospheres, both forced by
GENIE–EMBM SSTs as discussed above, before considering how our approach could
in future be extended to the fully coupled system. Our 600-member ensemble
design generated in Sect.

The average SAT anomalies due to a doubling of atmospheric

The upper panel shows the probability distributions of EMBM (red) and PLASIM (blue) climate sensitivities. The mean of each distribution is denoted by the dot-dashed line of the same colour. The lower panel shows a plot of PLASIM anomalies against EMBM anomalies. The coefficients of the linear function fitted through the data are included in the figure.

In a hypothetical coupled experiment, it is reasonable to speculate that the
generally larger response of SAT to

We have described in this paper the development and evaluation of large ensembles of GENIE-1 and PLASIM simulations for application in statistical emulation.

For this work, we employ the non-parametric fitting method of Gaussian
process emulation. Two variations of this well-established method, kriging
and universal kriging, are briefly described in Sect.

To efficiently extend this method from emulating scalar output to emulating high-dimensional output, e.g. the 2-D SAT fields, principal component analysis is used. This powerful technique decomposes the output surface fields of both EMBM and PLASIM models into orthogonal EOFs, scaled by the respective PCs. The EOFs are, however, statistical modes, and a direct connection to physical processes cannot always be drawn. Emulator coefficients of the PCs corresponding to these modes, however, can provide a link between them and the varying model parameters, allowing for better interpretation of the model behaviour. It also allows us to identify and preserve the correlation between grid cells.

Here, the first five PCA modes are emulated instead of individual grid cell values, reducing the computational cost significantly. Although not explored in this work, the links between different model outputs may also be exploited to allow for further reduction of dimension when emulating multivariate output.

A multi-level emulation technique, co-kriging, is used to build both scalar and high-dimensional output emulators for PLASIM with additional information from EMBM. The constructed co-kriging emulators successfully estimate both the global mean SAT and the 2-D array of SAT fields of PLASIM as functions of the 13 GENIE-1 parameters. Being cheaper to evaluate, EMBM can be used to sample GENIE-1's parameter space more finely, providing information where PLASIM data are sparse. Despite the two models being structurally unrelated, the link between EMBM and PLASIM is successfully established, resulting in PLASIM emulators being built using a smaller amount of expensive data. The combination of PCA with co-kriging allows us to emulate accurately the spatial pattern of PLASIM SAT despite the model having a different response to EMBM's. Emulated outputs are validated against simulated values using a separate validation ensemble. Both spatial pattern and magnitude of SAT are well reproduced across the ensemble. Apart from the ensemble mean and standard deviation, individual simulations are also successfully emulated with high accuracy. The emulators, however, show a tendency to underestimate the variance spatially and across the ensemble. This is unavoidable because of the dimensional reduction process. The quantification of the emulator uncertainties is beyond the scope of this paper and should be explored in further studies in order to improve the emulators' performance.

Here, we have focused only on SAT but this method can be applied to other
variables of the atmosphere, such as precipitation (PPTN) or wind fields. In
the case of PLASIM, co-kriging emulation of PPTN using GENIE's PPTN field as
a fast approximation is unlikely to work well since the description of this field in the
two models differs quite significantly. The same goes for other PLASIM
quantities, which have no equivalent in EMBM. However, it is possible that
other GENIE-1 fields might be more suitable as the fast approximation to
PLASIM's PPTN, e.g. SST or elevation. Work has been done in the past using
elevation as a fast approximation for PPTN

This work establishes the technique for emulating the equilibrium response of
the model. Compared to available efficient frameworks such as the MIT IGSM-CAM (Massachusetts Institute of Technology – Integrated Global System Model linked with the National Center for Atmospheric Research (NCAR) Community Atmosphere Model)

We have demonstrated that multi-level emulation across structurally unrelated
models provides useful information more efficiently than using either model
in isolation. Several challenges remain before a coupled model making use of
such an emulator can be constructed, and the steady-state vs. transient
issue is one of them. The seasonality, which is currently lacking, will also
be included by the modification described in Sect.

The advantage of the emulation technique used here is that it does not depend on a fixed set of models and can be applied to a wide range of models for different applications. It also provides a useful tool for coupling models of different fidelity and resolution. The emulators, however, are built for specific applications and so care should be taken to avoid extrapolating beyond the emulated space.

In conclusion, the work presented here demonstrates a concept with applications not only in climate research but also in a wide range of problems where multi-level computer models are available.