Changes in the distribution of annual maximum temperatures in Europe

Abstract

In this study we detect and quantify changes in the distribution of the annual maximum daily maximum temperature (TXx) in a large observation-based gridded data set of European daily temperature during the years 1950–2018. Several statistical models are considered, each of which analyses TXx using a generalized extreme-value (GEV) distribution with the GEV parameters varying smoothly over space. In contrast to several previous studies which fit independent GEV models at the grid-box level, our models pull information from neighbouring grid boxes for more efficient parameter estimation. The GEV location and scale parameters are allowed to vary in time using the log of atmospheric CO2 as a covariate. Changes are detected most strongly in the GEV location parameter, with the TXx distributions generally shifting towards hotter temperatures. Averaged across our spatial domain, the 100-year return level of TXx based on the 2018 climate is approximately 2 °C (95 % confidence interval of [2.03, 2.12] °C) hotter than that based on the 1950 climate. Moreover, averaged across our spatial domain, the 100-year return level of TXx based on the 1950 climate corresponds approximately to a 6-year return level in the 2018 climate.


Introduction
The greenhouse effect, whereby increasing levels of greenhouse gases in the Earth's atmosphere lead to a warming of the climate system, has long been understood (Charney et al., 1979), and in 2019, atmospheric CO2 concentrations were higher than at any time in at least 2 million years (IPCC, 2021b). Allen et al. (2018) estimate that human-induced warming in 2017 reached approximately 1 °C above pre-industrial levels and is increasing at a rate of approximately 0.2 °C per decade. Hoegh-Guldberg et al. (2018) describe the impacts of 1.5 °C global warming above pre-industrial levels on natural and human systems. These impacts include an increase in the frequency and intensity of heavy-precipitation events, more frequent marine heatwaves and reduced crop production and yields.
Temperature extremes, which may manifest in more intense heatwaves and enhance the risk of fires, pose a risk to human health (IPCC, 2014, Sect. 2.3.2), with the elderly being particularly vulnerable to heat-related mortality (Basu and Samet, 2002). An estimated 40 000–70 000 heat-related deaths were recorded as a result of the 2003 European summer heatwave (Fischer and Schär, 2010; Robine et al., 2008), with associated economic losses in excess of EUR 13 billion (de Bono et al., 2004). Due to the potentially devastating consequences, it is clearly important to understand how the frequency and intensity of temperature extremes may change in a warming climate.
Several previous studies consider changes in the probability distribution of daily temperature and infer that similar changes should also hold for extremes. Donat and Alexander (2012) consider the distribution of daily maximum and minimum temperature on a global scale using observational data and find significant shifts in temperature towards higher values in almost all regions but less evidence for changes in variability. Similar conclusions are reported in Weaver et al. (2014), who analyse data from several hundred climate model runs. Schär et al. (2004), on the other hand, argue that an increase in variability in the daily temperature distribution is required to explain the European heatwave of 2003. Kiktev et al. (2003) and Morak et al. (2013) both find that there has been a decrease in the frequency of cold extremes and an increase in the frequency of hot extremes, concluding that human-induced forcing has played an important role. Stott et al. (2004) consider human influence on the summer heatwave of 2003 and find that "it is very likely (confidence level > 90 %) that human influence has at least doubled the risk of a heatwave exceeding this threshold magnitude". Zwiers et al. (2011) use observational data together with climate model output in a detection and attribution study of changes in temperature extremes. They consider several variables, including annual maximum daily maximum (TXx) and minimum (TNx) temperatures, and find evidence for anthropogenic forcing for all variables they consider, with the biggest changes being detected in TNx. More recently, the IPCC report (IPCC, 2021b) concluded that "human-induced climate change is the main driver" of the increase in intensity and frequency of hot extremes.
In this paper we consider statistical models for the variable TXx at approximately 12 000 locations of a gridded data set in a large subset of Europe. We consider the question of whether, over various large sub-regions of Europe, there is evidence for changes in the distributions of TXx and, if so, how such changes are best described. Our approach can, informally, be viewed as macroscopic, since we are interested in detecting changes in TXx on a large scale rather than at any one specific geographic location. We fit statistical models that allow for changes in both the location and scale of the TXx distributions. A change in the location of the TXx distribution corresponds to a horizontal shift in the distribution, with the mean and all quantiles being shifted by the same amount. A change in scale corresponds to a horizontal stretching or compression of the distribution, which in turn changes measures of variability, such as the variance of TXx. Figure 1 illustrates both of these effects for a hypothetical TXx distribution.
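To make these two effects concrete, the short R sketch below (R being the language used for all model fitting in this paper) plots a hypothetical TXx density alongside a location-shifted and a scale-inflated version, in the spirit of Fig. 1; all parameter values are invented for illustration.

```r
# A minimal sketch (hypothetical parameter values, not fitted ones) of the two
# effects illustrated in Fig. 1: a location shift and a scale change of a GEV
# density. The GEV density is coded directly so no extra packages are needed.
dgev0 <- function(y, mu, sigma, xi) {
  z <- 1 + xi * (y - mu) / sigma
  ifelse(z > 0, z^(-1 / xi - 1) * exp(-z^(-1 / xi)) / sigma, 0)
}
y <- seq(25, 45, length.out = 500)
f_base  <- dgev0(y, mu = 33, sigma = 1.5, xi = -0.2)  # baseline TXx density
f_shift <- dgev0(y, mu = 35, sigma = 1.5, xi = -0.2)  # location shifted by +2
f_scale <- dgev0(y, mu = 33, sigma = 2.0, xi = -0.2)  # scale increased
matplot(y, cbind(f_base, f_shift, f_scale), type = "l", lty = 1,
        xlab = "TXx (deg C)", ylab = "Density")
legend("topright", c("baseline", "location shift", "scale increase"),
       col = 1:3, lty = 1)
```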
Most of the studies mentioned above treat the data occurring at different geographic locations in an independent manner, fitting separate statistical models to the data at each location. One difficulty with this approach in the context of extremes is that, as extreme observations are by definition rare, we will only have a small sample at each location, making precise estimation of trends problematic. Although it may be unreasonable to assume a common trend at every geographic location of a large spatial domain, we would nonetheless expect nearby regions to be similarly affected by climate change. There are several classes of models, such as varying coefficient models (Hastie and Tibshirani, 1993) or geographically weighted regression models (Brunsdon et al., 1998), that allow us to borrow strength from neighbouring locations to obtain spatially coherent estimates of trends. Varying coefficient models allow for regression coefficients, e.g. trends, to vary smoothly over a spatial domain and may be formulated under the generalized additive model (GAM) framework of Wood (2017) and consequently fit with the R (R Core Team, 2021) package mgcv. As we work with a gridded domain, we consider a discrete analogue of varying coefficient models that are based on Gaussian Markov random fields (Rue and Held, 2005) and fall under the general smooth modelling framework of Wood et al. (2016). Previous studies that make use of GAMs or smooth models for modelling environmental extremes include Chavez-Demoulin and Davison (2005) and Youngman (2019).
The lack of availability of high-resolution, continental-scale, temporally complete and homogenized observational data, together with the impracticality of performing large-scale controlled experiments on the climate system, means that climate researchers often rely on gridded data products (New et al., 2002). Common types of gridded data include climate model output (Eyring et al., 2016), reanalysis data (Hersbach et al., 2020) and gridded station data (Cornes et al., 2018; Dunn et al., 2020).
Gridding of station data is performed using aggregation of stations within spatial boxes, often with estimates of uncertainty. This yields estimates of area-averaged data that are comparable with climate model data over a similarly sized grid box, making them widely used for climate model evaluation (Kim et al., 2020), although they have also been used to detect past changes in mean climate and climate extremes (Haug et al., 2020; Zwiers et al., 2011). Alternatives include kriging of station data (Rohde and Hausfather, 2020), which tends to yield similar results to grid-box-averaged data over densely covered regions. Reanalysis data are data from weather analysis models that operate on grids by construction and are also often used to detect and correct biases that exist in climate models (Thorarinsdottir et al., 2020). The methodology that we develop in this paper for gridded data may be useful in other contexts and help improve the robustness of findings compared to the commonly applied independent grid-box analyses.
The structure of the paper is as follows. Section 2 describes the data that are used for fitting statistical models. Section 3 describes the models that are considered and Sect. 4 presents the results which are summarized in Sect. 5.

Data
We use the daily E-OBS data, publicly available through the European Climate Assessment and Dataset (ECA & D) project. E-OBS is based on observational data from an underlying network of weather stations interpolated onto a regular 0.25° × 0.25° grid. Although the data set covers all of Europe as well as northern Africa and the Middle East, the spatial density of the underlying weather station network that is used to estimate the gridded areal averages is highly variable over the domain.
E-OBS is frequently used as a benchmark at the European scale (Kotlarski et al., 2017) and was also used in the most recent IPCC report, e.g. for the atlas (IPCC, 2021a) of observed trends in temperature and precipitation. It is used in Haug et al. (2020), who also seek to provide a more rigorous methodology beyond the independent grid-box-level analyses by fitting a spatial model for trends in mean temperatures in Europe. However, they use an earlier version of E-OBS that is less suitable for detection of trends than the version we use here.
Both Hofstra et al. (2009) and Hofstra et al. (2012) express reservations about using E-OBS for the detection of trends, mainly due to inhomogeneities that may be present in the underlying station data, i.e. non-climatic factors such as changes in instruments or observing practices, as well as the fact that the network density is not homogeneous in time. The documentation accompanying the release of E-OBS v18.0 also comes with a similar warning: "it remains the case that many of the input station series have not been homogenized and at present we caution against the use of E-OBS for evaluating trends" (Cornes et al., 2018). For this reason we use the E-OBS v19.0eHOM data, a version of E-OBS that has been homogenized by the ECA & D in collaboration with the Horizon 2020 EUSTACE project. The method by which the data were homogenized is described in Squintu et al. (2019).
In addition to inhomogeneities, a further issue with observation-based gridded data is that in regions with very low station density, grid-box areal averages may be poorly estimated and have large interpolation uncertainties. However, these problems are less severe for a spatially smooth variable such as temperature than for precipitation (Doblas-Reyes et al., 2021, Sect. 10.2.2.4). As discussed in Hofstra et al. (2009), the spatial smoothness of temperature means that, although extreme-value methods have not been employed in the gridding of the daily-level E-OBS data, extreme temperature events will overall be quite well represented. Working on a much coarser grid than we do here, Zwiers et al. (2011) argue that the spatial correlation of temperature at very large distances means that even a single weather station should represent the grid-box mean extremes well. As we are interested in detection of large-scale spatially averaged changes in the distribution of TXx, issues such as poor sampling of topography in low-density regions should be less of a concern than if we were seeking to quantify changes at a local level.
A plot of the station network density used in E-OBS can be found in Schrier et al. (2013), which shows that the highest density is in central Europe and that there is particularly low density in northern Africa, the Middle East and eastern Europe. The data set covers the years 1950 to 2018, with some missing data mainly in the early years, although there are also few data for Russia in the last 10 years. We consider a large subset of the full domain covered by E-OBS, shown in Fig. 2, that has reasonable station density. The values displayed in Fig. 2 are the maximum value of TXx recorded during our study period, 1950–2018. With the exception of the United Kingdom and the Republic of Ireland, islands off the mainland are excluded. We set the value of TXx at a given location in a given year as missing if there are more than 10 missing daily values in that year. This is a slightly stricter criterion than is typically applied in other studies; e.g. Zwiers et al. (2011) allow for 15 missing observations. The value of TXx in a given grid box in a given year corresponds to an extreme of a regional average, which is, arguably, more useful for measuring heat risk than a local, point-wise extreme.
For atmospheric CO2 concentration, we use data from the shared socio-economic pathway (SSP) data set compiled in Meinshausen et al. (2020). The historical, observation-based SSP data are only available until the year 2015, after which projections are provided until the year 2500 under different socio-economic scenarios. For the years 2016 to 2018, we took values from a mid-range scenario, namely SSP2-4.5, that are similar to the values recorded at the Mauna Loa Observatory (Keeling et al., 1976). Although the Mauna Loa data have observations available for the years 2016–2018, they have no observations during 1950–1958, so that the first 9 years of our study period would be missing.

Generalized extreme-value distribution
Our approach is based on fitting generalized extreme-value (GEV) distributions to the TXx values at each grid box. Another possible and theoretically well-founded approach to modelling extremes is the peaks-over-threshold method (Davison and Smith, 1990), which models the distribution of exceedances above some large threshold rather than the maximum over large blocks of observations. We prefer the block maxima approach in our setting due to the difficulty in making a principled choice of appropriate thresholds at such a large number of spatial locations as well as the sensitivity of inference to the choice of thresholds, which is exacerbated by the presence of trends (Northrop and Jonathan, 2011).
Just as variations in the mean of a large number of independent and identically distributed random variables are naturally modelled by a normal (Gaussian) random variable, variations in the sample maximum are most naturally modelled by a GEV random variable with distribution function

G(y; θ) = exp[ −{1 + ξ(y − µ)/σ}_+^(−1/ξ) ],   (1)

where θ = (µ, σ, ξ), σ > 0, is a vector of parameters that relate to the location, scale and shape of the distribution respectively and x_+ = max(x, 0). The formal justification for using the GEV distribution to model TXx comes from the extremal types theorem (Coles, 2001, Theorem 3.1.1). The case where ξ = 0 in Eq. (1) should be interpreted as the limit as ξ → 0, which gives rise to the Gumbel distribution function G(y) = exp[−exp{−(y − µ)/σ}], y ∈ R. The case ξ > 0 is known as the Fréchet class of distributions and ξ < 0 as the Weibull class. The three classes (ξ = 0, ξ > 0, ξ < 0) differ from each other in the behaviour of their upper (right) tail. For the Fréchet class, the right tail decays according to a power law, and for larger values of ξ, extremes take on an increasingly volatile nature, such as might be expected in financial (Resnick, 2007) or hydrological (Katz et al., 2002) applications. The Weibull class has an upper bounded right tail, with µ − σ/ξ the theoretical maximum possible value, whereas the Gumbel class is an intermediate case with a light upper tail that decays exponentially. Typically, when modelling annual maximum temperatures, we expect them to be in either the Weibull or Gumbel class, i.e. ξ ≤ 0 (Andrade et al., 2012).
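As a quick numerical illustration of the three tail classes, the sketch below evaluates the upper-tail probability 1 − G(y) under Eq. (1) for positive, zero and negative ξ; the parameter values are illustrative, not fitted.

```r
# Illustrative only: upper-tail probabilities 1 - G(y) for the three classes
# of Eq. (1), with mu = 0 and sigma = 1.
pgev0 <- function(y, mu = 0, sigma = 1, xi) {
  if (abs(xi) < 1e-8) return(exp(-exp(-(y - mu) / sigma)))  # Gumbel limit
  exp(-pmax(1 + xi * (y - mu) / sigma, 0)^(-1 / xi))
}
1 - pgev0(10, xi =  0.2)  # Frechet class: power-law tail decay
1 - pgev0(10, xi =  0.0)  # Gumbel class: exponential tail decay
1 - pgev0(10, xi = -0.2)  # Weibull class: 0, as y = 10 exceeds mu - sigma/xi = 5
```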
Suppose that, in grid box i of the E-OBS data, we observe the annual maximum temperature in a total of n_i years, say t_{i1}, t_{i2}, ..., t_{in_i}, which for most grid boxes is each year from 1950 to 2018 inclusive, so that n_i = 69. Let y_{it_j} denote the annual maximum temperature in grid box i in year t_{ij}, 1 ≤ j ≤ n_i. If we assume that y_{it_1}, y_{it_2}, ..., y_{it_{n_i}} are independent realizations of a GEV random variable with a distribution function as in Eq. (1) and parameters θ_i = (µ_i, σ_i, ξ_i), then one way to estimate θ_i is to find the parameter configuration θ̂_i = (µ̂_i, σ̂_i, ξ̂_i) that maximizes the log-likelihood function l(θ_i), i.e. θ̂_i = arg max l(θ_i), where l(θ_i) = Σ_{j=1}^{n_i} log g(y_{it_j}; θ_i) and g(y) = dG/dy, with G as in Eq. (1), is the GEV density function. The explicit expression for the log-likelihood function is

l(θ_i) = −n_i log σ_i − (1 + 1/ξ_i) Σ_{j=1}^{n_i} log[1 + ξ_i(y_{it_j} − µ_i)/σ_i]_+ − Σ_{j=1}^{n_i} [1 + ξ_i(y_{it_j} − µ_i)/σ_i]_+^(−1/ξ_i),   (2)

with the case ξ_i = 0 being defined by continuity. Although the annual maximum temperatures may not be independent, it is assumed that the dependence between maxima from different years is sufficiently weak that the log-likelihood in Eq. (2) may be used as a reasonable approximation of the true likelihood. The resulting maximum-likelihood estimator θ̂_i is a consistent and asymptotically normal estimator of the true parameter vector provided that ξ > −1/2 (Smith, 1985; Bücher and Segers, 2017). Hosking (1985) gives details for implementing the Newton–Raphson method to find the parameters that maximize Eq. (2), and several R (R Core Team, 2021) packages, e.g. ismev (Heffernan and Stephenson, 2018) or extRemes (Gilleland and Katz, 2016), provide routines for estimating the GEV parameters using maximum likelihood. The maximum-likelihood estimates of the three GEV parameters at each grid box of the E-OBS data, calculated using ismev, are shown in Fig. 3.
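The following minimal sketch shows an independent grid-box fit of this kind using ismev::gev.fit; the data here are simulated stand-ins for a single grid box's TXx series, not E-OBS values.

```r
# A minimal sketch of an independent grid-box fit with ismev::gev.fit, using
# simulated data in place of E-OBS TXx values.
library(ismev)
set.seed(1)
txx <- 33 + 1.5 * (-log(-log(runif(69))))  # 69 hypothetical annual maxima
fit <- gev.fit(txx, show = FALSE)
fit$mle  # maximum-likelihood estimates (mu, sigma, xi)
fit$se   # asymptotic standard errors
```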
Having estimated θ_i, we may estimate the temperature y_p that is exceeded in grid box i in a given year with probability p by solving the equation G(y_p) = 1 − p for y_p, with G as in Eq. (1). This yields the estimate ŷ_p of y_p:

ŷ_p = µ̂_i − (σ̂_i/ξ̂_i)[1 − {−log(1 − p)}^(−ξ̂_i)].   (3)

The quantity y_p is known as the return level with an associated return period 1/p. From Eq. (3) we see that errors in the estimated value of ξ_i may be magnified in the estimate of y_p. When the sample size is small, the maximum-likelihood estimator of ξ_i can have a high bias, leading to absurd estimated return levels that would be deemed physically impossible, and several authors (Coles and Dixon, 1999; Martins and Stedinger, 2000) have proposed adjustments to the log-likelihood function, Eq. (2), to overcome this difficulty.
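Continuing the sketch above, Eq. (3) can be applied to the fitted parameters to estimate, say, the 100-year return level; the helper below is a direct transcription of Eq. (3), with the Gumbel limit handled separately.

```r
# Continuing the ismev sketch above: plugging the fitted parameters into
# Eq. (3) to estimate the 100-year return level (p = 0.01) at this grid box.
gev_rl <- function(mu, sigma, xi, p) {
  if (abs(xi) < 1e-8) return(mu - sigma * log(-log(1 - p)))  # Gumbel limit
  mu - (sigma / xi) * (1 - (-log(1 - p))^(-xi))
}
gev_rl(fit$mle[1], fit$mle[2], fit$mle[3], p = 0.01)
```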
An alternative to maximum-likelihood estimation that is more robust to small sample sizes is the method of L-moments or, equivalently, probability-weighted moments (Hosking, 1990). The method of L-moments is often used in a spatial setting (Kharin and Zwiers, 2005) as part of a regional frequency analysis (RFA), as set out in Hosking and Wallis (2005). In an RFA, data from different regions that are deemed to be sufficiently homogeneous are pooled together to increase the sample size and hence reduce the uncertainty in parameter estimates. One difficulty with RFA is the sensitivity of the results to the method used to identify homogeneous regions. Moreover, L-moment estimation is not suited to statistical modelling as it does not allow the GEV parameters to depend on the values of covariates such as the atmospheric level of CO2. Our approach, described in Sect. 3.2, has a similar motivation to RFA, but as it is likelihood-based, it allows for the inclusion of covariates.

Statistical models
In this section we describe the statistical models that we fit to the E-OBS data. For computational convenience, and also to allow for the possibility that different models may be better suited to different regions, we partition our spatial domain into eight sub-regions, which are defined in Table 1. The abbreviations used for the regions are meant to be informative and correspond, roughly, to south-western Europe and France (SWFR), central and southern-central Europe (CESC), central Europe 2 (CE2), south-eastern Europe (SE), eastern Europe (EAST), Norway and Sweden (NRSW), Finland (FIN) and the United Kingdom and Republic of Ireland (UKRI).
For statistical modelling of TXx, the log-likelihood function in Eq. (2) may be considered too simple in at least two respects. Firstly, it assumes that the GEV parameters at a given grid box remain fixed from year to year, whereas a potentially more realistic model would allow them to change over time. Secondly, we expect that the parameters of neighbouring grid boxes are more likely to be similar than those of grid boxes that are far apart, and Eq. (2) does not allow us to incorporate this belief. Moreover, maximizing Eq. (2) separately for each grid box i may lead to highly uncertain or unrealistic parameter estimates due to the small sample available at each grid box, and for the purposes of statistical inference, we also run into problems with multiple comparisons (Farcomeni, 2008; Chen et al., 2017).
The dependency of the GEV parameters on time can be linked to that of a climatological covariate, and for this purpose we will use the atmospheric concentration of CO2, which is the dominant greenhouse gas affecting temperature (Stips et al., 2016). More specifically, we will use the derived covariate x_t = log(CO_{2,t}/280), where CO_{2,t} is the atmospheric concentration, in parts per million (ppm), of CO2 in year t of our study period, 1 ≤ t ≤ 69, with t = 1 corresponding to the year 1950, and 280 ppm is, approximately, the pre-industrial atmospheric concentration of CO2. The reason for using the log-transformed covariate x_t rather than the raw CO_{2,t} values is the approximate logarithmic effect of CO2 on temperature (Jones and Hegerl, 1998).
We assume that the annual maximum temperature in grid box i in year t, 1 ≤ t ≤ 69, follows a GEV distribution with the time-varying parameter vector θ_it = (µ_it, σ_it, ξ_it). A simple model we may consider, to which we will add further structure and covariates later, is

µ_it = µ_i^(0) + µ_i^(1) x_t,   σ_it = σ_i,   ξ_it = ξ_i.

For this model, only the GEV location parameter is time-varying. The intercept parameter µ_i^(0) can be interpreted as the value of the GEV location parameter in grid box i if atmospheric CO2 were at its pre-industrial level, whereas the slope parameter µ_i^(1) is the change in the location parameter that would occur if x_t increased by 1 unit, i.e. if atmospheric CO2 increased by a factor of e ≈ 2.718. Over the course of our study period, atmospheric CO2 has increased by a factor of 1.31, i.e. CO_{2,69} = 1.31 CO_{2,1}. As in Sect. 3.1, if in grid box i we have observations in years t_{i1}, t_{i2}, ..., t_{in_i}, then, writing θ_i = (µ_i^(0), µ_i^(1), σ_i, ξ_i), the log-likelihood for grid box i is

l_i(θ_i) = Σ_{j=1}^{n_i} log g(y_{it_j}; θ_{it_j}).   (4)

If there are n grid boxes in total, then maximizing Eq. (4) separately for each i, 1 ≤ i ≤ n, is equivalent to jointly maximizing the function l(θ) = Σ_{i=1}^n l_i(θ_i), where θ = (θ_1, θ_2, ..., θ_n) is the vector containing the parameters for all the grid boxes. This is because the jth term in the summation defining l(θ) contains only the parameters of grid box j, which occur in no other terms of the summation, and so the maximization problem is separable. Rather than fitting a model where every grid box is forced to "learn for itself", one way to obtain parameter estimates for all grid boxes that are spatially coherent, and to reduce uncertainty in the estimates, is to add a term to the objective function l(θ) that penalizes model fits where there is too much local variation in the parameters. As we are working on a discrete gridded domain, it is natural to use a penalty that is based on Gaussian Markov random fields (Rue and Held, 2005).

The Gaussian Markov random field (GMRF) penalty allows us to formalize the belief that grid boxes that are near to each other are more likely to have similar parameter values than those that are far apart. In order to define the GMRF penalty, we are required to specify a neighbourhood structure for our domain. Specifically, for each grid box i we specify the set of neighbours of i, which we denote by N(i) and define as those grid boxes that share a common (grid-box) edge with i. Thus, for most grid boxes, N(i) will consist of four neighbours, but in some cases there may be fewer than four, e.g. if i lies on the boundary of the domain. Now, if we let Ñ(i) denote the set of neighbours j of i with j > i, then placing a GMRF penalty on the GEV shape parameters ξ = (ξ_1, ξ_2, ..., ξ_n)^T, for example, amounts to the penalty term

P_GMRF(ξ) = Σ_{i=1}^n Σ_{j ∈ Ñ(i)} (ξ_i − ξ_j)^2 = ξ^T S ξ,   (5)

where the penalty matrix S satisfies S_ij = −1 if i ∈ N(j) and S_ii = n_i, where n_i is the number of neighbouring regions of i, not including i. P_GMRF(ξ) takes larger values when there is more local variability in ξ_i, 1 ≤ i ≤ n. If we impose a GMRF penalty on each of the location intercept, slope, scale and shape parameters, then, writing µ_0 = (µ_1^(0), ..., µ_n^(0))^T, µ_1 = (µ_1^(1), ..., µ_n^(1))^T, σ = (σ_1, σ_2, ..., σ_n)^T and ξ = (ξ_1, ξ_2, ..., ξ_n)^T, the objective function that we seek to maximize is the penalized log-likelihood

l_P(θ) = Σ_{i=1}^n l_i(θ_i) − λ_1 P_GMRF(µ_0) − λ_2 P_GMRF(µ_1) − λ_3 P_GMRF(σ) − λ_4 P_GMRF(ξ),   (6)

with l_i(θ_i) as in Eq. (4) and λ_i > 0, 1 ≤ i ≤ 4, constants. The constants λ_i, 1 ≤ i ≤ 4, are smoothing, or regularization, parameters that specify the relative priorities given to the competing goals of smoothness and fitting a model that closely matches the observed data.
If, for example, λ_4 is extremely large, then we would obtain a fit with a low amount of variability in ξ. Rather than subjectively choosing a value for the smoothing parameters, they may be selected in a more objective manner, by marginal-likelihood maximization as in Wood et al. (2016), and this is the approach taken in the R (R Core Team, 2021) package evgam (Youngman, 2022) that we use for model fitting.
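For concreteness, a hedged sketch of how a fit of this kind might be specified in evgam is given below; it is an illustration rather than the exact code used for this paper. evgam takes a list of formulas for the GEV location, log-scale and shape parameters, and the GMRF penalty corresponds to mgcv's "mrf" smooth. The data objects (df, nbhd, box, x, elev, txx) are hypothetical placeholders.

```r
# A hedged sketch, not the exact fitting code. We assume a data frame df with
# one row per grid box and year: txx (annual maximum), x (log(CO2_t/280)),
# elev (centred elevation, km) and box (a factor whose levels match the names
# of nbhd, a list giving each grid box's neighbours, as required by mgcv's
# "mrf" smooth).
library(evgam)

fmla <- list(
  txx ~ s(box, bs = "mrf", xt = list(nb = nbhd)) +             # mu_i^(0)
        s(box, bs = "mrf", xt = list(nb = nbhd), by = x) +     # mu_i^(1) x_t
        elev,                                                  # beta * elevation
      ~ s(box, bs = "mrf", xt = list(nb = nbhd)) +             # log-scale intercept
        s(box, bs = "mrf", xt = list(nb = nbhd), by = x),      # log-scale trend
      ~ s(box, bs = "mrf", xt = list(nb = nbhd))               # xi_i
)

# Smoothing parameters (the lambdas) are chosen automatically by
# marginal-likelihood maximization; see Youngman (2022) for the magnitude
# (sandwich) adjustment options discussed in Sect. 3.2.
fit <- evgam(fmla, data = df, family = "gev")
summary(fit)
```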
The objective function in Eq. (6) is the same as the log posterior obtained from a Bayesian model specification where the observations from grid boxes i and j, i ≠ j, are conditionally independent given (θ_i, θ_j) and independent intrinsic Gaussian Markov random field priors (Rue and Held, 2005, chap. 3) are placed on each of the parameter vectors µ_0, µ_1, σ and ξ. The conditional independence assumption is standard in Bayesian spatial (Banerjee et al., 2004) and latent Gaussian (Rue et al., 2009) modelling, but note that this is not the same as assuming that the observations from grid boxes i and j, i ≠ j, are independent. The fact that the smoothing parameters, which correspond to hyperparameters in a Bayesian analysis, are found by maximizing a marginal likelihood means that the fitting approach may be regarded as empirical Bayes. However, this approach cannot be considered fully Bayesian, as this would require the smoothing parameters to be given a prior distribution and inference on all parameters to be performed using their posterior distributions. The parameter vector θ is estimated by θ̂, which maximizes the penalized likelihood (Eq. 6). Confidence intervals for any component of θ, or linear combination of components, can be computed based on the asymptotic normality of θ̂ (Wood et al., 2016, Sect. 2).
Although commonplace, the conditional independence assumption implied by Eq. (6) is still something of an idealization that should not be expected to hold exactly. For example, adjacent grid boxes may both be affected by the same heatwave event, leading to the annual maxima of these grid boxes occurring on the same day. The consequence of basing inference on Eq. (6) when the conditional independence assumption is not satisfied is that confidence intervals of model parameters will be narrower than those obtained from the true model, as we essentially exaggerate the amount of information contained in the data about the model parameters. In order to counteract the likely misspecification of conditional independence, in addition to basing inference on Eq. (6), we also consider applying the magnitude adjustment of Ribatet et al. (2012) (see also Chandler and Bate, 2007) to the likelihood, so that uncertainty in parameters is quantified in a more realistic manner. The correction amounts to replacing the term Σ_{i=1}^n l_i(θ_i) in Eq. (6) with k Σ_{i=1}^n l_i(θ_i) for some k ∈ (0, 1]. The constant k may be interpreted as the effective proportion of locations with independent data and may be estimated from the data as the reciprocal of the mean of the eigenvalues of the Godambe information matrix. In the evgam (Youngman, 2022) package, the magnitude adjustment is implemented via the sandwich.args argument. It is still assumed that observations from different years are independent. Estimating k from the data requires inverting and calculating the eigenvalues of very large matrices, and for the largest regions, CESC, EAST and NRSW defined in Table 1, evgam was unable to perform these computations in a numerically stable manner. In these cases we simply set the value of k to 0.34, which is approximately the mean of the estimated values in the other regions. In the other five regions there is little variability in the estimates of k, with all values lying between 0.31 and 0.39.
The model that has been described so far in this section contains only a single covariate in the GEV location parameter, and we have seen how the effect of this covariate on the annual maximum temperature can be modelled as smoothly varying over space by using the GMRF penalty. The value, x_t, of the covariate in year t is taken to be the same at each grid box, so that the covariate is spatially homogeneous. We may also include spatially varying covariates in our model, and for this purpose we include elevation (km) as a covariate in the GEV location parameter. From Fig. 3, it is clear that larger values of elevation tend to be associated with smaller values of the GEV location parameter. The framework of Wood et al. (2016) allows us to model covariates as having a generally smooth, rather than simply linear, effect on the location parameter. However, based on exploratory model fits, we find it adequate to specify elevation as having a linear effect on the GEV location term. We also consider models that have trends in the GEV scale parameter. In total, we consider five different models that differ from each other with regard to the inclusion of the covariate x_t = log(CO_{2,t}/280). For each model, it is assumed that Y_it ∼ GEV(µ_it, σ_it, ξ_it), where, as before, Y_it corresponds to the value of TXx in grid box i in year t. The differences between the models with regard to the inclusion of trends are summarized in Table 2. Consistent with most of the literature, we assume that the shape parameters vary only in space and not in time.
The most complex model is Mod4, which corresponds to the following formulas for the GEV parameters:

µ_it = µ_i^(0) + µ_i^(1) x_t + β elevation_i,
log σ_it = σ_i^(0) + σ_i^(1) x_t,
ξ_it = ξ_i,

where x_t = log(CO_{2,t}/280), elevation_i corresponds to the elevation (km) of grid box i minus the mean elevation across all grid boxes, and independent GMRF penalties are placed on µ_0, µ_1, σ_0 = (σ_1^(0), ..., σ_n^(0))^T, σ_1 = (σ_1^(1), ..., σ_n^(1))^T and ξ. The fixed effect β gives the change in the GEV location parameter for a 1 km increase in elevation. The trend in the scale parameter is modelled using the log link to ensure that the scale remains positive. Mod1, Mod2 and Mod3 are each special, simpler, cases of Mod4. In particular, Mod1 has µ_1 = 0 and σ_1 = 0, corresponding to the situation where there is no climate change signal detectable in TXx, whereas Mod2 and Mod3 correspond to the cases σ_1 = 0 and µ_1 = 0 respectively. Mod5 has only a trend in the GEV location parameter, but this is modelled as a fixed effect; i.e. the same trend is assumed at each geographic location, corresponding to µ_it = µ_i^(0) + µ_1 x_t + β elevation_i, where the trend µ_1 does not depend on i.

Table 2. Comparison of Mod1–Mod5 according to the inclusion of a trend in x_t = log(CO_{2,t}/280) in the GEV location (µ) and log-scale (log σ) parameters and whether these trends are assumed to vary over space, i.e. are spatially varying (SV), or the same trend is assumed at each grid box, i.e. are spatially homogeneous (SH). All the models include elevation (altitude) as a covariate.

Model   Trend in µ   Trend in log σ
Mod1    No           No
Mod2    Yes (SV)     No
Mod3    No           Yes (SV)
Mod4    Yes (SV)     Yes (SV)
Mod5    Yes (SH)     No

To illustrate the effect and benefit of using the GMRF smoothing penalty, we compare, for region UKRI, the independent grid-box fits based on maximizing Eq. (4) separately for each i, using the R package ismev, with joint maximization of the penalized log-likelihood (Eq. 6), with magnitude-adjusted likelihood, for the smooth model using evgam. The value of the constant k for the magnitude adjustment of the conditional independence likelihood was estimated to be 0.35. Figure 4a and b show the fitted values of the GEV shape parameters, ξ, for the independent grid-box and smooth model fits respectively. Although the broad spatial pattern of fitted shapes is similar in both cases, the fitted shapes for the smooth model encompass the more plausible range of −0.44 to −0.07, compared to the independent grid-box fits, which range from the extremely short tail of −0.71 to the heavy-tailed case of 0.16. The reduction in uncertainty that occurs by including neighbouring information in the model-fitting procedure is also illustrated in Fig. 4c. This shows the ratio of parameter uncertainty, as measured by the standard error, for the independent grid-box model fits relative to the smooth model. For the independent grid-box model fits, the standard errors were computed based on asymptotic normality of the maximum-likelihood estimators, and for the smooth model, standard errors were also computed based on asymptotic normality using the results of Wood et al. (2016, Sect. 2). The mean ratio is equal to approximately 3.7, which represents the average reduction in uncertainty achieved by including neighbouring information in the model fitting.

Changes in return levels and risk ratios
In Sect. 3.1 we defined the return level with a return period 1/p to be the temperature that, in a stationary climate, is exceeded in a given year with probability p. In a non-stationary climate, this quantity will typically vary from year to year. For grid box i in year t, 1 ≤ t ≤ 69, we modify Eq. (3) and define y_it(p) by

y_it(p) = µ_it − (σ_it/ξ_it)[1 − {−log(1 − p)}^(−ξ_it)].   (7)

Figure 4. Plots (a) and (b) show the fitted GEV shape parameters, ξ, for independent, i.e. separate, grid-box fits compared to a smooth model fit using the GMRF penalty. Plot (c) shows the ratio of the parameter uncertainty, as measured by the standard error, for the independent grid-box fits relative to the smooth model fit which uses information from neighbouring grid boxes.
The quantity y_it(p) may be interpreted as the return level with return period 1/p if the climate were stationary in the same state as in year t. We consider for each grid box i the difference y_i69(0.01) − y_i1(0.01), which tells us the difference in the 100-year return levels in grid box i based on the 2018 and 1950 climates. We quantify the uncertainty in the return-level differences via Monte Carlo simulation. In particular, at each grid box, we sample 2000 values of µ_it, σ_it and ξ_it for t = 1 and t = 69 from their sampling distributions and calculate the corresponding differences y_i69(0.01) − y_i1(0.01). We then calculate the 0.025 and 0.975 quantiles of these 2000 values to give an approximate 95 % confidence interval for the return-level differences. An example of how to perform these steps using the evgam package can be found in Sect. 4.3 of Youngman (2022).
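A hedged sketch of this Monte Carlo step for a single grid box is given below; the parameter means and covariance matrix are invented placeholders, whereas in practice they would be extracted from the fitted evgam model.

```r
# A hedged sketch of the Monte Carlo step for one grid box. The parameter
# means and covariance below are invented; in practice they come from the
# fitted model (see Youngman, 2022, Sect. 4.3).
library(MASS)
rl <- function(mu, sigma, xi, p = 0.01) mu - (sigma / xi) * (1 - (-log(1 - p))^(-xi))
theta1950 <- c(31.0, 1.4, -0.15)       # hypothetical (mu, sigma, xi) for 1950
theta2018 <- c(33.1, 1.3, -0.15)       # hypothetical (mu, sigma, xi) for 2018
V <- diag(c(0.05, 0.03, 0.01)^2)       # hypothetical sampling covariance
d1950 <- mvrnorm(2000, theta1950, V)
d2018 <- mvrnorm(2000, theta2018, V)
diffs <- rl(d2018[, 1], d2018[, 2], d2018[, 3]) -
         rl(d1950[, 1], d1950[, 2], d1950[, 3])
quantile(diffs, c(0.025, 0.975))       # approximate 95 % confidence interval
```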
Another way that we quantify changes in the distribution of the annual maximum temperatures is via risk ratios (Titley et al., 2016). If Y_it denotes a random variable with the same distribution as the annual maximum temperature in grid box i in year t, i.e. GEV with parameter vector θ_it = (µ_it, σ_it, ξ_it), then we define the risk ratio for grid box i as

RR_i = P(Y_i69 > y_i1(0.01)) / P(Y_i1 > y_i1(0.01)).   (8)

The value of the ratio (Eq. 8) then tells us, in grid box i, how many times more likely the 100-year return level based on the 1950 climate is to be exceeded in the 2018 climate.
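Reusing the rl() helper and hypothetical parameter values from the sketch above, the risk ratio at a single grid box can be computed as follows; note that the denominator in Eq. (8) is 0.01 by the definition of the return level.

```r
# Sketch of the risk ratio (Eq. 8) at one grid box, reusing rl() and the
# hypothetical parameter values from the return-level sketch above.
pgev1 <- function(y, mu, sigma, xi) exp(-pmax(1 + xi * (y - mu) / sigma, 0)^(-1 / xi))
y100_1950 <- rl(31.0, 1.4, -0.15)                # y_{i1}(0.01)
p2018 <- 1 - pgev1(y100_1950, 33.1, 1.3, -0.15)  # exceedance probability in 2018
p2018 / 0.01                                     # estimated risk ratio
```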
We quantify the uncertainty in the estimated risk ratio in the same way as the return-level difference, via simulation.

Results
All the models were fitted using the R (R Core Team, 2021) package evgam (Youngman, 2022) on a Dell PowerEdge R430 computer running Scientific Linux 7 with four Intel Xeon E5-2680 v3 processors. As this is a shared departmental cluster, our access was restricted to 10 cores. For a given sub-region in Table 1, we assess the performance of each of the five models in Table 2 using several scoring rules (Gneiting and Raftery, 2007) in a 5-fold cross-validation (Stone, 1974; Hastie et al., 2009). The scoring rules considered are the squared error (SE), Dawid-Sebastiani (DS), continuous ranked probability (CRP) and weighted continuous ranked probability (WCRP) scores. All the scoring rules we consider are negatively oriented, so that a smaller score indicates better performance. Further information on the scoring rules and the cross-validation procedure can be found in Appendices A and B respectively. For a given sub-region in Table 1 the cross-validation was performed in parallel using the R package parallel. The total compute time for the full cross-validation in all sub-regions was between 2 and 3 weeks. A version of the Akaike information criterion (AIC) that is appropriate for the class of general smooth models we fit was developed in Wood et al. (2016), and we also report this value for each model. As with the scoring rules, models with smaller AIC values are preferred. Within the main text and figures, all reported confidence intervals are based on the magnitude-adjusted likelihood of Ribatet et al. (2012) as described in Sect. 3.2.

Table 3 shows the mean scores of each model by region from the 5-fold cross-validation. These scores are based on the conditional independence likelihood model fits using Eq. (6). The corresponding model scores (not shown) when performing the magnitude adjustment to the likelihood are very similar, leading to the same conclusions. For all regions, Mod4, which includes a spatially varying trend in both the GEV location and log-scale parameters, is the best-performing model, closely followed by Mod2, which only contains the spatially varying trend in the GEV location parameter. Mod5, which has a fixed effect of the covariate x_t = log(CO_{2,t}/280) in the location parameter, i.e. a constant trend at all grid boxes in a given sub-region, is generally the next best-performing model. Mod1, which has all of the GEV parameters fixed in time, is the worst-performing model in all regions according to all the scores, followed by Mod3, which has only a spatially varying trend in the GEV log-scale parameter.

Table 4. Approximate 95 % confidence intervals for the spatially averaged 100-year return-level differences (°C), based on the 2018 and 1950 climates (2018 return level minus 1950 return level), and risk ratios for each region defined in Table 1. The results are based on Mod4, which includes a trend in the GEV location and log-scale parameters using the covariate log(CO_{2,t}/280). The endpoints of the intervals were calculated using Monte Carlo simulation. We simulated values of the GEV parameters from their sampling distributions at each grid box based on the 2018 and 1950 climates. We then calculated the return-level differences and risk ratios at each grid box and calculated the mean across the region. This procedure was repeated 2000 times. The α/(2 × 8) and 1 − α/(2 × 8), with α = 0.05, empirical quantiles of the 2000 estimated means give the left and right endpoints respectively of the intervals shown, where we have corrected for multiple comparisons using the Bonferroni correction. The more conservative (adjusted) intervals are based on the magnitude correction to the likelihood as in Ribatet et al. (2012).

Table 5. P values from a test of pair-wise exchangeability of CRP/WCRP scores for the two best models, Mod4 and Mod2, by region.

Region   CRP p value   WCRP p value
SWFR     < 10^-6       < 10^-6
CESC     < 10^-6       < 10^-6
CE2      < 10^-6       < 10^-6
SE       3 × 10^-6     8.4 × 10^-5
EAST     < 10^-6       < 10^-6
NRSW     < 10^-6       4.3 × 10^-5
FIN      < 10^-6       < 10^-6
UKRI     < 10^-6       < 10^-6

Figure 8. Spatially averaged distributions of TXx in 1950 and 2018 for each region defined in Table 1 and all regions together (bottom right plot). The GEV parameters in 1950 and 2018 are calculated using Mod4, which has trends in the GEV location and log-scale parameters using the covariate log(CO_{2,t}/280).
Table 5 shows the p values, by region, of the hypothesis test that the CRP scores shown in Table 3 for Mod4 and Mod2 (the two best-performing models) are pair-wise exchangeable. The testing procedure is described in Appendix B. Failure to reject the null hypothesis would, informally, mean that the performances of Mod4 and Mod2 are statistically indistinguishable. In all regions, the null hypothesis of pair-wise exchangeability is rejected at the 0.05 and 0.01 levels of significance after adjusting the raw p values in Table 5 using the Bonferroni correction for multiple comparisons, giving evidence in favour of the better-performing Mod4. We also perform the same test for the WCRP scores of Mod4 and Mod2. For all regions the null hypothesis of pair-wise exchangeability of WCRP scores is rejected at the 0.05 and 0.01 levels of significance, giving evidence in favour of Mod4. Thus, we find strong evidence in favour of both changes in location and scale of the distributions of TXx since 1950.

Figure 5 shows the difference in 100-year return levels based on the 2018 and 1950 climates, for both Mod4 and Mod2, along with approximate 95 % confidence intervals, calculated using Monte Carlo simulation, as described in Sect. 3.3. The corresponding risk ratio plots are shown in Fig. 6. Mod2, which only has a trend in the GEV location parameter, tends to find slightly larger increases in the 100-year return levels based on the 2018 climate compared to 1950, but the spatial pattern broadly agrees with that produced by Mod4, and similar comments apply for the risk ratios. All the regions show, on average, significant increases in return levels and risk ratios greater than one, as shown in Table 4. Region NRSW, comprising Norway and Sweden, stands out as having relatively moderate changes compared to the other regions, with the mean increase over the region in the 100-year return level being around 0.6 °C. There are several locations within this region where negative changes are detected, although upon inspecting the approximate 95 % confidence intervals in Figs. 5 and 6, most of these changes do not appear to be significant. Even in this region of relatively modest changes, the 100-year return level based on the 1950 climate is estimated to be, on average, approximately 4 times more likely to be exceeded in the 2018 climate. The region EAST, comprising eastern European countries, shows the most dramatic increases, with a mean 100-year return-level difference of around 2.8 °C and a mean risk ratio in excess of 26. Averaging over the entire spatial domain, we find that, under Mod4 (change in location and scale), the mean difference between the 100-year return levels based on the 2018 and 1950 climates is 2.08 °C, with a 95 % confidence interval of [2.03, 2.12]. For Mod2 (change in location only), the mean difference is 2.29 °C, with a 95 % confidence interval of [2.26, 2.33]. Similarly, under Mod4, the mean risk ratio over the entire spatial domain is 16.12, with a 95 % confidence interval of [15.74, 16.53], so that on average a 100-year return level in the 1950 climate corresponds approximately to a 6-year return level in the 2018 climate. For Mod2, the mean risk ratio increases to 18.00, with a 95 % confidence interval of [17.65, 18.34].

Table 6. Approximate 95 % confidence intervals for the spatially averaged changes in the GEV location and scale parameters (2018 parameter value minus 1950 parameter value) for each region defined in Table 1. Calculations are based on Mod4, which has trends in the GEV location and log-scale parameters using the covariate log(CO_{2,t}/280). The endpoints of the intervals are calculated by Monte Carlo simulation as described in the caption to Table 4. The more conservative (adjusted) intervals are based on the magnitude correction to the likelihood as in Ribatet et al. (2012).
Changes in the GEV location and scale parameters over the period 1950 to 2018 calculated using Mod4 are shown in Fig. 7. Approximately 96 % of the grid boxes show an increase in the location parameter over the study period, and after calculating approximate 95 % confidence intervals for each change in location, 88 % of these have a lower limit greater than zero. We cannot, however, conclude that warming is detected in TXx in 88 % of grid boxes, as we make no attempt here to correct for multiple comparisons. The changes in the GEV scale show rather more variability, with 37 % positive and 63 % negative fitted values. After calculating approximate 95 % confidence intervals for the scale changes, approximately 50 % of these intervals contain 0. The spatially averaged changes in the GEV location and scale parameters for each sub-region in Table 1 are shown in Table 6, and the spatially averaged distributions of TXx in 1950 and 2018 are shown for each sub-region in Fig. 8. All the regions show, on average, significant increases in the GEV location parameter, indicating a tendency for TXx to shift towards hotter temperatures. There is a mixture of both increases and decreases in the spatially averaged GEV scale parameter differences, corresponding to increasing and decreasing variability of TXx respectively. The regions with an average increase in the GEV scale parameter occur in the east and in Ireland. We find that, under Mod4, a 95 % confidence interval for the spatially averaged change in the GEV location parameter over all regions is [2.28, 2.32], whereas the corresponding interval for the change in scale is [−0.08, −0.05]. Thus, although there is a strong signal for shifts of the TXx distributions in all regions towards hotter temperatures, as indicated by the positive changes in the GEV location, the mixture of increases and decreases in the GEV scale parameter approximately cancel each other out when spatially averaged across our study region. The changes in the GEV location parameter for Mod2 are the same as the return-level differences shown in the bottom row of Fig. 5.
In the same notation as Sect. 3.2, for Mod4, a test of the null hypothesis µ_1 = 0, i.e. that all location slopes are zero, using the asymptotic distributional results of Wood et al. (2016), has for each region a p value of less than 2 × 10^-16, giving strong evidence against the null hypothesis. The same result is found for the location slopes from Mod2 and the log-scale slopes for Mod4. Together with the results in Table 3, the hypothesis of no temporal variation in the GEV parameters is strongly rejected in all regions, supporting robust detection of changing TXx distributions over time.

Conclusions
We have considered the problem of detecting and quantifying large-scale changes in the distributions of the annual maximum daily maximum temperature (TXx) in a large subset of Europe during the years 1950–2018. Our approach was to divide the full domain into eight sub-regions over which several statistical models were fitted. In each of the models considered, TXx at each grid box was modelled using a generalized extreme-value (GEV) distribution with the GEV location and scale parameters allowed to vary in time using atmospheric CO2 as a covariate. We modelled the GEV parameters as varying smoothly over space, where the appropriate degree of smoothness was determined objectively using the methods of Wood et al. (2016). Changes were detected most strongly in the GEV location parameter, with the distributions of TXx shifting towards hotter temperatures at most grid boxes. Although the best-performing model in all regions has both the GEV location and scale parameters changing in time, the signal for changes in the scale parameters is noisier than that for the location parameters, with some regions showing a tendency towards increases in scale and others a decrease. The regions that show a tendency towards an increase in scale, corresponding to an increase in the variability of TXx, are in eastern Europe and Ireland. The second-best-performing model in all regions has only the GEV location parameter changing in time. Regardless of whether our best or second-best model is used, our main findings regarding changes in return levels based on the 2018 and 1950 climates and risk ratios broadly agree. Using our best-performing model and averaging across our entire spatial domain, the 100-year return level of TXx based on the 2018 climate is approximately 2 °C hotter than that based on the 1950 climate. Also averaging across our spatial domain, the 100-year return level of TXx based on the 1950 climate corresponds approximately to a 6-year return level in the 2018 climate. Our findings are most robust in central Europe, where the underlying network of weather stations used to construct the gridded data set has the highest density. Finally, although we made an effort to mitigate well-known deficiencies of gridded station data, namely inhomogeneities and regions of low station density, the data we used will nonetheless be imperfect. It would therefore be of great interest to see how well our findings may be replicated using different data sources in future studies.

Appendix A: Model scoring
We use several scoring rules that evaluate the performance of models based on their ability to predict unseen data, i.e. data that were held out from the model-fitting procedure. A scoring rule is a function, S, that assigns a real number value S(F, y) to the pair (F, y), where F is a distribution function and y is an observed value. In our case F will be the cumulative distribution function of a fitted GEV distribution and y will be some observation, held out from the model-fitting procedure, that under our model is assumed to be drawn from F . Intuitively, the value of S(F, y) can be thought of as measuring the extent to which the distribution F and observation y are compatible. All of the scoring rules that we consider are negatively oriented, so that smaller scores correspond to better predictions.
A negatively oriented scoring rule S is called proper (Gneiting and Raftery, 2007) if

E_{Y∼G}[S(G, Y)] ≤ E_{Y∼G}[S(F, Y)]   (A1)

for all distributions F and G, where E denotes expectation; i.e. on average, the true distribution G will not give a worse score than any other distribution F, so that forecasters are incentivized to report the truth. All the scores that we consider are proper. If equality in Eq. (A1) holds only if F = G, then S is called strictly proper. One of the simplest and most common scoring rules is the squared error score S_SE defined by

S_SE(F, y) = (y − µ_F)^2,   (A2)

where µ_F is the expected value of a random variable with distribution F. The squared error score gives higher penalties the further an observation is from the mean of the distribution but may be regarded as rather simplistic in that the distribution F is represented in the score only through its mean µ_F. When the variance of F is large, we may wish to give a smaller penalty to observations y that are far from the mean µ_F. One score that does this is the Dawid-Sebastiani score, S_DS, defined by

S_DS(F, y) = (y − µ_F)^2/σ_F^2 + 2 log σ_F,   (A3)

where σ_F^2 is the variance of F. Another commonly used scoring rule is the continuous ranked probability score (CRPS) defined by

S_CRP(F, y) = ∫_{−∞}^{∞} {F(x) − 1(y ≤ x)}^2 dx,   (A4)

where 1(·) denotes the indicator function. Unlike the squared error and Dawid-Sebastiani scores, the CRPS takes into account the full distribution F and compares it with the empirical distribution based on the single observation y. A closed-form expression for the CRPS when F is the distribution function of a GEV random variable is given in Friederichs and Thorarinsdottir (2012) and implemented in the R package scoringRules (Jordan et al., 2019).
For an extreme-value analysis, it may be desirable to consider a scoring rule that gives a higher penalty for poor prediction in the tails of the distribution. One way to do this is to use a weighted version of the CRPS (Gneiting and Ranjan, 2011). First, we note that an alternative representation of the CRPS in terms of quantiles is

S_CRP(F, y) = 2 ∫_0^1 {1(y < F^{−1}(p)) − p}{F^{−1}(p) − y} dp.   (A5)

The equality of Eqs. (A4) and (A5) is shown in Laio and Tamea (2006). The weighted continuous ranked probability score (WCRPS) is obtained from Eq. (A5) by adding an extra factor, w(p), to the integrand, which determines the weight given to the pth quantile, giving

S_WCRP(F, y) = 2 ∫_0^1 w(p){1(y < F^{−1}(p)) − p}{F^{−1}(p) − y} dp.   (A6)

In our application we use the weighting function w(p) = p^2. As the integral in Eq. (A6) does not have a closed-form solution when F is the distribution function of a GEV random variable, we approximate it with the summation

S_WCRP(F, y) ≈ (2/N) Σ_{i=1}^{N−1} w(p_i){1(y < F^{−1}(p_i)) − p_i}{F^{−1}(p_i) − y},   (A7)

for large N and 0 ≤ p_1 < p_2 < ... ≤ 1 a partition of the interval [0, 1]. In our case we take the evenly spaced partition p_i = i/N with N = 1000 and i = 1, 2, 3, ..., 999. We use each of the scoring rules described above as part of a cross-validation scheme to evaluate a model's performance. This is described in Appendix B.
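A minimal sketch of the approximation in Eq. (A7) for a GEV forecast distribution is given below, with w(p) = p^2 and illustrative (not fitted) parameter values.

```r
# Sketch of the quantile-sum approximation (Eq. A7) to the WCRPS with
# w(p) = p^2 for a GEV forecast; the GEV quantile function solves
# G(q) = p for q, as in Eq. (3).
qgev0 <- function(p, mu, sigma, xi) mu - (sigma / xi) * (1 - (-log(p))^(-xi))
wcrps_gev <- function(y, mu, sigma, xi, N = 1000) {
  p <- (1:(N - 1)) / N
  q <- qgev0(p, mu, sigma, xi)
  # integrand of Eq. (A6): 2 * w(p) * {1(y < q) - p} * (q - y)
  mean(2 * p^2 * (as.numeric(y < q) - p) * (q - y))
}
wcrps_gev(y = 36.2, mu = 33, sigma = 1.5, xi = -0.2)  # illustrative values
```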

Appendix B: Cross-validation and score comparisons
For each of the regions in Table 1, we evaluate the performance of each of the models in Table 2 using 5-fold cross-validation. Specifically, for a fixed region, we randomly assign each observation of the region to one of five subsets, or splits, of approximately equal size. The same splits are used in each model evaluation. Suppose there are N_j observations in split j, 1 ≤ j ≤ 5, which we denote by y_1^(j), ..., y_{N_j}^(j). We fix one of the splits, k say, 1 ≤ k ≤ 5, to be used as test data and fit the model of interest to the data from the remaining four splits. For each observation y_i^(k) in the test split, we compute the score S(F_i^(k), y_i^(k)), where F_i^(k) is the GEV distribution function that, under the fitted model, y_i^(k) is assumed to be drawn from. This procedure is carried out for each k, 1 ≤ k ≤ 5, so that each split gets used as test data. The mean score,

S̄ = (1/N) Σ_{k=1}^5 Σ_{i=1}^{N_k} S(F_i^(k), y_i^(k)),

where N = Σ_{k=1}^5 N_k is the total number of observations in the region, gives an overall measure of model performance according to score S. For the scores defined in Appendix A, which are all negatively oriented, models with lower mean scores are preferred.
Suppose that in the cross-validation procedure described above, model B produces a lower mean score than another model, A. If the observed difference in the mean scores is very small, we may wish to test whether this really provides evidence that model B is better than model A. Suppose that, for each model, we have the N scores S(F_i^A, y_i) and S(F_i^B, y_i), 1 ≤ i ≤ N, where F_i^A and F_i^B are the distribution functions that, under models A and B respectively, observation y_i is assumed to be drawn from. We will construct a test for the null hypothesis that the scores S(F_i^A, y_i) and S(F_i^B, y_i) are pair-wise exchangeable for 1 ≤ i ≤ N. Two random variables X_1 and X_2 are said to be exchangeable if P(X_1 ≤ x_1, X_2 ≤ x_2) = P(X_1 ≤ x_2, X_2 ≤ x_1). In particular, this implies that X_1 and X_2 are identically distributed. If the scores S(F_i^A, y_i) and S(F_i^B, y_i) are pair-wise exchangeable, then the score S(F_i^A, y_i) would be equally likely to have been produced by model B, and similarly, S(F_i^B, y_i) is equally likely to have been produced by model A. Thus, for the observed score difference S_i = S(F_i^A, y_i) − S(F_i^B, y_i), we would have been equally likely to have observed −S_i under the null hypothesis. This motivates the randomized test procedure defined in Algorithm B1 below.
Algorithm B1. Hypothesis testing procedure for testing pair-wise exchangeability of model scores.

Input: positive integer J and model scores S(F_i^A, y_i), S(F_i^B, y_i), 1 ≤ i ≤ N.
Output: p, an estimate of the p value of the test.

Compute the observed test statistic T_obs, the mean of the score differences S_i = S(F_i^A, y_i) − S(F_i^B, y_i). For each of J replications, compute the randomized score differences by flipping the sign of each S_i independently with probability 1/2 and record their mean. Set p to be the proportion of replications whose mean is at least T_obs.

The value of p computed in Algorithm B1 is an unbiased estimate of the one-sided p value for the test with null hypothesis that the scores S(F_i^A, y_i) and S(F_i^B, y_i) are pair-wise exchangeable, 1 ≤ i ≤ N. The value of the observed test statistic T_obs is strictly positive since it is assumed that model B has the lower observed mean score. Small values of p give evidence against the null hypothesis in favour of model B. In our applications of Algorithm B1, we take J = 10^6.
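A compact R sketch of Algorithm B1 is given below; the score vectors are simulated placeholders, and a smaller J than the 10^6 used in our applications is chosen for speed.

```r
# Sketch of Algorithm B1: a sign-flip randomization test for pair-wise
# exchangeability of paired model scores. scores_A and scores_B are
# hypothetical; J is reduced from the paper's 10^6 for speed.
exch_test <- function(scores_A, scores_B, J = 1e4) {
  d <- scores_A - scores_B   # positive values favour model B (lower score)
  T_obs <- mean(d)
  T_rand <- replicate(J, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
  mean(T_rand >= T_obs)      # unbiased estimate of the one-sided p value
}
set.seed(1)
sA <- rnorm(500, mean = 1.05)   # hypothetical scores under model A
sB <- rnorm(500, mean = 1.00)   # hypothetical scores under model B
exch_test(sA, sB)
```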

Appendix C: Diagnostic plots
In this Appendix we perform some visual checks for Mod4 to see whether this model provides a reasonable fit to the data and is not merely the best of a bad bunch of models. As it is not feasible to provide plots for every grid box, we consider the performance as a whole over the sub-regions as defined in Table 1.
One simple way to check for any systematic discrepancies between a fitted model and the observed data is to use the probability integral transform (PIT). The PIT states that, if Y is a random variable with continuous distribution function F, then the random variable F(Y) is uniformly distributed between 0 and 1. Suppose that, given a sample of size n with observed response variables y_i, 1 ≤ i ≤ n, a statistical model fits distribution function F_i to y_i. Then, if the model is correct, the values F_i(y_i), 1 ≤ i ≤ n, are a sample of size n from a uniform distribution on [0, 1]. The plausibility of this may be checked visually, e.g. by plotting a histogram of the n PIT values F_i(y_i), 1 ≤ i ≤ n. A U-shaped histogram would indicate that the fitted model is underdispersive, i.e. it does not adequately account for the variability in the data, whereas a histogram that is too peaked in the middle indicates that the model is overdispersive. Histograms of the PIT values, by region, are shown for Mod4 in Fig. A1 and do not show any serious cause for concern.
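The sketch below illustrates the PIT check; since the "observations" are simulated from the same GEV that is then used to transform them, the resulting histogram should look approximately flat.

```r
# Sketch of the PIT check: transform observations by their fitted GEV
# distribution functions and inspect the histogram for uniformity. The
# observations here are simulated from the model itself, so the histogram
# should look approximately uniform.
qgev0 <- function(p, mu, sigma, xi) mu - (sigma / xi) * (1 - (-log(p))^(-xi))
pgev0 <- function(y, mu, sigma, xi) exp(-pmax(1 + xi * (y - mu) / sigma, 0)^(-1 / xi))
set.seed(1)
mu <- 33; sigma <- 1.5; xi <- -0.2
y <- qgev0(runif(5000), mu, sigma, xi)  # hypothetical observations
hist(pgev0(y, mu, sigma, xi), breaks = 20,
     main = "PIT values", xlab = "F(y)")
```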
Another standard method for checking a non-stationary extreme-value model fit is via probability or quantile plots (Coles, 2001, Sect. 6.2.3). Both of these plots are based on the fact that if Y_it is a GEV random variable with parameters θ_it = (µ_it, σ_it, ξ_it), then the variable Z_it defined by

Z_it = (1/ξ_it) log{1 + ξ_it(Y_it − µ_it)/σ_it}   (C1)

has a standard Gumbel distribution with distribution function F(z) = exp{−exp(−z)}, z ∈ R. If we fix a specific region in Table 1 and suppose that, as in Sect. 3.1, in grid box i we have observed annual maxima y_{it_j}, 1 ≤ j ≤ n_i, then, from a given fitted model, we may obtain the values z_{it_j}, 1 ≤ j ≤ n_i, by applying the transformation in Eq. (C1). If there are N grid boxes in the region in total, then we may apply this transformation to all grid boxes and obtain m = Σ_{i=1}^N n_i transformed z values. If the model is correct, then the ordered values z_(1), z_(2), ..., z_(m), where z_(k) ≤ z_(l) when k ≤ l, would be a sample from a standard Gumbel distribution. The probability plot tests the plausibility of this by comparing empirical and fitted-model probabilities and plots the pairs

( k/(m + 1), exp{−exp(−z_(k))} ),   1 ≤ k ≤ m.
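The following sketch applies the transformation in Eq. (C1) to a simulated GEV sample and draws the probability plot, together with the quantile-plot counterpart described next; with simulated data both should hug the diagonal.

```r
# Sketch of the Gumbel-scale diagnostics: transform a simulated GEV sample
# via Eq. (C1) and draw the probability plot and the quantile plot described
# in the text below. Parameter values are illustrative.
set.seed(1)
mu <- 33; sigma <- 1.5; xi <- -0.2
y <- mu - (sigma / xi) * (1 - (-log(runif(1000)))^(-xi))  # hypothetical sample
z <- sort((1 / xi) * log(1 + xi * (y - mu) / sigma))      # Eq. (C1)
k <- seq_along(z); m <- length(z)
plot(k / (m + 1), exp(-exp(-z)), xlab = "Empirical probability",
     ylab = "Model probability"); abline(0, 1)
plot(z, -log(-log(k / (m + 1))), xlab = "Ordered z values",
     ylab = "Standard Gumbel quantiles"); abline(0, 1)
```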
The quantile plot compares fitted-model and empirical quantiles and plots the pairs

( z_(k), −log[−log{k/(m + 1)}] ),   1 ≤ k ≤ m.

The further these plots deviate from a diagonal line, the greater the discrepancy between the fitted model and the observations. Probability and quantile plots are shown by region in Figs. B1 and C1. The probability plots stay very close to the diagonal line, indicating a good quality of fit for each sub-region. The quantile plots show, to a varying extent in each region, some discrepancy in the upper tail. For the sake of reference, the values 5, 6, 7 and 8 correspond to the 0.9933, 0.9975, 0.9991 and 0.9997 theoretical quantiles of the standard Gumbel distribution respectively. Thus we can see that Mod4 gives a good fit in all regions at least up to the 0.9933 regional quantile and in several regions beyond this but fails to explain approximately the largest 0.05 % of observations in each region. Given the very large quantities of data in each region, this is of no great surprise or concern.

Finally, we inspect the spatial distribution of the Pearson residuals obtained from a model fit. For grid box i we obtain the n_i residuals r_ij, 1 ≤ j ≤ n_i, defined by

r_ij = {y_{it_j} − Ê(y_{it_j})} / {V̂ar(y_{it_j})}^(1/2),

where Ê(y_{it_j}) and {V̂ar(y_{it_j})}^(1/2) denote the model-fitted expected value and standard deviation respectively. When the fitted distribution is GEV with parameter vector θ_it = (µ_it, σ_it, ξ_it), these expressions become

Ê(y_{it_j}) = µ_it + (σ_it/ξ_it){Γ(1 − ξ_it) − 1},  ξ_it ≠ 0, ξ_it < 1;
Ê(y_{it_j}) = µ_it + σ_it γ,  ξ_it = 0;
Ê(y_{it_j}) = ∞,  ξ_it ≥ 1,

and

V̂ar(y_{it_j}) = (σ_it^2/ξ_it^2){Γ(1 − 2ξ_it) − Γ(1 − ξ_it)^2},  ξ_it ≠ 0, ξ_it < 1/2;
V̂ar(y_{it_j}) = σ_it^2 π^2/6,  ξ_it = 0;
V̂ar(y_{it_j}) = ∞,  ξ_it ≥ 1/2,

where γ ≈ 0.5772 is Euler's constant and Γ(t) = ∫_0^∞ x^(t−1) e^(−x) dx. If the fitted model were correct, then the Pearson residuals would be realizations of a random variable with mean 0 and standard deviation 1. The mean and standard deviation of the Pearson residuals for each grid box for both Mod4 and Mod2 are shown in Fig. D1. The spatial distributions look similar for both models and do not display any worryingly large deviations from the zero-mean, unit-standard-deviation assumption.

Figure D1. Mean values (a, b) and standard deviations (c, d) of Pearson residuals for Mod4 and Mod2.

Appendix D
Code and data availability. The data analysed in this paper are publicly available from https://www.ecad.eu/download/ensembles/downloadversion19.0eHOM.php (European Climate Assessment & Dataset project with Horizon 2020 EUSTACE project, 2023). Regional R data frames for each of the regions defined in Table 1 and R scripts for fitting models and performing cross-validation and diagnostics are available from https://doi.org/10.6084/m9.figshare.21257217.v1.