Postprocessing ensemble forecasts of vertical temperature profiles

Weather forecasts from ensemble prediction systems (EPS) are improved by statistical models trained on past EPS forecasts and their atmospheric observations. Recently these corrections have moved from being univariate to multivariate. The focus has been on (quasi-)horizontal atmospheric variables. This paper extends the correction methods to EPS forecasts of vertical profiles in two steps. First univariate distributional regression methods correct the probability distributions separately at each vertical level. In the second step copula coupling re-installs the dependence among neighboring levels by using the rank order structure of the EPS forecasts. The method is applied to EPS data from the European Centre for Medium-Range Weather Forecasts (ECMWF) at model levels interpolated to four locations in Germany, from which radiosondes are released to measure profiles of temperature and other variables four times a day. A winter case study and a summer case study, respectively, exemplify that univariate postprocessing fails to preserve stable layers, which are crucial for many atmospheric processes. Quantile resampling and a resampling that preserves the relative distance between individual EPS members improve the calibration of the raw forecasts of the temperature profiles as shown by rank histograms. They also improve the multivariate metrics of energy score and variogram score and retain the stable layers. Improvements take place over all times of the day and all seasons. They are largest within the atmospheric boundary layer and for shorter lead times.


Introduction
Ensemble prediction systems (EPS) are an important tool in modern weather forecasting for providing estimates of the range of possible forecast outcomes. The individual members of an EPS are based on numerical weather prediction (NWP) models, which simulate the fluid dynamic and thermodynamic behavior of the atmosphere and its lower boundary. NWP models are not perfect because they only approximately represent physical laws, cannot resolve processes at all temporal and spatial scales, and have to start from an inexactly known initial state. Their imperfection was actually the motivation behind EPS, which should provide a realistic and comprehensive spectrum of the possible future weather.
Statistical postprocessing techniques, which learn from past measurements and NWP EPS forecasts, can remove systematic errors of the EPS forecasting distribution (e.g., . Efforts in the development of these techniques have so far been concentrated on forecasts for individual locations or near-surface fields, where these errors might arguably be largest, due to the inability of numerical models to fully resolve boundarylayer processes and interactions between the atmosphere and the surface (e.g., Holtslag et al., 2013). Many univariate approaches for a single response variable build on the nonhomogenous Gaussian regression (NGR, , a distributional regression method. Methods have become available that not only correct the forecasts at the locations for which measurements exist, but also at any location in between. An example is the standardized anomalies model output statistics (SAMOS, Dabernig et al., 2017). Multivariate approaches attempt to preserve the correlation structure between variables or the same variable at different locations in the postprocessing step, e.g., with ensemble copula coupling (ECC, Schefzik et al., 2013;Wilks, 2015).
Postprocessing the vertical structure of the atmosphere, on the other hand, has so far been mostly neglected, with the exception of Renkl (2013), who approximated the vertical profiles by their leading normal modes and adjusted the coefficients of these modes with kriging. However, vertical profiles of air and dew-point temperatures are important tools in weather forecasting. They show the stratification and hence the static stability of the atmosphere and allow one to draw conclusions about vertical mixing. Vertical temperature profiles are used to diagnose and forecast precipitation (e.g., Reeves et al., 2014), convection (e.g., Markowski and Richardson, 2010;Simon et al., 2018), radiative transfer (e.g., Rozanov et al., 2014), the diurnal temperature cycle (e.g. Blandford et al., 2008), and air pollution (e.g., Whiteman et al., 2014) -to just name a few.
Forecasts of temperature profiles also suffer from systematic errors of the underlying NWP models. Additionally, the forecast uncertainty represented by the EPS has systematic errors and is typically underdispersive (Hagedorn et al., 2012). This article uses forecasts from a global NWP EPS model and radiosonde measurements (Sect. 2) and adapts univariate and multivariate methods for the postprocessing of vertical temperature profiles (Sect. 3). Since the vertical stratification of the atmosphere drastically constrains the air motion and exchange processes, the correlation structure between temperatures at model levels will have to be preserved instead of correcting each level independently. The results of these corrections are presented in Sect. 4 and discussed in Sect. 5.

Data
To develop and demonstrate methods for the correction of vertical temperature profiles forecast by NWP models, we use the lower two-thirds of the troposphere from the surface to 400 hPa. Many processes that affect its lowest part, the planetary boundary layer, are parameterized in an NWP model, and substantial systematic errors occur (cf. Holtslag et al., 2013), which can then be alleviated with statistical postprocessing. Three years of data from 15 September 2016 to 15 September 2019 were available.

Observations
Temperature profiles are from radiosondes where a sensor package carried aloft by a balloon transmits data back to a ground station. Vertical resolution of the data as available in global databases is variable on the order of 100 m. We chose four of the very few available radiosonde stations that measure four times a day instead of only twice in order to capture the diurnal cycle. These four stations are spread throughout Germany in flat to hilly terrain with launch times at 00:00, 06:00, 12:00, and 18:00 UTC: Bergen (52.82 • N,9.92 • E), Lindenberg (52.21 • N, 14.12 • E), 7.33 • E) and Kuemmersbruck (49.43 • N,11.90 • E). All the stations are located at the mid-latitudes and none of the sites is characterized by complex terrain. The average distance between the stations is approx. 300 km. Quality-controlled temperature and pressure data (see Durre et al., 2008, for details) were accessed from the freely available Integrated Global Radiosonde Archive of the National Oceanic and Atmospheric Administration (NOAA, 2019).

Forecasts
We use forecasts from the EPS of the European Centre for Medium Range Weather Forecasts (ECMWF) with 50 perturbed ensemble members and one unperturbed control run (ECMWF, 2012). The model is discretized in a 0.25 • ×0.25 • horizontal grid on 91 vertical levels. Since data from individual ensemble members on model levels are not archived at ECMWF, we saved real-time forecast data from 15 September 2016 to 15 September 2019 for 25 lead times at 6-hourly time intervals out to +144 h (6 d). The forecasts are interpolated bi-linearly to the radiosonde locations in the horizontal. Due to the use of a hybrid coordinate system in the ECMWF forecasts, the pressures of the 51 ensemble members at each model level differ somewhat. Therefore, observations and NWP forecasts are linearly interpolated to the arithmetic mean of the 51 pressures at each model level.
In the remainder, the radiosonde measurements T obs are used as response variables. For univariate statistical postprocessing only the sample mean T ens and sample standard deviation SD ens of the ensemble temperature forecasts are used as the ensemble predictors, with the assumption that the ensemble members can be adequately described by a Gaussian distribution. For multivariate postprocessing model-level temperatures of all ensemble members are used.

Methods
The goal of postprocessing is to alleviate systematic errors in the vertical temperature profile and produce a probabilistic forecast that is calibrated and sharp. For the univariate case, "calibrated" means that the verifying observations are equally likely to fall into the bins into which the ordered NWP ensemble members partition the real line. The sharper a calibrated predictive distribution is, the smaller the uncertainty of the forecast will be.
Postprocessing will proceed in two steps. First, to correct the marginal distributions of temperature at all vertical height levels simultaneously, non-homogeneous Gaussian regression (NGR,  will be extended by two different methods described in Sect. 3.1 and 3.2. In the second step the vertical interdependence between the levels discarded in step one will be retrofitted with the multivariate method of ensemble copula coupling (ECC, Sect. 3.3). Multivariate here refers to temperature at multiple vertical height levels, not to multiple observational sites or lead times, which are handled separately. Thus, the models described in the following sections are fitted to each site and for each lead time individually. For the whole data set with 4 sites and 25 lead times, 100 NGRVC models and 100 SAMOS models have to be fitted. Each of these models covers all vertical height levels. A key technique for fitting the nonlinear regression models behind NGRVC and SAMOS is the class of generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005), which is introduced generically in Appendix A.

Univariate correction: nonhomogeneous Gaussian regression with varying coefficients (NGRVC)
Nonhomogeneous Gaussian regression  is a postprocessing technique for ensemble forecasting systems. The observed temperature y is assumed to follow a Gaussian distribution, determined by two parameters, µ and σ . The expectation of a Gaussian distributed random variable is E(y) = µ. Thus, µ is often referred to as expectation or mean. σ is often denoted as the standard deviation. However, in order to make these parameters easier to distinguish from the sample mean and sample standard deviation, we call µ and σ the location and scale parameter, which is also common in the statistical literature. To condition the Gaussian distribution on covariates derived from the NWP model, the location µ is expressed by a linear model of the ensemble (sample) mean T ens and the scale parameter σ by the sample standard deviation of the NWP ensemble members SD ens : and log(σ ) = γ 0 + γ 1 · SD ens .
Originally,  estimated the coefficients β and γ with a sliding training window in order to allow for their seasonal variation. An NGR, which can exploit the full data set and have coefficients that may vary not only seasonally but also by arbitrary other factors (NGRVC), can be constructed with GAMLSS models. Then a sum of several nonlinear functions replaces the intercepts and linear coefficients in this very recent extension of NGR (cf. Simon et al., 2017;Lang et al., 2020). Here, the location parameter µ is chosen as Both the intercept and the linear coefficient are allowed to vary smoothly with functions of pressure p, season (day of year doy) and their interaction. The |hh in the index of the function for the pressure dependence indicates that this function is conditional on the time of the radiosonde ascent (00:00, 06:00, 12:00, and 18:00 UTC in the current case) in order to account for diurnal variations. The terms f 4|hh , f 5 and f 6 provide a varying linear coefficient for the ensemble mean T ens and thus allow the coefficient β 1 in Eq.
(2) to vary smoothly. The functional terms f 1 -f 3 act on the scale of the additive predictor (here temperature in degrees) and serve as varying intercept.
A similar model is chosen for the logarithm of the scale parameter, where the functions f 1 -f 3 act on the logarithmic scale and allow the coefficient γ 0 in Eq.
(3) to vary smoothly. Functions f 4 -f 6 provide a varying linear coefficient of the logarithm of the standard deviation and allow γ 1 of Eq.
(2) to vary smoothly. Note that the smooth functions f in Eqs. (3) and (4), which are identically named for simplicity, might vary in their form; e.g., the day of the year term in Eq.
(3) might have a different functional form than the day of the year term in Eq. (4). For more details on the relation between the classical sliding window approach and this smooth model approach, the reader is directed to Lang et al. (2020).

Alternative univariate correction: standardized anomalies model output statistics
This section provides an alternative method to remove the constraint that NGR models have to be fitted to each location separately. In the current case location refers to a particular level of the vertical profiles and not to multiple sites as in previous applications (Dabernig et al., 2017). Thus, the SAMOS models in the current study are fitted to the data of each site individually.
The motivation behind SAMOS is that with many locations and/or for operational use fitting many models might become computationally expensive. However, one single NGR-like regression model can be fitted to all locations at once when in a first step all site-specific characteristics are (nearly) eliminated from observations and ensemble data. The model is fitted not to the direct values, but rather to standardized anomalies formed by subtracting the respective climatological mean and dividing by the climatological standard deviation. This SAMOS approach is described in Dabernig et al. (2017).
Here, we estimate the climatological properties of the observations with a GAM(LSS); for the expected value as with similar terms to Eq. (3). Only an intercept is estimated for the logarithm of the scale parameter σ , which technically reduces the GAMLSS to a GAM. One set of (µ, σ ) has to be fitted for the radiosonde measurements, another one for the EPS data. The pressure levels of the observational radiosondes and the EPS data are not the same. We linearly interpolated the spatially better resolved observations to the pressure levels of the EPS data (see Sect. 2.2). Then the standardized anomalies of the observations and the ensemble are linked by a simple linear regression of the original NGR Eq. (2).

Multivariate correction: ensemble copula coupling
Both NGRVC (Sect. 3.1) and SAMOS (Sect. 3.2) are univariate postprocessing methods; i.e., only marginal distributions are predicted. To account for dependence structure of the vertical levels, ensemble copula coupling (ECC, Schefzik et al., 2013) is employed, which uses the dependence structure contained in the vertical profiles of the raw ensemble via its order statistics.
The ECC procedure consists of three steps.
1. Sample quantiles from the predicted marginal distributions.
2. Access the rank structure of the raw ensemble.
3. Arrange the quantiles from step 1 with the ranks of step 2.
In step 1 temperature quantiles have to be sampled from the univariate distributions obtained by either NGRVC (Sect. 3.1) or SAMOS (Sect. 3.2) for each level. The number of sampled quantiles matches the number of ensemble members m. Three different approaches are commonly employed (Schefzik et al., 2013) for sampling these quantiles.
-ECC-Q transforms m equally spaced probabilities with the quantile function (i.e., the inverse of the cumulative distribution function) of the predictive distribution.
-ECC-R draws samples randomly from the predictive distribution.
-ECC-T first fits a Gaussian distribution to the raw ensemble and then evaluates this cumulative distribution function (cdf) at the values of the raw ensemble leading to m probabilities, which are finally transformed with the inverse of the predictive cdf.
We discuss these three sampling approaches further in the case studies (Sect. 4.2.1). After this first step in the ECC procedure we have sorted quantiles of temperatures for each vertical level.
The second step is to access the ranks of the raw ensemble temperatures for each level: the ensemble member with the lowest temperature gets rank 1, the member with the second lowest temperature gets rank 2, and so on up to the member with the highest temperature, which gets rank m. In case of ties, e.g., two members predicting the same temperature, the ranks are drawn randomly. For instance, when the minimum of the ensemble occurs twice, ranks 1 and 2 are assigned randomly to these two members. At the end of this step we have ensemble ranks for each level.
The final step of the ECC procedure takes the quantiles from step one and arranges them using the ranks of step two. For instance, the lowest temperature within the quantiles of step one will be associated with the member that has rank 1, the second lowest quantile with the member of rank 2, and so on. This is applied for each level individually. Despite the fact that these steps are applied to each vertical level individually, ECC is nevertheless a multivariate method. The rank orders for individual levels are all taken from a single numerical ensemble prediction that conserves the correlation structure across the levels. As a whole, the procedure results in a prediction of vertical temperature profiles with margins (at each level) calibrated in location and spread (by SAMOS or NGRVC) and the rank order structure of the raw ensemble.

Verification measures
Several metrics are used to evaluate the performance of the forecasting methods in achieving the goal of probabilistic forecasts: to maximize the sharpness of the predictive distribution subject to calibration. The metrics are applied out-ofsample using 5-fold cross-validation to test on independent data and avoid overfitting.
Calibration and sharpness of univariate predictions are evaluated using the probability integral transform (PIT) histogram and the sharpness diagram, respectively . Calibration refers to the statistical consistency between predicted distributions and observations. A calibrated forecast leads to a uniform PIT histogram. When the prediction is not available as a full probability distribution, but as samples of a distribution as in the case of the raw ensemble, one can assess calibration with the rank histogram. The rank histogram displays the rank of the observation in a vector merging observations and samples of the predictive distribution. If the observation is indistinguishable from the samples, the rank histogram is uniform; i.e., the forecast is calibrated.
Sharpness is an attribute of the forecast alone and refers to the concentration of the predictive distributions. This attribute is revealed in the sharpness diagram which displays the width of the central prediction interval of the predictive distributions. The aim of forecasting schemes is to obtain calibrated distributions that are as sharp as possible .
The rank histogram and the PIT histogram are common tools to verify the calibration of the forecasts in the univariate case. Thorarinsdottir et al. (2016) introduce four different types of their extensions to assess multivariate calibration. Wilks (2017) further investigates these methods in a simulation study and concludes that one should always use a combination of different multivariate rank histograms. In this study we employ the average rank histogram (ARH) and the banddepth histogram (BDH). The ARH provides a measure of ascending rank. In contrast, the BDH assesses the centrality of the observation within the forecast ensemble. One can thus not spot from the BDH whether a potential bias is positive or negative, which one can identify from the ARH.
As numerical measures we apply the continuous ranked probability score (CRPS) and the log score (LS) for univariate predictions . The LS is the negative log-likelihood. The CRPS can be interpreted as integral over all possible Brier scores and is a generalization of the mean absolute error for point forecasts (Hersbach, 2000).
The energy score (ES,  and the variogram score (VS, Scheuerer and Hamill, 2015b) are applied as numerical measures for assessing the performance of the multivariate predictions. The ES is the multivariate generalization of the CRPS. Like the CRPS, the ES gives absolute values only and is negatively orientated, meaning lower values are better then higher values. It addresses calibration of multivariate forecasts in terms of location and scale, i.e., mean and spread. However, some studies suggest that the ES is not sufficiently sensitive to misspecifications of the correlation structure (e.g., Pinson and Girard, 2012). Thus, Scheuerer and Hamill (2015b) introduced the VS as an alternative multivariate score. We employ the VS with an order of 0.5, which was found to discriminate well between correct and incorrect correlations and between calibrated and overdispersive/underdispersive forecasts (cf. Eq. 3 in Scheuerer and Hamill, 2015b). The VS is an absolute measure too, and its skill is negatively orientated.
All scores are computed out-of-sample by 5-fold crossvalidation using the scoringRules package (Jordan et al., 2019) for the R environment for statistical computing (R Core Team, 2019).
The validation of the numerical scores was done by bootstrapping to estimate the uncertainty in the scores. This was applied by drawing a random sample in the size of the original data with replacement and averaging it. This procedure was repeated 500 times, and the 500 values are presented as box-and-whisker plots. In order to preserve the vertical dependence structure while bootstrapping, we considered a vertical profile to be the smallest indivisible unit: rather than bootstrapping temperatures on vertical levels and merging these samples to new profiles, we bootstrapped vertical temperature profiles as a whole for different times, locations and lead times.

Results
The presentation of the results proceeds from the univariate case using the two NGR extensions (NGRVC and SAMOS) to the multivariate case, which is introduced with case studies of winter and summer profiles, respectively, and then verified for the whole data set. The response variable for all approaches is the temperature measured by the radiosondes.

NGR with varying coefficients (NGRVC)
Allowing varying coefficients of the NGR makes a diurnally, seasonally and vertically varying correction of the expected value of the Gaussian forecasting distribution of the EPS possible -as described by Eq. (3). Figure 1 illustrates these effects. Panels (a) and (c) show the diurnally varying vertical contributions to the varying intercept and the varying linear coefficient in the model of the location parameter µ. The strongest correction from the intercept of the early morning sounding at 06:00 UTC is in the layer close to the surface, which would increase the strength of the surface-based in-version in the EPS. The intercept term warms the 12:00 UTC sounding in the boundary layer and immediately above and cools the EPS profile somewhat in the middle of the troposphere. However, the intercept term cannot be seen in isolation. The vertical correction of the varying linear coefficient term (Fig. 1c) warms the original EPS profile on average over the whole year, with the strongest warming in the 18:00 UTC evening/early night sounding. Panels (b) and (d) in Fig. 1 show the effects of the seasonally varying corrections f i (doy) and the interaction of seasonal and vertical corrections f j (doy, p) for the average (over the whole year) 12:00 UTC soundings. The intercept correction adds the seasonal cycle, with the strongest variations in the boundary layer. The seasonally changing correction of the linear term, on the other hand, is less pronounced, especially above the boundary layer, and thus only linearly depends on the ensemble sample variance SD ens .

Climatology for SAMOS
SAMOS, a further extension to NGR, makes it possible to correct all vertical levels with one postprocessing model, but requires climatologies of expected value and standard deviation in order to compute the prerequisite standardized anomalies (cf. Sect. 3.2). They were computed with 3 years of observational and NWP data within the mathematical framework of GAMLSS (Sect. A) and the model specification in Eq. (5).
The effects of the climatology for response T obs are given in Fig. 2. f 1|hh (p) dominates the climatology by capturing the diurnal cycle at the surface (Fig. 2a). Figure 2b shows that the seasonality and the seasonality by pressure f 2 (doy) + f 3 (p, doy) have a positive effect in summer and a negative effect in winter and that the amplitude is strongest at the surface. The model thus adds the most temperature to the expected value µ in Eq. (A1) in summer and subtracts the most in the winter season. An n = 10 000 random sample of the standardized anomalies of T obs (derived from the climatology) is plotted in Fig. 2c. The sample mean and standard deviation (0.01; 1.00) are almost identical to the mean and standard deviation (0; 1) of a standard Gaussian distribution. The line structure in Fig. 2c originates from the model levels of the ECMWF-EPS onto which we interpolate vertically. The quantile-quantile plot in Fig. 2d compares the standardized anomalies in Fig. 2c with the standard Gaussian distribution. The assumption of the Gaussian distributed air temperature is valid but not perfect in the tails beyond 2 sigma.
A climatology is also computed for the ensemble forecasts. The climatology of the ensemble mean is modeled with a single GAMLSS. The standardized anomalies of T ens are similar to those of the observations in Fig. 2c. A simpler approach is sufficient for SD ens , which is standardized with the standard deviation of the difference of T ens and the expected value of the climatology, instead of a GAMLSS model

SAMOS
Climatological values for mean and standard deviation from Sect. 4.1.2 are then used to standardize the response variables by subtracting the mean and then dividing by the standard deviation, which yields the standardized anomalies T obs , T ens and SD ens . This standardization removes most of the pressure-level specific information so that the simplest version of NGR (Eq. 2, with T ens instead of T ens and SD ens for SD ens ) can be applied simultaneously to all pressure levels, seasons and times of (radiosonde) ascent. Corrections for the expected values are fairly small. Using station Bergen as an example, the coefficients of the standard anomaly Eq. (2) are β 0 = 0.0016 (0.0007, 0.0025) and β 1 = 0.9325 (0.9316, 0.9333). Since the intercept is almost zero, basically no offset correction has to be applied. A linear coefficient of less than 1 means that larger values of the standardized anomalies are corrected less than values close to the climatological mean. The respective coefficients for the standardized anomaly of the (log) of the standard deviation are γ 0 = −1.7922 (−1.7966, −1.7877) and γ 1 = 2.0677 (2.0546, 2.0808), respectively. Consequently, a substantial offset correction has to be applied and larger deviations from the climatological mean of the ensemble variance are more strongly corrected.

Comparison of calibration and sharpness
Several verification measures are used to compare calibration and sharpness of the distributions resulting from univariate postprocessing with SAMOS and NGRVC, respectively. Calibration is first evaluated with PIT histograms. Figure 3 shows a U shape for the raw ensemble, indicative of a strong underdispersion with truth too frequently beyond the extremes of the forecast. Both SAMOS and NGRVC are better calibrated although slightly overdispersive. The shape might indicate a misspecification of the underlying parametric distribution and possibly skewness (Gebetsberger et al., 2018(Gebetsberger et al., , 2019. Figure 4 zooms into three vertical layers. The strongest underdispersion of the raw ensemble and thus the largest improvement from postprocessing occurs in the lowest layer, 1020-850 hPa, occupied mostly by the planetary boundary layer. The postprocessing of NGRVC and SAMOS cannot completely eliminate the underdispersion in the PIT histogram, which is computed out-of-sample. Deviations of the raw ensemble from the truth are rarer above the boundary layer in the layers 850-700 hPa and 700-400 hPa, where postprocessing effectively removes a cold bias. Overall, the two different approaches of NGRVC and SAMOS have similar PIT histograms at the respective vertical layers. After calibration the predictive distributions slightly lose sharpness: Fig. 5 shows this increase for widths of the central prediction intervals of 50 % and 90 % (cf. , respectively. SAMOS and NGRVC have approximately the same average widths. However, postprocessing reduces the number of times that the prediction interval is extremely (and most likely unnecessarily) wide, as the reduction in the width of the interquartile range and the whiskers  Table 1, SAMOS considerably improves the log-likelihood and the CRPS, upon which NGRVC improves further. Due to this slight advantage, the NGRVC-postprocessed ensemble is used in the following multivariate postprocessing.

Multivariate postprocessing
Ensemble copula coupling (Sect. 3.3) is used to restore the vertical correlation structure of the temperature profiles contained in the raw ensemble to the univariately corrected temperatures at each level. To better understand the three methods for sampling from the raw ensemble (ECC-Q, ECC-R, ECC-T), they are first demonstrated with two case studies before their forecast performance is evaluated for the whole data set.

Case studies
The two case studies show ensemble copula coupling at work in typical winter and summer settings, respectively. The winter morning case is characterized by a strong surface-based temperature inversion, topped by a dry-adiabatically stratified layer, which is capped by a second inversion, as shown by the radiosonde measurements (bold black line) in Fig. 6. All members of the +30 h raw ensemble forecast (magenta solid lines) capture the surface-based inversion, albeit at different altitudes and with different strengths. The spread of the ensemble members around the second inversion is much larger -up to 7 K -and a substantial number of members have only a stable layer but no inversion. Further aloft all members stay within a few degrees of one another and of the observed profile. Figure 7 shows the performance of the postprocessing methods with a zoom into the lower part of the atmosphere where the raw ensemble members have larger deviations. The forecast performance of the raw ensemble deteriorates considerably when the forecasting horizon is extended by another 4 d from +30 to +126 h (second   Fig. 6 with additional postprocessed temperature profiles. The columns show (from left to right) raw ensemble; univariate postprocessing with NGRVC; and multivariate postprocessing with ensemble copula coupling (ECC) of the NGRVC-postprocessed ensemble using quantile sampling (ECC-Q) and distance-preserving sampling (ECC-T), respectively. All figures additionally have observation (bold black), climatological observation (mean (dashed gray) and standard deviation (gray)), and two particular forecast members. The first row contains the forecasts for lead time +30 h, the second row for +126 h. Note that a negative temperature gradient in pressure coordinates means an increase in temperature with altitude. row in Fig. 7). The larger overall spread among the ensemble members indicates the typical increase in forecast uncertainty with forecasting horizon. Especially the second inversion is poorly forecast -either not at all or at different altitudes and/or of different strength -and the spread of the ensemble members is as large as the 2-σ range of the observation climatology, which is indicated by the gray shading.
How well do the different methods correct the raw profiles? The univariate NGRVC method (second column of Fig. 7) describes the ensemble members parametrically with a Gaussian distribution and thus corrects only the ensemble mean and standard deviation as a function of the date (here fixed at 6 December 2016, 06:00 Z) and pressure level. Consequently all profiles are parallel to each other. The profiles of two exemplary ensemble members in the +126 h forecast no longer cross as they do in the raw ensemble. Also, the sharpness of the inversions in individual ensemble members is lost when the ensemble mean is formed since they are at different altitudes. However, the spread among individual members might still vary considerably with altitude since it is corrected separately (cf. Sect. 3.1), as is most obvious at the second inversion for the +126 h forecast.
The multivariate ensemble copula coupling method with sampling from the quantiles of the raw ensemble (ECC-Q) restores the overall shape of the member profiles and thus allows the cross-over near the surface of the blue lines in the +126 h forecast. It does not preserve the altitude-dependent spread. For example, the spread above the observed second inversion is reduced considerably by ECC-Q, which effec-tively eliminates the second inversion from some members. The ensemble copula coupling with distance-preserving correction (ECC-T), on the other hand, keeps the large uncertainty above the observed second inversion. Its corrections are slight, on the order of 1 K over the whole profile.
The vertical temperature gradient determines how easily an air parcel can be displaced in the vertical, which in turn determines exchange processes of, e.g., pollutants and the formation of clouds. Figure 8 shows this gradient, which is the first vertical derivative of the sounding profiles in Fig. 7. Although it is not the response variable of the postprocessing and fitting a model directly to the temperature gradients would likely yield better results, postprocessing should ideally still preserve gradients. The observations (bold lines) show an extremely large gradient in the approximately 30 hPa (300 m) above the surface, which strongly dampens vertical exchange processes. Another albeit weaker such layer is at around 880 hPa. All members of the raw ensemble show the surface-based extremum in a range between a third to the whole observed value. All members also contain the higher-level extremum with only small differences among the members albeit at only half the observed value. Most of the ensemble members of the NWP forecast computed 126 h prior to the observations still capture the strong gradient at the surface. Interestingly a substantial number of members capture the magnitude and depth of the strong gradient further aloft but with a wide range of its location, which is between 900 and 760 hPa. The univariate NGRVC correction (second column in Fig. 8) shifts all members towards a mean value and thus fails to show the larger variation of the temperature gradient. NGRVC only corrects the expectation and the scale of the ensemble, but not members individually. Most of the correction is to the mean. The scale correction is small and varies little over the whole profile, so that the gradient remains nearly the same. Both ensemble copula coupling methods, ECC-Q and ECC-T, on the other hand, bring back the uncertainty of the raw ensemble. For the shorterterm forecast of +30 h, the ECC postprocessing enhances the uncertainty at the lower and upper ends of the extrema of the gradients, while it reduces it (by less) for the longer +126 h forecast -more so for ECC-Q than for ECC-T, which can easily be seen in the two exemplary members (blue lines) of the +126 h forecast.
The second case study of a summer noon profile exemplifies the performance for a deep convective boundary layer capped by a thick stable layer and a nearly moist-adiabatic stratification aloft where the profile parallels the green saturated adiabat in Fig. 9. In contrast to the winter case, forecast uncertainty is much lower; the spread of the ensemble members within the boundary layer in the +36 h forecast is only a fraction of the 2-σ range of the climatological profile (gray shading) and still less than that range for the +108 h forecast as seen in Fig. 10. Similarly to the winter case, the spread above the top of the boundary layer is only about 1 K, and the observation lies within the range of the forecast. Since   Fig. 7 but for a summer case of 2 June 2017, 12:00 Z and lead times of +36 h (first row) and +108 h (second row), respectively. the univariate NGRVC method only corrects ensemble mean and standard deviation thin stable layers in individual ensemble members at the top of their respective surface-based convective boundary layers are not kept and smoothed out. Contrarily, both multivariate ensemble copula coupling methods are naturally able to keep them. As in the winter case, corrections from ECC-T are smaller than from ECC-Q. Without being able to fully explain this behavior, we speculate that this is related to the fact that ECC-Q assumes equally spaced probabilities for the transformation from probability space to temperature space. ECC-T borrows these probabilities from the raw ensemble. Thus, ECC-T could potentially represent outliers better than ECC-Q, as points in probability space can be distinct from the majority of the points.

Evaluation of forecast performance over the whole period
The performance of the different correction methods is evaluated both graphically with rank histograms and numerically with the energy score and variogram score (cf. Sect. 3.4) for the whole period 15 September 2016-15 September 2019. Figure 11 assesses the calibration of forecasts over all locations, lead times and pressure levels graphically through average (ARH) and band-depth (BDH) multivariate rank histograms. The strong excess at low ranks in both histograms suggests that the raw ensemble profiles are systematically too warm. The pronounced ∩ shape of the average rank histogram for the quantile samples of univariate NGRVC correction (second column in Fig. 11) indicates overdispersion. This is due to the construction of the NGRVC-Q ensemble, which assumes a correlation of unity between the temperatures at different levels (Wilks, 2017). All three versions of the multivariate ensemble copula coupling reduce this overdispersion, but fail to eliminate it. Figure 12 shows the univariate performance as evaluated with the CRPS (first column) and compares it to the multivariate performance evaluated with the energy score, which is the multivariate generalization of the CRPS (second column), and with the variogram score, which weights the representation of the correlation structure more strongly (third column). Lower values of all three scores are better. When evaluated over all lead times and vertical levels (first row) the univariate metric of the CRPS score in Fig. 12 obviously shows the largest improvement for the univariate postprocessing method (NGR-Q). However, the CRPS score for the Figure 12. Bootstrapped continuous ranked probability score (CRPS, first column), energy score (ES, second column) and variogram score (VS, third column) for the raw ensemble (RAW), the quantile sample of the NGR-postprocessed (NGRVC-Q), the quantile sample of ECC (ECC-Q), the random sample of ECC (ECC-R) and the transformed sample of ECC (ECC-T). The first row shows results for all pressure ranges and lead times combined, the second row stratified by three pressure ranges but over all lead times, and the third row by three lead times. Note that the range of the scores changes between rows. The bottom row shows the scores for the temperature gradient dT /dp. Lower values are better for all three scores. multivariate ensemble copula coupling methods is on par or only slightly worse. When the multivariate metrics of energy score (Fig. 12a) and variogram score (Fig. 12c) are used, the univariate one falls decisively behind even the raw ensem-ble, as already seen in the rank histograms (Fig. 11) and the winter case study (Fig. 8).
When the scores are stratified into three pressure ranges (second row of Fig. 12) it can be seen in all three metrics that the lowest layer next to the surface is the most difficult one for the raw ensemble. However, this is also the layer where the postprocessing achieves the largest improvements in all three scores. Unsurprisingly, the univariate postprocessing has poorer multivariate metrics of energy and variogram scores than the raw ensemble (with the exception of the energy score of the lowest layer). Improvements from the copula coupling methods in the lowest layer are much larger than in the two layers higher up. One result for which we have no explanation is why the energy score of the raw ensemble is better for the layer between 850 and 700 hPa, which is still partly in the boundary layer for at least part of the times, than for the layer above the altitude of the 700 hPa level. The variogram score, which more strongly weights the structure of the profile, does not show such a behavior.
A stratification by lead time (third row of Fig. 12) shows first the expected worsening of the raw ensemble scores for longer lead times. Again, the multivariate scores of the univariately postprocessed (NGR-Q) ensemble are worse than for the raw ensemble. While the copula coupling methods still improve the raw ensemble for a +72 h forecast, 144 h are too far into the future to still achieve an improvement over the raw ensemble data.
The last row of Fig. 12 shows the vertical temperature gradient, which is only a derived quantity but not the response variable of the postprocessing itself. Therefore it should not come as a surprise that the univariate postprocessing with NGR-Q performs worse than the raw ensemble for the univariate CRPS metric. The copula coupling methods manage to improve the energy score over the raw ensemble. However, they cannot beat the raw ensemble in the variogram score. Since the variogram score weights the correlation aspect of the multivariate performance more strongly, this indicates that if one is interested in accurate temperature gradient forecasts it might be advisable to directly postprocess the gradients themselves instead of deriving them from the postprocessed temperatures.

Discussion and conclusions
Postprocessing ensembles of numerical weather prediction model forecasts with statistical models trained on past ensemble forecasts and "truth", i.e., observations, improves these forecasts further and has thus been a burgeoning field of research, of which Vannitsem et al. (2018) give a comprehensive review. A recent push has been towards multivariate postprocessing, which honors the correlation structure -be it between different variables or the same variable along time or in space. One popular method is copula coupling. When applied in space, the method has so far been limited to (quasi)horizontal data, e.g., air temperature at three sites (three dimensions) (Schefzik, 2017) or (quasi)-horizontal fields of temperature, precipitation and wind (Schefzik et al., 2013).
Postprocessing the vertical structure of (ensemble) NWP forecasts, however, has remained largely unexplored. Since the vertical structure strongly influences exchange processes, onset and cessation of convection, formation of clouds and precipitation, improvement over the raw ensemble forecasts is arguably as important as for horizontal fields. Renkl (2013) started to explore the potential by postprocessing vertical temperature profiles from EPS forecasts by combining a reduction of the dimensionality with kriging. He reduced the dimensions of the vertical profiles by reconstructing them with their leading normal modes, which retained about 80 % of their variance. Subsequently the coefficients of these modes were adjusted with kriging for groups of ensemble members.
We take a different approach modeled on results for quasihorizontal postprocessing and postprocess the vertical profiles using a combination of univariate calibration and copula coupling. ECC needs probability distributions, also known as margins, at each pressure level. The margins are obtained by two univariate techniques, which are enhanced variants of the classical nonhomogeneous Gaussian regression (NGR). When the coefficients of NGR vary diurnally, seasonally and in the vertical (NGRVC, Simon et al. (2017); Lang et al. (2020)) the performance is slightly better than when using standardized anomalies with precomputed diurnally, seasonally and vertically varying climatologies of the SAMOS approach (Dabernig et al., 2017;Stauffer et al., 2017). While NGRVC was used here for preprocessing the data for ECC, SAMOS is equally suitable. NGRVC only corrects the expectation and the scale of the whole ensemble, but not of members individually. If the scale correction varies little over the whole profile, as is the case here, the vertical gradient remains nearly the same, which might be a disadvantage when achieving strongly varying gradients is important for the further application of the postprocessed profiles. Stable layers, for example, are crucial for the impediment of vertical motions and exchange: they determine the top of the boundary layer and the mixing volume for pollutants. While their scale might be too small to have predictability to +6 d, the processes forming them such as surface heating and large-scale subsidence might still be predictable. For a forecaster it is important to know about the possible existence of (very) stable layers despite an uncertainty in altitude, thickness and strength instead of having to infer their existence from a slight increase in stability in a smoothed-out profile that a univariate postprocessing such as NGRVC delivers.
Using the univariately corrected margins for further and multivariate postprocessing with copula coupling better reproduces such (potentially very thin) stable layers. With ECC the rank order structure of the ensemble of NWP forecasts from the ECMWF-EPS is conserved. Several sampling strategies might be used with copula coupling. We used three. To sample randomly (ECC-R) is unadvisable as it may deliver worse results than the raw ensemble forecasts. On the other hand, quantile resampling (ECC-Q) and sampling with rescaling of the raw ensemble by conserving the relative distances between ensemble members (ECC-T) and thus also accounting for extreme ensemble members are successful. Of the two, ECC-T is overall better than ECC-Q in all three verification measures used: rank histograms, energy score and variogram score.
The largest improvements are obtained for the profiles in the atmospheric boundary layer over all lead times and all seasons and also in the two case studies shown. Consequently, this is also the layer where the largest potential for improvements of the NWP model is, which is a well-known fact (e.g., Holtslag et al., 2013) since many processes shaping the boundary layer have to be parameterized in such models. The current postprocessing approach has potential for further improvement by adding more covariates, as has been shown in other postprocessing scenarios (e.g., Scheuerer, 2014;Scheuerer and Hamill, 2015a;Messner et al., 2017). For some practical forecasting purposes, the humidity forecasts have to be additionally postprocessed with the constraint that the dew-point temperature must not exceed the (dry-bulb) temperature.