Forecast score distributions with imperfect observations

The classical paradigm of scoring rules is to discriminate between two different forecasts by comparing them with observations. The probability distribution of the observed record is assumed to be perfect as a verification benchmark. In practice, however, observations are almost always tainted by errors and uncertainties. If the yardstick used to compare forecasts is imprecise, one can wonder whether such errors may have a strong influence on decisions based on classical scoring rules. We propose a new scoring rule scheme in the context of models that incorporate errors of the verification data. We rely on existing scoring rules and incorporate the uncertainty and error of the verification data through a hidden variable and the conditional expectation of scores when they are viewed as random variables. The proposed scoring framework is compared to scores used in practice, and is expressed in various setups, mainly an additive Gaussian noise model and a multiplicative Gamma noise model. By considering scores as random variables, one can access the entire range of their distribution. In particular, we illustrate that the commonly used mean score can be a misleading representative of the distribution when the latter is highly skewed or has heavy tails. In a simulation study, through the power of a statistical test and the computation of Wasserstein distances between score distributions, we demonstrate the ability of the newly proposed score to better discriminate between forecasts when verification data are subject to uncertainty, compared with the scores used in practice. Finally, we illustrate the benefit of accounting for the uncertainty of the verification data in the scoring procedure on a dataset of surface wind speed from measurements and numerical model outputs.


Introduction
Probabilistic forecast evaluation generally involves the comparison of a probabilistic forecast cumulative distribution function (cdf) F (in the following, we assume that F admits a probability density function f) with verification data y that could be of diverse origins (Jolliffe and Stephenson, 2004). In this context verification data are most of the time, but not exclusively, observational data. Questions related to the quality and variability of observational data have been raised and tackled in different scientific contexts. For instance, data assimilation requires careful estimation of uncertainties related to both numerical models and observations (Daley, 1993; Waller et al., 2014; Janjić et al., 2017). Apart from a few studies (see, e.g., Hamill and Juras, 2006; Ferro, 2017), error and uncertainty associated with the verification data have rarely been addressed in forecast evaluation. Nonetheless, errors in verification data can lead to severe mis-characterization of probabilistic forecasts, as illustrated in Figure 1, where a perfect forecast (a forecast having the same distribution as the true data) appears under-dispersed when the verification data have a larger variance than the true process. In the following, the underlying hidden process that is materialized by model simulations or measured through instruments will be referred to as the true process. Forecast evaluation can be performed qualitatively through visual inspection of statistics of the data such as in Figure 1 (see, e.g., Bröcker and Ben Bouallègue, 2020), and quantitatively through the use of scalar functions called scores. In this work, we focus on the latter: we illustrate and propose a framework based on hidden variables to embed imperfect verification data into scoring functions via priors on the verification data and on the hidden state.

Figure 1: A "perfect" (dark grey) and an "imperfect" (light grey) synthetic forecast, with respective distributions N(0, 2) and N(0.5, 4), are compared with "perfect" verification data (left) with distribution N(0, 2) and imperfect verification data Y (right) with distribution N(0, 3). Comparing a perfect forecast (a forecast having the same distribution as the true data) to corrupted verification data leads to an apparent under-dispersion of the perfect forecast.

Motivating example
To illustrate the proposed work, we consider surface wind data from a previous work (Bessac et al., 2018). Time series of ground measurements from the NOAA Automated Surface Observing System (ASOS) network are available at ftp://ftp.ncdc.noaa.gov/pub/data/asos-onemin and are extracted at a 1-minute resolution. We focus on January 2012 in this study; the data are filtered via a moving-average procedure and considered at an hourly level, leading to 744 data points. These ground-station data are considered as verification data in the following. Outputs from numerical weather prediction (NWP) forecasts are generated by using WRF v3.6 (Skamarock et al., 2008), a state-of-the-art numerical weather prediction system designed to serve both operational forecasting and atmospheric research needs. The NWP forecasts are initialized with the North American Regional Reanalysis fields dataset, which covers the North American continent with a resolution of 10 minutes of a degree, 29 pressure levels (1000-100 hPa, excluding the surface), every 3 hours from the year 1979 until the present. Simulations are started every day during January 2012 with a forecast lead time of 24 hours and cover the continental United States on a 25 km x 25 km grid with a time resolution of 10 minutes. Only one trajectory is simulated in this study. As observed in Figure 2, the uncertainty associated with observation data can affect the evaluation of the forecast. By extension, if two forecasts were to fall within the uncertainty range of the observations, ranking them would require a non-trivial choice from the forecaster. We will apply our proposed scoring framework to this dataset in Section 5.2 and compute scores embedding this uncertainty.

Imperfect verification data and underlying physical state
In verification and recalibration of ensemble forecasts, an essential verification step is to find data that precisely identify the forecast events of interest, the so-called verification dataset (see, e.g., Jolliffe and Stephenson, 2004). Jolliffe and Stephenson (2004), in Section 1.3 of their book, discussed the uncertainty associated with verification data, such as sampling uncertainty, direct measurement uncertainty, or changes in locations in the verification dataset. Verification data arise from various sources and consequently present various types of errors and uncertainties, such as measurement error (Dirkson et al., 2019), design and quality issues (e.g., gridded data versus weather stations, homogenization problems), sub-sampling variability (Mittermaier and Stephenson, 2015), or mis-matching resolutions. Mis-matching resolutions are becoming an issue as weather or regional climate models run at ever finer resolutions with increased fidelity, for which observational data are rarely available for evaluation, hence requiring one to account for up- or down-scaling errors. In order to provide accurate forecast evaluations, it is crucial to distinguish and account for these types of errors. For instance, the effect of observational error can be stronger at short forecast horizons, when the forecast error is smaller. Additive models have been used for the observational error (Ciach and Krajewski, 1999; Saetra et al., 2004) or used to enhance coarser resolutions to allow comparison with fine-resolution data (Mittermaier and Stephenson, 2015). In the following, we present an additive and a multiplicative framework based on a hidden variable to embed uncertainty in data sources.
A common way of modeling errors is by representing the truth as a hidden underlying process, also called the state (Kalman, 1960; Kalman and Bucy, 1961). Each source of data is then seen as a proxy of the hidden state and modeled as a function of it. This forms the basis of data assimilation models, where the desired state of the atmosphere is estimated through the knowledge of physics-based models and observational data that are both seen as versions of the non-observed true physical state. In the following, we base our scoring framework on the decomposition of the verification data as a function of a hidden true state, referred to as X, and an error term.

Related literature
Uncertainty and errors arise from various sources, such as the forecast and/or the verification data (predictability quality, uncertainty, errors, dependence and correlation, time-varying distribution) and the computational approximation of scores. A distribution-based verification approach was initially proposed by Murphy and Winkler (1987), where joint distributions for forecast and observation account for the information and interaction of both datasets. Wilks (2010) studied the effect of serial correlation of forecasts and observations on the sampling distribution of the Brier score and showed in particular that serially correlated forecasts inflate its variance. Bolin and Wallin (2019) discussed the misleading use of average scores, in particular for the continuous ranked probability score (CRPS), which is shown to be scale-dependent, when forecasts have varying predictability such as in non-stationary cases or exhibit outliers. Bröcker and Smith (2007), in their conclusion, highlighted the need to generalize scores when verification data are uncertain, in the context of properness and locality. More specifically, robustness and scale-invariance corrections of the CRPS have been proposed to account for outliers and varying predictability in scoring schemes. Concerning the impact of approximating the forecast density from finite ensembles, Zamo and Naveau (2018) compared four CRPS estimators and made recommendations in terms of the type of ensemble, whether random or a set of quantiles. In addition, several works focus on embedding verification data errors and uncertainties into scoring setups. Some methods aimed at correcting the verification data and used regular scoring metrics, such as perturbed ensemble methods (Anderson, 1996; Hamill, 2001; Bowler, 2008; Gorgas and Dorninger, 2012). Other works modeled observations as random variables and expressed scoring metrics in that context (Candille and Talagrand, 2008; Pappenberger et al., 2009; Pinson and Hagedorn, 2012). Some approaches directly modified the expression of metrics (Hamill and Juras, 2006; Ferro, 2017), and others (see, e.g., Hamill and Juras, 2006) accounted for varying climatology by sub-dividing the sample into stationary ones. In Ciach and Krajewski (1999) and Saetra et al. (2004), additive errors were embedded in scores via convolution. Analogously to the Brier score decomposition of Murphy (1973), Weijs and Van De Giesen (2011) and Weijs et al. (2010) decomposed the Kullback-Leibler divergence score and the cross-entropy score into uncertainty, reliability and resolution components in order to account for uncertain verification data. Kleen (2019) discussed the sensitivity of scores to additive measurement errors and proposed a measure of discrepancy between scores computed on uncorrupted and corrupted verification data.

Proposed scoring framework
This paper proposes an idealized framework to express commonly used scores in the presence of observational uncertainty and errors. The new framework relies on the decomposition of the verification data y into a "true" hidden state x and an error term, and on the representation of scores as random variables when the verification data are seen as a random variable. Information about the hidden state and the verification data is embedded via priors into the scoring framework, sharing analogies with classical Bayesian analysis (Gelman et al., 2013). More precisely, the proposed framework relies on the knowledge of the conditional distribution of the "true" hidden state given the verification data, or on information that allows one to calculate that conditional distribution.
Section 2 introduces a scoring framework that accounts for errors in the verification data for scores used in practice. Sections 3 and 4 respectively implement the additive and multiplicative error-embedding cases for the logarithmic score (log-score) and the CRPS. Section 5 illustrates the merits of the scores in a simulation context and in a real application case. Finally, Section 6 provides a final discussion and insights into future work. In particular, we discuss the possible generalization of the proposed scoring framework when the decomposition of the verification data y into a hidden state x and an error does not fall into an additive or multiplicative setting as in Sections 3 and 4.

Scoring rules under uncertainty
In the following, we propose a version of scores based on the conditional expectation of what we define as an "ideal" score given the verification data tainted by errors, when scores are viewed as random variables. The idea of conditional expectation was also used in Ferro (2017), but with a different conditioning; it is discussed below as a comparison.

Scores as random variables
In order to build the proposed score version, as well as to interpret scores beyond their mean and to assess their uncertainty, we rely on scores represented as random variables. In practice, scores are averaged over all available verification data y, and the uncertainty associated with this averaged score is mostly neglected. However, this uncertainty can be significant, as pointed out in Dirkson et al. (2019), where confidence intervals were computed through bootstrapping. In order to assess the probability distribution of a score s_0, we assume that Y is a random variable representing the verification data y and write the score as a random variable s_0(F, Y), where F is the forecast cdf to be evaluated. This representation gives access to the score distribution and enables exploring its uncertainty. A few works in the literature have already considered scores as random variables and performed associated analyses. In Diebold and Mariano (2002) and Jolliffe (2007), score distributions were considered to build confidence intervals and hypothesis tests to assess differences in score values when comparing forecasts to the climatology. In Wilks (2010), the effect of serial correlation on the sampling distribution of the Brier score was investigated. Pinson and Hagedorn (2012) illustrated and discussed score distributions across different prediction horizons and across different levels of observational noise.

Hidden state and scoring
Scores used in practice are evaluated on verification data y, that is, s_0(F, y). We define the "ideal" score as s_0(F, x), where x is the realization of the hidden underlying "true" state that gives rise to the verification data y. Ideally, one would use x to provide the best assessment of the quality of the forecast F through s_0(F, x); however, since x is not available, we consider the best approximation of s_0(F, x) given the observation y in terms of the L_2-norm, namely the conditional expectation. For a given score s_0, we propose the following score version s_∨(., y):

s_∨(F, y) = E( s_0(F, X) | Y = y ),    (1)

where X is the true hidden state, Y is the random variable representing the available observation used as verification data, and F is the forecast cdf to be evaluated. One can view this scoring framework as incorporating, in a Bayesian setting, information about the discrepancy between the true state and the verification data in terms of errors and uncertainty. To compute (1), we assume that the distributional features of the law of X and of the conditional law [Y|X], where [.] and [.|.] denote respectively marginal and conditional distributions, are directly available or can be obtained. This is the classical framework used in data assimilation, where both observational and state equations are given and the issue is to infer the realization x given the observations y; see also our example in Section 5.2. Under this setup, the following properties hold for the score s_∨:

E( s_∨(F, Y) ) = E( s_0(F, X) )   and   Var( s_∨(F, Y) ) ≤ Var( s_0(F, X) ).    (2)

Details of the computations are found in Appendix A. The first equality guarantees that any propriety attached to the mean value of s_0 is preserved with s_∨. The second inequality arises from the law of total variance and implies a reduced dispersion of the corrected score compared with the ideal score. This can be explained by the prior knowledge on the verification data that reduces the uncertainty in the score. These properties are illustrated with simulated examples and real data in Section 5. As a comparison, Ferro (2017) proposed a mathematical scheme to correct a score when error is present in the verification data y. Ferro's modified score, denoted s_F(f, y) in this work, is derived from a classical score, say s_0(f, x), where x is the hidden true state. With these notations, the corrected score s_F(f, y) is built such that it respects the following conditional expectation:

E( s_F(f, Y) | X = x ) = s_0(f, x).

In other words, the score s_F(f, y) computed from the y's provides on average the same value as the proper score computed from the unobserved true state x's. The modified score s_F explicitly targets biases in mean induced by imperfect verification data. In terms of assumptions, we note that the conditional law of Y given X needs to be known in order to construct s_F(f, y) from s_0(f, x); see, e.g., Definitions 2 and 3 in Ferro (2017). In particular, in Ferro (2017) the correction framework is illustrated with the logarithmic score in the case of a Gaussian error model (i.e., [Y|X = x] ∼ N(x, ω²)).

Implementation and generalization
The proposed score has desirable mathematical properties of unbiasedness and variance reduction while accounting for the error in the verification data; however, it relies on the knowledge of [X|Y], or equivalently of [Y|X] and [X]. We assume that the decomposition of Y into a hidden state X and an error is given and is fixed with respect to the evaluation framework. Additionally, the new framework requires integrating s_0(f, x) against conditional distributions, which might call for quadrature approximations in practice when closed-form expressions are not available. In this work, we assume access to the climatology distribution and rely on this assumption to compute the distribution of X or priors on its distribution. However, depending on the context and as exemplified in Section 5.2, alternative definitions and computations of X can be considered, for instance relying on known measurement-error models. Nonetheless, as illustrated in Sections 3 and 4, the simplifying key to the score derivation in Eqn. (1) is to use Bayesian conjugacy when applicable. This is illustrated in the following sections with a Gaussian additive case and a Gamma multiplicative case. Although Bayesian conjugacy is a simplifying assumption, as in most Bayesian settings it is not a necessary condition; for cases without explicit conjugate priors, all Bayesian and/or assimilation tools (e.g., sampling algorithms such as Markov chain Monte Carlo methods (Robert and Casella, 2013) or non-parametric approaches such as Dirichlet processes) could be used to draw samples from [X|Y] and estimate the distribution [s_0(F, X)|Y = y] for a given s_0(F, .), as sketched in the example below. Finally, in the following we assume that distributions have known parameters; this assumption can be relaxed by considering prior distributions on each involved parameter via hierarchical Bayesian modeling. The scope of this paper is to provide a conceptual tool, and the question of parameter estimation, besides the example in Section 5, is not treated in detail. In Section 5, we apply the proposed score derivation to the surface wind data example described in the introduction. In Section 6, we discuss the challenges of computing scores as defined in Eqn. (1) or in Ferro (2017) in more general state-space models that are not limited to additive or multiplicative cases.
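As a concrete illustration of this sampling route, the following R sketch approximates s_∨(F, y) = E( s_0(F, X) | Y = y ) by a Monte Carlo average over draws from [X|Y = y]. The function names, the Gaussian log-score example and the posterior draws are purely illustrative and are not taken from the paper's repository.

## Minimal sketch: Monte Carlo approximation of s_v(F, y) = E( s0(F, X) | Y = y )
## when draws from the conditional distribution [X | Y = y] are available,
## e.g. from an MCMC or assimilation scheme. All names are illustrative.
score_corrected_mc <- function(s0, forecast, x_post) {
  # s0:       function(forecast, x) returning the score for one verification point x
  # forecast: object describing the forecast F (here, a list of parameters)
  # x_post:   vector of draws from [X | Y = y]
  mean(sapply(x_post, function(x) s0(forecast, x)))
}

# Example with the log-score of a Gaussian forecast N(mu, sigma^2)
log_score <- function(forecast, x) {
  -dnorm(x, mean = forecast$mu, sd = forecast$sigma, log = TRUE)
}

# Hypothetical posterior draws for a single verification point y
x_post <- rnorm(10000, mean = 0.3, sd = 0.8)
score_corrected_mc(log_score, list(mu = 0, sigma = 2), x_post)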

Gaussian additive case
As discussed earlier, a commonly used setup in applications is when errors are additive, and their natural companion distribution is Gaussian. Hereafter, we derive the score from Eqn. (1) for the commonly used log-score and CRPS in the Gaussian additive case, where the hidden state X and the verification data Y are linked through the following system:

Y = X + ε,   with   X ∼ N(μ_0, σ_0²)   and   ε ∼ N(0, ω²),    (A)

where Y is the observed verification data, X is the hidden true state, and all Gaussian variables are assumed independent. In the following, for simplicity we consider E(X) = E(Y); however, one can easily update the model to a mean-biased verification data Y. Parameters are assumed to be known from the application; one could use priors on the parameters when estimates are not available. In the following, we express different versions of the log-score and CRPS: the ideal version, the used-in-practice version, and the error-embedding version from Eqn. (1). In this case, as well as in Section 4, since conditional distributions are obtained through Bayesian conjugacy, most of the computational effort lies in integrating the scores against the conditional distributions.

Log-score versions
For a Gaussian predictive pdf f with mean μ and variance σ², the log-score, defined by

s_0(f, x) = log σ + (x − μ)² / (2σ²) + (1/2) log 2π,    (3)

has been widely used in the literature. Ideally, one would access the true state X and evaluate the forecast against its realizations x; however, since X is not accessible, scores are computed against observations y:

s_0(f, y) = log σ + (y − μ)² / (2σ²) + (1/2) log 2π.    (4)

Applying basic properties of Gaussian conditioning, our score defined by (1) can be written as

s_∨(f, y) = log σ + ( (ȳ − μ)² + σ_0²ω²/(σ_0² + ω²) ) / (2σ²) + (1/2) log 2π,    (5)

where ȳ = E(X|Y = y) = (ω²μ_0 + σ_0² y) / (σ_0² + ω²). In particular, ȳ arises from the conditional expectation of s_0(f, X) in (3) and is a weighted sum that updates the prior information about X ∼ N(μ_0, σ_0²) with the observation y drawn from Y ∼ N(μ_0, σ_0² + ω²). In the following equations, ȳ represents the same quantity. Details of the computations are found in Appendix B and rely on the integration of Eqn. (3) against the conditional distribution [X|Y = y].
As a comparison, the corrected score from Ferro (2017) is expressed as

s_F(f, y) = log σ + ( (y − μ)² − ω² ) / (2σ²) + (1/2) log 2π,

with the same notations and under the assumption [Y|X = x] ∼ N(x, ω²). In this case, the distribution of the hidden state x is not involved; however, y is not scale-corrected, and only a location-correction is applied compared with Eqn. (5).
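For concreteness, the following R sketch implements the log-score versions above under model (A): the in-practice score (4), the proposed score (5) and Ferro's correction. The function and argument names are ours, and the closing toy check only illustrates the behaviour on simulated data.

## Sketch of the log-score versions under the additive Gaussian model (A).
## Forecast: N(mu, sigma^2); hidden state X ~ N(mu0, sigma0^2); Y = X + eps, eps ~ N(0, omega^2).
logscore_practice <- function(y, mu, sigma) {                        # Eqn. (4)
  log(sigma) + (y - mu)^2 / (2 * sigma^2) + 0.5 * log(2 * pi)
}

logscore_corrected <- function(y, mu, sigma, mu0, sigma0, omega) {   # Eqn. (5)
  ybar  <- (omega^2 * mu0 + sigma0^2 * y) / (sigma0^2 + omega^2)     # E(X | Y = y)
  varXY <- sigma0^2 * omega^2 / (sigma0^2 + omega^2)                 # Var(X | Y = y)
  log(sigma) + ((ybar - mu)^2 + varXY) / (2 * sigma^2) + 0.5 * log(2 * pi)
}

logscore_ferro <- function(y, mu, sigma, omega) {                    # Ferro (2017)
  log(sigma) + ((y - mu)^2 - omega^2) / (2 * sigma^2) + 0.5 * log(2 * pi)
}

# Toy check: perfect forecast (mu = mu0 = 0, sigma = sigma0 = 2), noise omega^2 = 1
y <- rnorm(1e5, 0, sqrt(4 + 1))                                      # Y ~ N(mu0, sigma0^2 + omega^2)
mean(logscore_practice(y, 0, 2))                                     # inflated by the observational noise
mean(logscore_corrected(y, 0, 2, mu0 = 0, sigma0 = 2, omega = 1))    # close to E(s0(f0, X))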

CRPS versions
Besides the logarithmic score, the CRPS is another classical scoring rule used in weather forecast centers. It is defined as

c_0(F, y) = E|Z − y| − (1/2) E|Z − Z'|,

where Z and Z' are i.i.d. random variables with cdf F and continuous pdf f. The CRPS can be rewritten as

c_0(F, y) = ∫_{−∞}^{y} F(u)² du + ∫_{y}^{+∞} (1 − F(u))² du,

where 1 − F corresponds to the survival function associated with the cdf F. For example, the CRPS for a Gaussian forecast with parameters μ and σ is equal to

c_0(f, y) = σ [ ((y − μ)/σ)(2Φ((y − μ)/σ) − 1) + 2φ((y − μ)/σ) − 1/√π ],    (7)

where φ and Φ are the pdf and cdf of a standard normal distribution (Gneiting et al., 2005; Taillardat et al., 2016). Similarly to (4), in practice one evaluates (7) against observations y since the hidden state X is unobserved.
Under the Gaussian additive model (A), the proposed CRPS defined by Eqn. (1) is written as

c_∨(f, y) = σ_ω [ ((ȳ − μ)/σ_ω)(2Φ((ȳ − μ)/σ_ω) − 1) + 2φ((ȳ − μ)/σ_ω) ] − σ/√π,    (8)

where ȳ is defined as above and σ_ω² = σ² + σ_0²ω²/(σ_0² + ω²). Details of the computations are found in Appendix C and rely on principles similar to those of the previous paragraph.
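A possible R rendering of the Gaussian CRPS (7) and of the corrected version (8) is sketched below; the corrected expression is the one obtained by integrating (7) against [X|Y = y] ∼ N(ȳ, σ_0²ω²/(σ_0² + ω²)), and the function names are ours.

## Sketch of the CRPS versions under the additive Gaussian model (A).
crps_gauss <- function(y, mu, sigma) {                               # Eqn. (7)
  z <- (y - mu) / sigma
  sigma * (z * (2 * pnorm(z) - 1) + 2 * dnorm(z) - 1 / sqrt(pi))
}

crps_corrected <- function(y, mu, sigma, mu0, sigma0, omega) {       # Eqn. (8)
  ybar     <- (omega^2 * mu0 + sigma0^2 * y) / (sigma0^2 + omega^2)  # E(X | Y = y)
  sigma_w2 <- sigma^2 + sigma0^2 * omega^2 / (sigma0^2 + omega^2)    # sigma_omega^2
  zbar     <- (ybar - mu) / sqrt(sigma_w2)
  sqrt(sigma_w2) * (zbar * (2 * pnorm(zbar) - 1) + 2 * dnorm(zbar)) - sigma / sqrt(pi)
}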

Scores distributions
Under the Gaussian additive model (A), the random variables associated with the log-score used in practice (4) and with the proposed log-score (5) can be written as

s_0(f, Y) =_d a_0 χ²_0 + b_0   and   s_∨(f, Y) =_d a_∨ χ²_∨ + b_∨,    (9)

where =_d means equality in distribution and χ²_0 and χ²_∨ are noncentral chi-squared random variables with 1 degree of freedom and respective non-centrality parameters λ_0 and λ_∨. The explicit expressions of the constants λ_0, λ_∨, a_0, a_∨, b_0 and b_∨ are found in Appendix D. The distribution of s_0(f, X) can be derived similarly. Figure 3 illustrates the distributions in (9) for various values of the noise parameter ω. The distributions are very peaked owing to the single degree of freedom of the chi-squared distribution; moreover, their bulks are far from the mean of the ideal score s_0(., X), challenging the use of the mean score to compare forecasts. The concept of propriety is based on averaging scores; however, the asymmetry and long right tails of the noncentral chi-squared densities make the mean an unreliable statistic to represent such distributions. Bolin and Wallin (2019) discussed the misleading use of averaged scores in the context of time-varying predictability, where different scales of prediction errors arise and generate different orders of magnitude of the evaluated scores. The newly proposed scores, however, exhibit a bulk of their distribution closer to the mean and with a reduced variance, as stated in Eqn. (2), leading to more confidence in the mean when the latter is considered.
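These distributions can also be explored by direct simulation; the short R sketch below samples the in-practice and corrected log-scores under model (A), with illustrative parameter values, and compares their dispersions.

## Sketch: sampling the log-score distributions (9) under model (A); parameter values are illustrative.
mu0 <- 0; sigma0 <- 2; omega <- 1                                    # hidden state and noise
mu  <- 1; sigma  <- 3                                                # imperfect forecast
y    <- rnorm(1e5, mu0, sqrt(sigma0^2 + omega^2))                    # draws of Y
ybar <- (omega^2 * mu0 + sigma0^2 * y) / (sigma0^2 + omega^2)        # E(X | Y)
s0_y <- log(sigma) + (y - mu)^2 / (2 * sigma^2) + 0.5 * log(2 * pi)  # Eqn. (4)
sv_y <- log(sigma) + ((ybar - mu)^2 + sigma0^2 * omega^2 / (sigma0^2 + omega^2)) /
        (2 * sigma^2) + 0.5 * log(2 * pi)                            # Eqn. (5)
c(var_practice = var(s0_y), var_corrected = var(sv_y))               # reduced dispersion for s_v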
Similarly, under the additive Gaussian model (A), the random variable associated with the proposed CRPS defined by (8) is written as

c_∨(f, Y) =_d σ_ω [ ((Ȳ − μ)/σ_ω)(2Φ((Ȳ − μ)/σ_ω) − 1) + 2φ((Ȳ − μ)/σ_ω) ] − σ/√π,    (10)

where σ_ω² = σ² + σ_0²ω²/(σ_0² + ω²) and the random variable Ȳ = ω²μ_0/(σ_0² + ω²) + σ_0² Y/(σ_0² + ω²) follows a Gaussian pdf with mean μ_0 and variance σ_0² × σ_0²/(σ_0² + ω²). The distribution in (10) does not belong to any known parametric family; however, it is still possible to characterize the score distribution through sampling when the distribution of Y is available. Finally, having access to distributions like (9) or (10) gives access to the whole range of uncertainty of the score distributions, helping to derive statistics that are more representative than the mean, as pointed out above, and to compute confidence intervals without bootstrapping approximations as in (Wilks, 2010; Dirkson et al., 2019). Finding adequate representatives of a score distribution that show reliable discriminative skills is beyond the scope of this work. Nevertheless, in Appendix G we take the concept of score distributions further and apply it to computing distances between score distributions in order to assess their discriminative skills.

Figure 3: Distributions (9) of the log-score used in practice s_0(., Y) (green) and the proposed score s_∨(., Y) (red) computed for the perfect forecast f_0 on the left (μ = μ_0 = 0 and σ = σ_0 = 2) and an imperfect forecast f on the right (μ = 1 and σ = 3). The mean of the ideal score is depicted with an orange line: E(s_0(f_0, X)) on the left and E(s_0(f, X)) on the right. The following distributions are used: X ∼ N(0, 4) and Y ∼ N(0, 4 + ω²) with several levels of observational noise ω² = 0.5, 1, 3.

Multiplicative Gamma case
The Gaussian assumption is appropriate when dealing with averages, for example mean temperatures; however, the normal hypothesis cannot be justified for positive and skewed variables such as precipitation intensities. For instance, multiplicative models are often used in hydrology to account for various errors and uncertainties (measurement errors, process uncertainty and unknown quantities); see (Kavetski et al., 2006a,b; McMillan et al., 2011). An often-used alternative in such cases is the Gamma distribution, which works fairly well in practice to represent the bulk of rainfall intensities. Hence, we assume in this section that the true but unobserved X follows a Gamma distribution with parameters α_0 and β_0. For positive random variables such as precipitation, additive models cannot be used to introduce errors. Instead, we prefer a multiplicative model of the type

Y = X ε,    (B)

where ε is a positive random variable independent of X. To make the computations feasible, we model the error ε with an inverse Gamma pdf with parameters a and b:

f_ε(u) = ( b^a / Γ(a) ) u^{−(a+1)} exp(−b/u),   for u > 0.

The basic conjugate prior properties of such Gamma and inverse Gamma distributions allow us to easily derive the pdf of [X|Y = y]. Analogously to Section 3, we express the log-score and CRPS within this multiplicative Gamma model in the following paragraphs.

Log-score versions
Let us consider a Gamma distribution f with parameters α > 0 and β > 0 for the prediction. With obvious notations, the log-score for this forecast f becomes

s_0(f, x) = −α log β + log Γ(α) − (α − 1) log x + βx.    (12)

Under the Gamma multiplicative model (B), the random variable associated with the corrected log-score defined by Eqns. (1) and (12) is expressed as

s_∨(f, Y) = −α log β + log Γ(α) − (α − 1) E( log X | Y ) + β E( X | Y ),    (13)

where the conditional moments follow from the Gamma conditional distribution [X|Y] and involve the digamma function ψ(x), defined as the logarithmic derivative of the Gamma function, namely ψ(x) = d log Γ(x)/dx. Details of the computations are found in Appendix E.
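A minimal R sketch of this corrected Gamma log-score is given below; it assumes the conjugate update [X|Y = y] ∼ Gamma(α_0 + a, β_0 + b/y) (shape-rate convention) implied by the inverse-Gamma error model above, and the function names are ours.

## Sketch of the log-score versions under the multiplicative Gamma model (B).
## Assumption (ours): [X | Y = y] = Gamma(shape = alpha0 + a, rate = beta0 + b / y).
logscore_gamma <- function(x, alpha, beta) {                         # Eqn. (12)
  -dgamma(x, shape = alpha, rate = beta, log = TRUE)
}

logscore_gamma_corrected <- function(y, alpha, beta, alpha0, beta0, a, b) {
  A <- alpha0 + a                                                    # conditional shape
  B <- beta0 + b / y                                                 # conditional rate
  # E(log X | Y = y) = digamma(A) - log(B),  E(X | Y = y) = A / B
  -alpha * log(beta) + lgamma(alpha) -
    (alpha - 1) * (digamma(A) - log(B)) + beta * A / B
}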

CRPS versions
For a Gamma forecast with parameters α and β, the corresponding CRPS (see, e.g., Taillardat et al., 2016; Scheuerer and Möller, 2015) is equal to

c_0(f, y) = y ( 2F_{α,β}(y) − 1 ) − (α/β) ( 2F_{α+1,β}(y) − 1 ) − 1 / ( β B(1/2, α) ),    (14)

where F_{α,β} denotes the Gamma cdf with shape α and rate β and B(.,.) is the beta function. Under the multiplicative Gamma model (B), the random variable associated with the CRPS (14) corrected by Eqn. (1) is obtained by integrating (14) against the Gamma conditional distribution [X|Y]; its expression, Eqn. (15), and the details of the computations are given in Appendix F. Similarly to Section 3.3, one can access the range of uncertainty of the proposed scores (13) and (15) when sampling from the distribution of Y is available. As an illustration, Figure 4 shows the distributions of the three CRPS versions presented in this section. As in the previous section, the benefits of embedding the uncertainty of the verification data are noticeable in the reduced variance of the distributions shaded in red and in the smaller distance between the bulk of the red distributions and the mean value of the ideal score.

Applications and illustrations
The following section applies and illustrates, through simulation studies, the benefit of accounting for uncertainty in scores as presented in Section 2, through the power analysis of a hypothesis test and the numerical illustration of the motivating wind-data example from Section 1. In Appendix G, we illustrate further the consideration of score distributions via the Wasserstein distance.

Figure 4: Distributions of the CRPS versions (14) and (15). The mean of the ideal score E(c_0(., X)) is depicted with an orange line. Left: score distributions for the perfect forecast f_0 having the same distribution as X (α = α_0 = 7 and β = β_0 = 2), validated against X and against corrupted verification data Y with different levels of error: a = 7, 9 and b = 8. Right: score distributions for the imperfect forecast f with distribution parameters α = 4 and β = 1, validated against X and corrupted verification data Y. The parameters α_0 = 7, β_0 = 2 are used for the hidden state X.

Power analysis of hypothesis testing
In this section, a power analysis of the following hypothesis test is performed on simulated data, following Diebold and Mariano (2002), in order to investigate the discriminative skills of the proposed score from Eqn. (1). In Diebold and Mariano (2002), hypothesis tests focused on discriminating forecasts from the climatology; in the current study, we test the ability of the proposed score to discriminate any forecast f from the perfect forecast f_0 (the forecast having the distribution of the "true" hidden state X). We take as reference score value for the perfect forecast f_0 the mean score evaluated against the true process, E(s_0(f_0, X)). Hypothesis tests are tuned to primarily minimize the type I error (wrongly rejecting the null hypothesis); consequently, it is common practice to assess the power of a test. The power p is the probability of rejecting a false null hypothesis; it is expressed as 1 − β, where β is the type II error (failing to reject a false null hypothesis), that is, p = P(Rejecting H_0 | H_1 true).
The closer to 1 the power is, the better the test is at detecting a false null hypothesis. For a given forecast f, the considered hypotheses are expressed as

H_0: forecast f is perfect (f = f_0) through the score s, leading to E(s(f, .)) close to E(s_0(f_0, X)),
H_1: forecast f is imperfect, leading to E(s(f, .)) far from E(s_0(f_0, X)),    (16)

where f_0 and f are respectively the perfect forecast and an imperfect forecast to be compared with the perfect forecast f_0. In the following, the score s(f, .) will represent respectively s_0(f, X), s_0(f, Y) and s_∨(f, Y), in order to compare the ability of the ideal score, the score used in practice and the proposed score. The parameters associated with the hypotheses are the parameters μ and σ of the imperfect forecast f in the additive Gaussian model below; they are varied to compute the different powers. The statistical test corresponding to (16) rejects H_0 when

t > c,    (17)

where t = |E(s(f, .)) − E(s_0(f_0, X))| is the test statistic and c is defined via the type I error P(t > c | H_0 is true) = α, with level α = 0.05 in the present study.
To illustrate this test with numerical simulations, the additive Gaussian model (A) is considered for the log-score, where the forecast, X and Y are assumed to be normally distributed with an additive error. The expectation in (17) is approximated over N = 1000 verification data points, namely realizations of the true state X and of the corrupted data Y. The approximated test statistic is denoted by t̂. The power p is expressed as P(t̂ > c | f is imperfect) and is computed over 10000 samples of length N when the parameters μ and σ of the forecast f are varied.
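The power computation can be sketched as follows in R for the corrected log-score; the replication numbers, noise level and imperfect-forecast parameters are illustrative choices rather than those used for Figure 5.

## Sketch of the power computation for test (17) under model (A), corrected log-score.
mu0 <- 0; sigma0 <- 2; omega <- 1; N <- 1000; R <- 2000; level <- 0.05
ideal_mean <- log(sigma0) + 0.5 + 0.5 * log(2 * pi)                  # E(s0(f0, X)) for f0 = N(mu0, sigma0^2)

s_corrected <- function(y, mu, sigma) {                              # Eqn. (5)
  ybar <- (omega^2 * mu0 + sigma0^2 * y) / (sigma0^2 + omega^2)
  log(sigma) + ((ybar - mu)^2 + sigma0^2 * omega^2 / (sigma0^2 + omega^2)) /
    (2 * sigma^2) + 0.5 * log(2 * pi)
}

tstat <- function(mu, sigma) {                                       # |mean score - E(s0(f0, X))|
  y <- rnorm(N, mu0, sigma0) + rnorm(N, 0, omega)                    # Y = X + eps
  abs(mean(s_corrected(y, mu, sigma)) - ideal_mean)
}

crit  <- quantile(replicate(R, tstat(mu0, sigma0)), 1 - level)       # critical value c under H0
power <- mean(replicate(R, tstat(1, 3)) > crit)                      # power against f = N(1, 3^2)
power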
In Figure 5, the above power p is shown for varying mean μ and standard deviation σ of the forecast f, in order to demonstrate the ability of the proposed score s_∨(., Y) to better discriminate between forecasts than the commonly used score s_0(., Y). One expects the power to be low around μ_0 and σ_0, respectively, and as high as possible outside these values. We notice that the ideal score s_0(., X) and the proposed score s_∨(., Y) have similar power for the test (17), suggesting similar discriminating skills for both scores. However, the score used in practice, s_0(., Y), results in an ill-behaved power as the observational noise increases (from left to right), indicating the necessity to account for the uncertainty associated with the verification data. The ill-behaved power illustrates the propensity of the test based on s_0(., Y) to reject H_0 too often, and in particular wrongfully when the forecast f is perfect and equals f_0. In addition, the power associated with the score s_0(., Y) fails to reach the nominal test level α because of the difference in means between s_0(., X) and s_0(., Y) (E(s_0(f, X)) ≠ E(s_0(f, Y)) for any forecast f) caused by the errors in the verification data Y. This highlights the unreliability of scores evaluated against corrupted verification data. Both the varying mean and the varying standard deviation lead to similar conclusions regarding the improved discriminative skills of the proposed score over a score evaluated on corrupted verification data.

Application to wind speed prediction
As discussed in the motivating example of Section 1, we consider surface wind speed data from the previous work (Bessac et al., 2018) and the associated probabilistic distributions. In Bessac et al. (2018), a joint distribution for observations, denoted here X_ref, and NWP model outputs is proposed, based on Gaussian distributions in a space-time fashion. This joint model aims at predicting surface wind speed based on the conditional distribution of X_ref given the NWP model outputs. The data are Box-Cox-transformed in this study to approximate normal distributions. The model was fitted by maximum likelihood for an area covering parts of Wisconsin, Illinois, Indiana and Michigan in the United States. In this study, we focus on one station in Wisconsin and recover the parameters of the joint distribution of observations X_ref and NWP outputs at this station from the joint spatio-temporal model. In the following, we evaluate with scores the probability distribution of the NWP model outputs depicted in Figure 2 with respect to the observations X_ref.

Figure 5: Power of the test (17) for increasing levels of observational noise, ω² = 0.5 (central) and ω² = 1 (right). In the simulations, the true state X is distributed as N(μ_0 = 0, σ_0² = 4) and Y is distributed as N(0, σ_0² + ω²). The power is expected to be low around μ_0 (resp. σ_0) and as large as possible elsewhere.
Mean score, NWP prediction:
Log-score: 0.73 (0.48)

Table 1: Average scores (log-score and CRPS) computed for the predictive distribution of the NWP model; the associated standard deviation is given in parentheses. The statistics are computed over the entire studied time series. The mean ideal score E(s_0(f, X)), the average score computed in practice against the measurements E(s_0(f, Y)), and the proposed score E(s_∨(f, Y)) embedding the error in the verification data are compared.
However, in the validation step of the predictive probabilistic model, the observations shown in Figure 2 were used without accounting for their potential uncertainty and error. This leads to a discrepancy between the target variable X_ref and the verification data, which we denote Y_obs. Following Pinson and Hagedorn (2012), a reasonable model for unbiased measurement error in wind speed is ε_obs ∼ N(0, 0.25). Following Section 3, we propose the additive framework

Y_obs = X_ref + ε_obs,   with   X_ref ∼ N(μ_0, σ_0²)   and   ε_obs ∼ N(0, 0.25),

to account for the observational noise in the scoring framework, where μ_0 = 2.55 and σ_0 = 1.23 from the fitted distribution in (Bessac et al., 2018). In Table 1, log-scores and CRPS are computed as averages over the studied time series in the additive Gaussian case with the formulas given in Section 3. The variance associated with each average score is provided in parentheses. Table 1 shows a significant decrease of the variance when the proposed score is used, compared with the score commonly used in practice, which does not account for measurement error. One can notice that the variances of the scores used in practice are considerably high, limiting the reliability of these computed scores for decision-making purposes. Additionally, the new mean scores are closer to the ideal mean log-score and CRPS, showing the benefit of accounting for observational errors in the scoring framework. Figure 6 shows the pdf of the scores considered in Table 1; the skewness and the large dispersion in the upper tail illustrate, with wind speed data, cases where the mean is potentially not an informative summary statistic of the whole distribution. The non-corrected version of the score has a large variance, raising concerns about the reliability of scores computed on error-prone data.
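As a usage illustration, the corrected scores of Section 3 can be applied to this setting with μ_0 = 2.55, σ_0 = 1.23 and ω² = 0.25. The R sketch below reuses the functions logscore_practice(), logscore_corrected(), crps_gauss() and crps_corrected() from the earlier sketches; the verification series and the NWP predictive parameters are placeholders, not the actual data of Table 1.

## Illustrative application of the corrected scores to the wind-speed setting of Section 5.2.
mu0 <- 2.55; sigma0 <- 1.23; omega <- sqrt(0.25)                     # fitted state and error model
y_obs  <- rnorm(744, mu0, sqrt(sigma0^2 + omega^2))                  # placeholder verification series
mu_nwp <- 2.4; sigma_nwp <- 1.4                                      # placeholder NWP predictive parameters

mean(logscore_practice(y_obs, mu_nwp, sigma_nwp))
mean(logscore_corrected(y_obs, mu_nwp, sigma_nwp, mu0, sigma0, omega))
mean(crps_gauss(y_obs, mu_nwp, sigma_nwp))
mean(crps_corrected(y_obs, mu_nwp, sigma_nwp, mu0, sigma0, omega))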

Discussion
We have quantified, in terms of variance and even of full distribution, the need to account for the error associated with the verification data when evaluating probabilistic forecasts with scores. An additive framework and a multiplicative framework have been studied in detail to account for the error associated with verification data. Both setups involve a probabilistic model for the accounted errors and a probabilistic description of the underlying non-observed physical process. Although we look only at idealized cases where the involved distributions and their parameters are known, this approach enables us to understand the importance of accounting for the error associated with the verification data.

Figure 6: Empirical distribution of the uncorrected (green) and corrected (red) scores for the log-score (left) and CRPS (right) for the probabilistic distribution of NWP model 1 evaluated against observations tainted with uncertainty. The ideal mean score value is illustrated by the orange vertical line.
Scores with hidden states The proposed scoring framework was applied to standard additive and multiplicative models for which, via Bayesian conjugacy, the expression of the new scores could be made explicit. However, if the prior on X is not conjugate to the distribution of Y, one can approximate the conditional distribution [X|Y = y] through sampling algorithms such as Markov chain Monte Carlo methods. Additionally, one can relax the assumption of known parameters for the distributions [X] and [Y|X] by considering priors on the parameters (e.g., priors on μ_0 and σ_0). In practice, we rely on the idea that the climatology and/or information on measurement errors can help express the distribution of X or its priors.
In the current literature on scoring with uncertain verification data, most proposed works rely on additive errors, as in (Ciach and Krajewski, 1999; Saetra et al., 2004; Mittermaier and Stephenson, 2015). Multiplicative error models for various error representations have been proposed in modeling studies, such as for precipitation (McMillan et al., 2011), but have not been considered in scoring correction frameworks. Furthermore, additive and multiplicative state-space models can be generalized to non-linear and non-Gaussian state-space model specifications; see Chapter 7 of (Cressie and Wikle, 2015) for a discussion and examples. More generally, one could consider the following state-space model specification:

X = f(η),   Y = g(X, ε),

where f and g are non-linear functions describing the hidden state and the state-observation mapping, and η and ε are stochastic terms representing respectively the stochasticity of the hidden state and the observational error. This generalized specification of the state-observation model could be integrated in future work into the proposed scoring framework via Bayesian specifications of the new score, in order to account for prior information on the verification data Y and its uncertainty and for priors on the true state X.
Scores as random variables Finally, this study raises the important point of investigating the distribution of scores when the verification data are considered as a random variable. Indeed, investigating the means of scores may not provide sufficient information to compare score discriminative capabilities. This has been pointed out and investigated in (Taillardat et al., 2019) in the context of evaluating extremes. This topic has also been extensively discussed by Bolin and Wallin (2019) in the context of varying predictability, which generates a non-homogeneity in the score values that is poorly represented by an average. One could choose to take into account the uncertainty associated with the inference of distribution parameters and compute different statistics of the distribution rather than the mean. In this case, a Bayesian setup similar to the current work could elegantly integrate the different hierarchies of knowledge, uncertainties and a priori information to generalize the notion of scores.

Codes and data
Codes and data used for this study can be found at https://github.com/jbessac/uncertainty_scoring.
A Proof of Equation (2)

The first equality in (2) follows from the law of total expectation. In our case, the variable U = s_0(f, X) and s_∨(f, Y) = E(U | Y), so that

E( s_∨(f, Y) ) = E( E( s_0(f, X) | Y ) ) = E( s_0(f, X) ).

To show the inequality in (2), we use the classical variance decomposition

Var(U) = E( Var(U | Y) ) + Var( E(U | Y) ).

With our notations, we have

Var( s_0(f, X) ) = E( Var( s_0(f, X) | Y ) ) + Var( s_∨(f, Y) ).

This leads to Var( s_∨(f, Y) ) ≤ Var( s_0(f, X) ).

B Proof of Equation (5)
To express the score proposed in (1), one needs to derive the conditional distribution [X|Y = y] from Model (A). More precisely, the Gaussian conditional distribution of X given Y = y is equal to

[X | Y = y] ∼ N( ȳ, σ_0²ω²/(σ_0² + ω²) ),   with   ȳ = (ω²μ_0 + σ_0² y) / (σ_0² + ω²).

Combining this information with Equations (1) and (3) leads to
s_∨(f, y) = log σ + (1/(2σ²)) E( (X − μ)² | Y = y ) + (1/2) log 2π.

By construction, we have E( (X − μ)² | Y = y ) = (ȳ − μ)² + Var(X | Y = y), which yields Eqn. (5). This means that, to obtain the corrected score value, we can first compute ȳ as the best estimator of the unobserved X and then use it in the corrected score, which depends on y only through ȳ.

C Proof of Equation (8)
To compute the corrected CRPS, one needs to calculate the conditional expectation of c_0(f, X) under the distribution [X|Y = y]. We first compute the expectation E(c_0(f, X)) for a Gaussian X and then substitute the conditional distribution of X given Y = y, with mean a = ȳ and variance b² = σ_0²ω²/(σ_0² + ω²). If X follows a normal distribution with mean a and variance b², that is, X = a + bZ with Z a standard normal random variable, the expectation of (7) can be computed term by term. We apply Stein's lemma (Stein, 1981), which states that E[h'(Z)] = E[Z h(Z)] for a suitable continuous function h, because Z is a standard normal random variable. With the notations t = b²/(2σ²) and λ = (a − μ)/σ, the quantity ((a − μ + bZ)/b)², where Z has a standard normal distribution, follows a noncentral chi-squared distribution with one degree of freedom and noncentrality parameter ((a − μ)/b)², whose moment generating function is known. Collecting the terms gives E(c_0(f, X)) for X ∼ N(a, b²). The expression of (8) is then obtained by substituting a = ȳ and b² = σ_0²ω²/(σ_0² + ω²), which gives c_∨(f, y) = E(c_0(f, X)|Y = y).

E Proof of Equation (13)
Under model (B), the conjugacy between the Gamma prior on X and the inverse Gamma error yields a Gamma conditional distribution [X|Y = y], with shape α_0 + a and rate β_0 + b/y, so that E(log X | Y = y) = ψ(α_0 + a) − log(β_0 + b/y) and E(X | Y = y) = (α_0 + a)/(β_0 + b/y). It follows that the proposed corrected score is

s_∨(f, y) = −α log β + log Γ(α) − (α − 1) ( ψ(α_0 + a) − log(β_0 + b/y) ) + β (α_0 + a)/(β_0 + b/y),

where ψ(x) represents the digamma function defined as the logarithmic derivative of the Gamma function, namely, ψ(x) = d log Γ(x)/dx.

G Additional results: distance between score distributions
In order to further study the impact of imperfect verification data and to take full advantage of the score distributions, we compute the Wasserstein distance (Muskulus and Verduyn-Lunel, 2011; Santambrogio, 2015; Robin et al., 2017) between several score distributions and compare it with the commonly used score average. In particular, through their full distributions we investigate the discriminative skills of scores compared with the use of their mean only. The p-Wasserstein distance between two probability measures P and Q on R with finite p-th moments is given by

W_p(P, Q) = ( inf_{γ ∈ Γ(P,Q)} ∫ |x − y|^p dγ(x, y) )^{1/p},

where Γ(P, Q) is the set of all joint probability measures on R × R whose marginals are P and Q. In the one-dimensional case considered here, the Wasserstein distance can be computed as

W_p(F, G) = ( ∫_0^1 |F^{-1}(u) − G^{-1}(u)|^p du )^{1/p},

with F and G the cdfs of P and Q to be compared, F^{-1} and G^{-1} their generalized inverses (or quantile functions), and in our case p = 1. The R-package transport (Schuhmacher et al., 2020) is used to compute Wasserstein distances.

Figure 7: Left: Mean log-score for an imperfect forecast f minus the mean log-score of the perfect forecast f_0 when evaluated against perfect data X; the imperfect forecast f ∼ N(μ, σ) has varying mean μ (x-axis) and varying standard deviation σ (y-axis). Right: Relative difference between E(s_0(f, X)) and E(s_0(f, Y)).

Figure 7 shows the mean of the log-score minus its minimum E(s_0(f_0, X)) and the relative difference between the ideal mean log-score and the mean log-score evaluated against imperfect verification data. One can first observe the flatness of the mean log-score around its minimum, indicating a lack of discriminative skill of the score mean when comparing several forecasts. Secondly, the discrepancy between the scores evaluated against perfect and imperfect verification data indicates the effect of error-prone verification data, as discussed earlier. Figure 8 shows the Wasserstein distance between the distributions of three scores evaluated in different contexts (a perfect forecast and an imperfect forecast, true and error-prone verification data). Three different log-scores are considered: the ideal log-score s_0(., X), the log-score used in practice s_0(., Y) and the proposed corrected score s_∨(., Y). One can notice that the Wasserstein distances exhibit stronger gradients (contour lines are narrower) than the mean log-score in Figure 7. In particular, the surfaces delimited by a given contour level are smaller for the proposed score than for the other scores; for instance, the area inside the contour of level z = 0.1 is larger for the mean log-score in Figure 7 than for the Wasserstein distance between score distributions. This indicates that considering the entire score distribution has the potential to improve the discriminative skills of scoring procedures. In particular, imperfect forecasts departing from the perfect forecast will be more sharply discriminated with the Wasserstein distance computed on score distributions.
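For equally weighted samples of the same size, the empirical 1-Wasserstein distance between two score distributions reduces to the mean absolute difference of the sorted samples; this should coincide with the value returned by the transport package for p = 1. A self-contained R sketch with illustrative parameters:

## Sketch: empirical 1-Wasserstein distance between two equal-size score samples,
## using the quantile (sorted-sample) formulation above.
wasserstein1_emp <- function(a, b) mean(abs(sort(a) - sort(b)))

# Toy example: distance between log-score samples of a perfect and an imperfect forecast
mu0 <- 0; sigma0 <- 2
x <- rnorm(1e5, mu0, sigma0)                                         # perfect verification data X
s_perfect   <- log(sigma0) + (x - mu0)^2 / (2 * sigma0^2) + 0.5 * log(2 * pi)
s_imperfect <- log(3)      + (x - 1)^2   / (2 * 3^2)      + 0.5 * log(2 * pi)
wasserstein1_emp(s_perfect, s_imperfect)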
Finally, when considering the Wasserstein distances associated with the score evaluated on imperfect verification data, the minimum of the distances (indicated by white crosses '×' on the central and right panels) is close to the 'true' minimum (intersection of x = μ_0 and y = σ_0). This indicates some robustness of the Wasserstein distance between the score distributions when errors are present in the verification data. Similar results are obtained for the CRPS and are not reported here. As stated earlier, developing metrics to express the discriminative skills of a score is beyond the scope of this work.

Figure 8: Wasserstein distances W_1 between log-score distributions evaluated at the perfect forecast f_0 and at an imperfect forecast f with varying predictive mean μ (x-axis) and varying predictive standard deviation σ (y-axis). From left to right, the log-scores are evaluated against the hidden true state X via s_0(f, X), against Y tainted by observational noise of level ω² = 1 via s_0(f, Y), and through the corrected log-score version s_∨(f, Y). The verification data X and the perfect forecast f_0 are distributed according to f_0 ∼ N(0, 4). On the central and right surfaces, the white cross '×' indicates the numerical minimum of each surface.