https://doi.org/10.5194/ascmo-11-73-2025
17 Mar 2025

On inference of boxplot symbolic data: applications in climatology

Abdolnasser Sadeghkhani and Ali Sadeghkhani
Abstract

This paper presents a pioneering study on the inference of boxplot-valued data using both Bayesian and frequentist approaches within a multivariate framework. This approach leverages complex yet intuitive representations to make large datasets more manageable and enhance their interpretability, which is invaluable in the age of big data. Boxplot-valued data are particularly important due to their ability to capture the inherent variability and distributional characteristics of complex datasets.

In our study, we propose novel methodologies for parameter estimation and density estimation for boxplot-valued data and apply these techniques to climatological data. Specifically, we utilize data from the Berkeley Earth Surface Temperature Study, which aggregates 1.6 billion temperature reports from 16 pre-existing archives affiliated with the Lawrence Berkeley National Laboratory. Our methods are validated through extensive simulations comparing the efficiency and accuracy of Bayesian and frequentist estimators.

We demonstrate the practical applicability of our approach by analyzing summer average temperatures across various European countries. The proposed techniques provide robust tools for analyzing complex data structures, offering valuable insights into climatic trends and variations. Our study highlights the advantages and limitations of each inferential method, offering guidance for future research and applications in the field of climatology.

1 Introduction

Data analysts now encounter increasingly large datasets, often spanning decades or even centuries, whose analysis presents a complex challenge that demands closer attention. In many instances, complex and vast information cannot, due to its intrinsic structure, be adequately represented as single-valued data. As a result, symbolic data analysis (SDA) has emerged as a crucial approach, integrating elements of data science, multivariate analysis, pattern recognition, data mining, and artificial intelligence to analyze such data properly without losing information. Initially conceptualized by Diday (1988) and later formalized by Billard and Diday (2003), SDA represents a significant advancement in managing, reducing, and interpreting large datasets.

The initial steps of summarizing data to facilitate a clustering process applied to a large database appeared in the work of Diday and Noirhomme-Fraiture (2008), which, by introducing the cluster as a category, highlighted the importance of data summarization using SDA. SDA manages data effectively without the loss of information that is common in traditional analyses, and it encompasses various data formats, including intervals, sets, lists, histograms, trees, boxplots, and other distributional representations. By employing these complex yet intuitive representations, SDA not only makes vast amounts of data more manageable but also enhances their comprehensibility, proving to be an invaluable tool in the age of big data.

The importance of boxplot-valued data in real-world applications is particularly evident in climate science, where data often span long periods and involve significant uncertainty. For example, long-term datasets such as historical temperature records or remote sensing data generate vast volumes of observations that are often summarized as medians, quartiles, and extreme values for computational efficiency. These boxplot-valued summaries enable researchers to analyze trends in temperature variability, extreme events, or regional shifts in central tendencies, all of which are critical for understanding climate dynamics and making policy decisions.

By expanding data and converting their form from single-valued to symbolic (e.g., multivalued data such as list data, model-valued data such as histograms, or interval-valued data), the need for further theoretical development of SDA in each of its types has emerged. Diday (1995), Émilion (1997), and Diday (1988) provided the foundational mathematical background for various forms of symbolic data.

Among symbolic data types, interval-valued data have received the most attention, owing to their simplicity, versatility, and prevalence in the applied sciences. These data types, ranging from univariate to multivariate, address the needs of real-world applications. In recent years, both frequentist and Bayesian approaches to interval-valued data estimators have been explored (e.g., Xu and Qin, 2022; Samadi et al., 2024; Sadeghkhani and Sadeghkhani, 2024). Additionally, various statistical tools for investigating relationships within interval-valued data have been developed. For instance, principal component analysis (PCA) using vertices to represent intervals was proposed by Douzal-Chouakria et al. (2011); a regression method that avoids the center and range approach was introduced by Billard and Diday (2000) and developed further by Neto and De Carvalho (2008, 2010); and interval-valued time series methods were implemented by Xiong et al. (2015).

Beyond interval-valued data, boxplot-valued data provide additional distributional insights by including medians, quartiles, and extremes, making them well suited to summarizing massive datasets. In climate science, examples include datasets where raw observations (e.g., hourly or daily temperatures) are summarized into annual or monthly boxplots for computational feasibility. Historical datasets from pre-digital eras often come as summarized statistics (e.g., medians and extremes), making boxplot-valued data analysis indispensable. Such datasets allow for the study of trends in variability and centrality, which are critical for analyzing long-term climate change impacts.

The relevance of boxplot-valued data is also evident in studies of regional variability, where data collection is aggregated to minimize measurement noise. For example, climate model datasets often summarize temperature and precipitation data across spatial grids, which could naturally be represented and analyzed as boxplot-valued data. This approach reduces computational complexity while preserving the essential distributional features of the data.

Through this work, we aim to develop a methodological framework for analyzing boxplot-valued data and providing tools that are computationally efficient and applicable to a wide range of fields, including but not limited to climate science. By addressing the unique challenges posed by such data, we contribute to the broader goal of managing and interpreting massive datasets in the applied sciences.

While interval-valued data capture the range of variability, boxplot-valued data provide a richer summary by including additional distributional characteristics such as quartiles and the median. Boxplots are widely used across fields for comparing distributions between groups, detecting outliers, and understanding the spread and symmetry of data, owing to their simplicity and effectiveness in summarizing complex datasets. The concept of boxplots was first proposed by Tukey (1977) and further developed by Benjamini (1988); however, the mathematical foundation for statistical inference using boxplot-valued data has not been assessed thoroughly. The importance of data visualization was further emphasized by Chambers (2018), who explored graphical methods for data analysis, while Wickham and Wickham (2016) focused on modern approaches to creating boxplots using the ggplot2 package in R, highlighting practical aspects of their implementation.

To compare the role of boxplot-valued data, Arroyo et al. (2006) contrasted these types of data with interval-valued data and histograms. Boxplot variables serve as an intermediate point between the simplicity of interval variables and the detailed information provided by histogram variables. While interval variables do not convey information about the central area of an empirical distribution, boxplot variables do so using three quartiles. In contrast, histogram variables offer detailed insights into the empirical distribution, though their structure is more complex, requiring a set of consecutive intervals with associated weights. Despite their simpler structure, boxplot variables effectively capture the shape of the distribution.

To the best of our knowledge, there are no specific mathematical considerations for boxplot-valued data parameter estimation in the context of symbolic data, whether from a frequentist or Bayesian perspective. However, Reyes et al. (2024) proposed a parameterized regression method for boxplot-valued data by applying a Box–Cox transformation. Additionally, Reyes et al. (2022) focused on forecasting time series of these types of data.

Let Xi be a boxplot-valued observation summarized by its five-number summary statistics: the minimum (ai), first quartile (q1i), median (mi), third quartile (q3i), and maximum (bi). This structure effectively captures the distributional characteristics of the data for each observation. To facilitate more precise parameter estimation, we decompose Xi into two components, i.e., Qi=(q1i,mi,q3i), which focuses on the central tendency and the shape of the distribution, and Ri=bi-ai, which represents the range, thus capturing the overall spread of the data. By separating these components, we are able to model the central distribution and variability independently, thereby enhancing the accuracy of our parameter estimates.

While Le-Rademacher and Billard (2011) focused on univariate interval-valued data, modeling the midpoints in a frequentist framework, we extend their methodology to handle boxplot-valued data. They assumed that the midpoints of intervals are normally distributed and derived maximum likelihood estimators for the mean and variance. By assuming a uniform distribution within each interval, we leverage the properties outlined by Le-Rademacher and Billard to justify the normality of Qi=(q1i,mi,q3i), where, for instance, q1i is the midpoint between ai and mi=q2i.

However, it is important to note that this extension from univariate intervals to multivariate quartiles relies on specific assumptions. These include the assumption that the quartiles are derived from sufficiently large datasets where the central limit theorem (CLT) ensures the approximation of normality. Additionally, we assume that the data within the intervals are uniformly distributed. While this assumption is commonly employed in interval-valued data analysis, deviations from uniformity or smaller datasets could impact the validity of the multivariate normal model. In such cases, a larger Qi dataset would be required to approximate normality using the CLT.

In our approach, we model Qi as following a trivariate normal distribution 𝒩3(μ,Σ), where μ=(μq1,μm,μq3) is the mean vector and Σ is the covariance matrix. This extension allows us to capture more detailed distributional characteristics inherent in boxplot data, offering a balance between theoretical rigor and practical applicability.

Therefore, we have

(1) $f(Q_i;\mu,\Sigma)=\dfrac{1}{(2\pi)^{3/2}|\Sigma|^{1/2}}\exp\left[-\tfrac{1}{2}(Q_i-\mu)^\top\Sigma^{-1}(Q_i-\mu)\right]$

and

(2) $f(R_i;R_{\min},R_{\max})=\dfrac{1}{R_{\max}-R_{\min}}\quad\text{for } R_{\min}\le R_i\le R_{\max},$

where we assume that Qi (the quantiles) and Ri (the range) are conditionally independent under the assumption that the internal distribution (quantiles) of the interval is not directly driven by the overall width of the interval.

It is worth noting that, in our model, the range Ri=bi-ai is assumed to follow a uniform distribution over the interval [Rmin,Rmax]. This assumption is commonly employed in the context of interval-valued data analysis (e.g., Neto and De Carvalho, 2008; Zhao et al., 2023). The uniform distribution is chosen due to its simplicity and practical effectiveness, particularly when there is no prior information suggesting a different underlying distribution. The bounds Rmin and Rmax are determined empirically based on the observed data, ensuring that they capture the full range of variability in our dataset.

Note that, by introducing the scaling parameter λ, the bounds ai and bi are related to the quantiles q1i and q3i as

(3) $a_i=q_{1i}-\lambda\left(1-\dfrac{\mathrm{IQR}_i}{R_i}\right)R_i,$
(4) $b_i=q_{3i}+(1-\lambda)\left(1-\dfrac{\mathrm{IQR}_i}{R_i}\right)R_i,$

where IQRi = q3i − q1i is the interquartile range.
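
As a quick illustration of Eqs. (3)–(4), the following R sketch (with hypothetical quartile, range, and λ values) reconstructs the bounds of a single boxplot; note that b − a recovers the range R.

```r
# Reconstruct boxplot bounds from quartiles, range, and the scaling
# parameter lambda (Eqs. 3-4); all numeric values are illustrative.
q1 <- 12.0; q3 <- 18.0   # first and third quartiles
R  <- 10.0               # range R = b - a
lambda <- 0.5            # scaling parameter

iqr <- q3 - q1
a <- q1 - lambda       * (1 - iqr / R) * R   # Eq. (3)
b <- q3 + (1 - lambda) * (1 - iqr / R) * R   # Eq. (4)
c(a = a, b = b, range = b - a)               # range = 10, as expected
```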

The rest of the paper is organized as follows. In Sect. 2, we elaborate on the maximum likelihood (ML) method for estimating the unknown parameters of a p-variate boxplot-valued random variable where p≥1. This section also includes a simulation to assess this method as well as the asymptotic distribution of the ML estimators. Section 3 deals with Bayesian estimation of the parameters and evaluates the proposed methods. Section 4 investigates the methods of density estimation in boxplot-valued data, ranging from plugin types to posterior predictive density estimators. In Sect. 5, we illustrate the practical utility and effectiveness of the proposed techniques in analyzing and interpreting complex boxplot-valued environmental data. Finally, we conclude with a discussion in Sect. 6.

2 Maximum likelihood estimators

The likelihood function of the unknown parameters μ, Σ, Rmin, Rmax, and λ, based on a random sample of size n, Xi=[ai,q1i,mi,q3i,bi] for i=1,…,n, and assuming that Qi and Ri in Eqs. (1) and (2) are conditionally independent despite their inherent relationship, is given by

(5) $L(\mu,\Sigma,R_{\min},R_{\max},\lambda)=\prod_{i=1}^{n} f(Q_i;\mu,\Sigma)\, f(R_i;R_{\min},R_{\max}).$

It can easily be seen that the ML estimators of μ are the sample means of q1i, mi, and q3i, i.e.,

(6) $\hat{\mu}_{q_1}=\frac{1}{n}\sum_{i=1}^{n}q_{1i},\quad \hat{\mu}_{m}=\frac{1}{n}\sum_{i=1}^{n}m_{i},\quad \hat{\mu}_{q_3}=\frac{1}{n}\sum_{i=1}^{n}q_{3i}.$

The ML estimator of Σ is the sample covariance matrix

$\hat{\Sigma}=\frac{1}{n-1}\sum_{i=1}^{n}(Q_i-\hat{\mu})(Q_i-\hat{\mu})^\top,$

while the ML estimators for Rmin and Rmax are

$\hat{R}_{\min}=\min_{i=1,\dots,n}R_i,\quad \hat{R}_{\max}=\max_{i=1,\dots,n}R_i.$

The scaling parameter λ is estimated by minimizing an objective function. That is,

$\hat{\lambda}=\arg\min_{\lambda}\sum_{i=1}^{n}\left[R_i-\left(q_{3i}+(1-\lambda)\left(1-\frac{\mathrm{IQR}_i}{R_i}\right)R_i\right)\right]^2.$

Our assumption of conditional independence between Ri and Qi is made to simplify the estimation process, recognizing that Ri and Qi are functionally related via the quartiles and the scaling parameter λ. However, by assuming that λ is constant across observations, we reduce the direct influence of Qi on Ri, making this approximation reasonable in practice.

The primary reason we estimate λ by minimizing an objective function is to directly assess how well the model captures the relationship between the observed ranges Ri and the estimated upper quantiles q3i. Furthermore, while maximum likelihood estimation (MLE) is a robust method for parameter estimation, it relies on certain assumptions about the distributional properties of the data; minimizing the objective function allows greater flexibility in modeling this relationship without imposing strict distributional assumptions. Future work could explore models that explicitly incorporate this dependency, potentially enhancing the accuracy of the parameter estimates.
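
All of the estimators above are available in closed form except λ̂, which requires a one-dimensional optimization. The following is a minimal R sketch for the univariate case (p = 1); the data are simulated purely for illustration.

```r
# ML estimation for univariate boxplot-valued data (Sect. 2): Q is an
# n x 3 matrix of (q1, m, q3) triples and R the vector of ranges.
set.seed(1)
n <- 100
Q <- cbind(q1 = rnorm(n, 5), m = rnorm(n, 10), q3 = rnorm(n, 15))
R <- runif(n, min = 5, max = 15)

mu_hat    <- colMeans(Q)   # Eq. (6): componentwise sample means
Sigma_hat <- cov(Q)        # sample covariance matrix (denominator n - 1)
Rmin_hat  <- min(R)        # ML estimators of the range bounds
Rmax_hat  <- max(R)

# Scaling parameter: minimize the least-squares objective over lambda
iqr <- Q[, "q3"] - Q[, "q1"]
obj <- function(lambda)
  sum((R - (Q[, "q3"] + (1 - lambda) * (1 - iqr / R) * R))^2)
lambda_hat <- optimize(obj, interval = c(0, 1))$minimum
```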

In the next subsection, we generalize the univariate X into the p-variate Xi=[Xi1,Xi2,,Xip] with i=1,,n for p>1, and we focus on estimating the multivariate boxplot data.

2.1 Multivariate ML estimators

Consider a p-variate boxplot-valued random variable Xi=[Xi1,Xi2,,Xip], where each Xij=[aij,q1ij,mij,q3ij,bij] for i=1,,n and j=1,,p. Moreover, we assume that

$Q_i=(Q_{i1},\dots,Q_{ip})^\top\sim\mathcal{N}_{3p}(\mu,\Sigma),$

such that

$Q_{ij}=(q_{1ij},m_{ij},q_{3ij})^\top\sim\mathcal{N}_{3}(\mu_j,\Sigma_j)\quad\text{for } i=1,\dots,n,\ j=1,\dots,p,$

where μ=(μ1,,μp) is the mean vector with μj=(μq1j,μmj,μq3j) and

$\Sigma=\begin{pmatrix}\Sigma_{1,1}&\Sigma_{1,2}&\cdots&\Sigma_{1,p}\\ \Sigma_{2,1}&\Sigma_{2,2}&\cdots&\Sigma_{2,p}\\ \vdots&\vdots&\ddots&\vdots\\ \Sigma_{p,1}&\Sigma_{p,2}&\cdots&\Sigma_{p,p}\end{pmatrix}$

is the covariance matrix, with Σk,l representing the covariance between Qk and Ql for k,l=1,,p. Moreover, the range Rij=bij-aij is uniformly distributed over [Rmin,j,Rmax,j].

In a similar fashion to the univariate case with p=1, to estimate the parameters in a multivariate setting, we maximize the likelihood function

(7) $L(\mu,\Sigma,R_{\min},R_{\max},\lambda)=\prod_{i=1}^{n} f(Q_i;\mu,\Sigma)\prod_{j=1}^{p} f(R_{ij};R_{\min,j},R_{\max,j}),$

where f(Qi;μ,Σ) represents the multivariate normal density function and f(Rij;Rmin,j,Rmax,j) is the density function of the uniform distribution for each component Rij, with Rmin and Rmax denoting the vectors of the minimum and maximum values for each component j of the range.

This results in the ML estimator of the parameters as presented in the following theorem.

Theorem 1. Consider a p-variate boxplot-valued random variable Xi=[Xi1,Xi2,…,Xip], where Xij=[aij,q1ij,mij,q3ij,bij] and μj=(μq1j,μmj,μq3j). Suppose that

$Q_i=(Q_{i1},\dots,Q_{ip})^\top\sim\mathcal{N}_{3p}(\mu,\Sigma),$

with Qij=(q1ij,mij,q3ij) for i=1,,n and j=1,,p. The ML estimators of the mean vector μ^=(μ^1,,μ^p), where μ^j=(μ^q1j,μ^mj,μ^q3j), are given by

(8) $\hat{\mu}_{q_1j}=\frac{1}{n}\sum_{i=1}^{n}q_{1ij},$
(9) $\hat{\mu}_{mj}=\frac{1}{n}\sum_{i=1}^{n}m_{ij},$
(10) $\hat{\mu}_{q_3j}=\frac{1}{n}\sum_{i=1}^{n}q_{3ij}.$

The ML estimator of the covariance matrix Σ is

(11) $\hat{\Sigma}=\frac{1}{n-1}\sum_{i=1}^{n}(Q_i-\hat{\mu})(Q_i-\hat{\mu})^\top.$

The ML estimators of the range parameters are

(12) $\hat{R}_{\min,j}=\min_{i=1,\dots,n}R_{ij},$
(13) $\hat{R}_{\max,j}=\max_{i=1,\dots,n}R_{ij},$

and the ML estimator of the scaling parameter λj is

(14) $\hat{\lambda}_j=\arg\min_{\lambda_j}\sum_{i=1}^{n}\left[R_{ij}-\left(q_{3ij}+(1-\lambda_j)\left(1-\frac{q_{3ij}-q_{1ij}}{R_{ij}}\right)R_{ij}\right)\right]^2.$



Proof. The proof involves straightforward maximization of the likelihood function. For brevity, it is omitted here.

The following subsection details the simulation setup for the proposed MLE method for symbolic boxplot-valued data.

2.2 Simulation study of ML estimators

To demonstrate the estimation procedure, we simulate data for p=3 variables with a sample size of n=100 and the following true parameters:

(15) $\mu_1=(5,10,15),\quad \mu_2=(4,8,12),\quad \mu_3=(6,11,16),$

(16) $\Sigma_{1,1}=\begin{pmatrix}2&1&1\\1&2&1\\1&1&2\end{pmatrix},\quad \Sigma_{1,2}=\begin{pmatrix}1.5&0.8&0.8\\0.8&1.5&0.8\\0.8&0.8&1.5\end{pmatrix},\quad \Sigma_{1,3}=\begin{pmatrix}2.5&1.2&1.2\\1.2&2.5&1.2\\1.2&1.2&2.5\end{pmatrix},$
$\Sigma_{2,2}=\begin{pmatrix}1.2&0.5&0.5\\0.5&1.2&0.5\\0.5&0.5&1.2\end{pmatrix},\quad \Sigma_{2,3}=\begin{pmatrix}1.8&0.9&0.9\\0.9&1.8&0.9\\0.9&0.9&1.8\end{pmatrix},\quad \Sigma_{3,3}=\begin{pmatrix}2&0.9&0.9\\0.9&2&0.9\\0.9&0.9&2\end{pmatrix},$

(17) $R_{\min,j}=5,\quad R_{\max,j}=15,\quad j=1,2,3.$

The true values for the scaling parameters are λ1=0.48, λ2=0.49, and λ3=0.51.

In the simulation, we generate the components Qi and Ri separately to simplify the estimation. Specifically, Qi=(q1i,mi,q3i) is generated from a trivariate normal distribution 𝒩3(μ,Σ) as outlined in Eq. (1), and Ri=bi-ai is assumed to follow a uniform distribution over [Rmin,Rmax], as given in Eq. (2). This approach enables independent handling of each component's characteristics.
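
This data-generating step can be sketched in R as follows, assuming the MASS package for multivariate normal sampling. For simplicity, each variable's triple Qij is drawn from its own trivariate normal using only the diagonal blocks Σj,j of Eq. (16); the cross-variable blocks are omitted in this sketch.

```r
# Simulate quantile triples and ranges for p = 3 boxplot variables.
library(MASS)   # mvrnorm()
set.seed(42)
n <- 100

blk <- function(d, o) { M <- matrix(o, 3, 3); diag(M) <- d; M }  # 3 x 3 block
mu_list    <- list(c(5, 10, 15), c(4, 8, 12), c(6, 11, 16))      # Eq. (15)
Sigma_list <- list(blk(2, 1), blk(1.2, 0.5), blk(2, 0.9))        # Sigma_{j,j}

Q_list <- lapply(1:3, function(j) mvrnorm(n, mu_list[[j]], Sigma_list[[j]]))
R_mat  <- replicate(3, runif(n, 5, 15))   # ranges, uniform on [5, 15]
```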

The mean squared error (MSE) is computed for the mean vector μ and the covariance matrices Σj,k for j,k=1,2,3, as follows:

$\mathrm{MSE}(\hat{\mu}_j)=\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\mu}_j^{(i)}-\mu_j\right)^2,\quad j=1,2,3,$

and

(18) $\mathrm{MSE}(\hat{\Sigma}_{j,k})=\frac{1}{n}\sum_{i=1}^{n}\sum_{a=1}^{3}\sum_{b=1}^{3}\left(\hat{\Sigma}_{j,k}^{(i)}[a,b]-\Sigma_{j,k}[a,b]\right)^2,$

where Σ^j,k(i)[a,b] and Σj,k[a,b] represent the (a,b)th elements of the estimated and true covariance matrices, respectively, in the ith simulation.
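
A self-contained R sketch of this MSE computation for the mean estimator of the first variable, averaging squared errors over simulation replications (the replication count nsim = 500 is an illustrative choice):

```r
# MSE of the ML mean estimator for variable j = 1 over nsim replications;
# mu_1 and Sigma_{1,1} follow Eqs. (15)-(16).
library(MASS)
set.seed(7)
nsim <- 500; n <- 100
mu1 <- c(5, 10, 15)
S11 <- matrix(c(2, 1, 1,
                1, 2, 1,
                1, 1, 2), nrow = 3)

sq_err <- replicate(nsim, {
  Qrep <- mvrnorm(n, mu1, S11)      # one simulated sample
  sum((colMeans(Qrep) - mu1)^2)     # squared error of mu_hat_1
})
mse_mu1 <- mean(sq_err)
```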

Table 1 presents the MLE parameters along with their MSEs. Note that the MSEs for λj (j=1,2,3) are reported as “NA”, indicating no variation in these parameters across simulations.

Table 1. ML estimates and MSEs.

NA – no variation across simulations.


2.3 Asymptotic distribution of ML estimators for boxplot-valued data

In this subsection, we derive the asymptotic distribution of the ML estimators for the parameters of a boxplot-valued random variable.

Theorem 2. Let θ^=(μ^,Σ^,λ^) denote the ML estimators for the parameters θ=(μ,Σ,λ) of boxplot-valued data. Under regularity conditions (see, e.g., Ferguson, 2017), the asymptotic distribution of the ML estimators is given as

$\sqrt{n}\,(\hat{\theta}-\theta)\xrightarrow{d}\mathcal{N}_{(9p^2+11p)/2}\left(0,\mathcal{I}(\theta)^{-1}\right),$

where ℐ(θ) is the Fisher information matrix and is given by

$\mathcal{I}(\theta)=\begin{pmatrix}\mathcal{I}_\mu&0&0\\0&\mathcal{I}_\Sigma&0\\0&0&\mathcal{I}_\lambda\end{pmatrix}$

with

$\mathcal{I}_\mu=n\,I_{3p}\,\Sigma^{-1},\quad \mathcal{I}_\Sigma=\frac{n}{2}\left(\Sigma^{-1}\otimes\Sigma^{-1}\right),\quad \mathcal{I}_\lambda=n\,\mathrm{diag}\left(\frac{1}{\lambda_1^2},\frac{1}{\lambda_2^2},\dots,\frac{1}{\lambda_p^2}\right),$

where $I_{3p}$ is the 3p×3p identity matrix and $\otimes$ denotes the Kronecker product.

Proof. The Fisher information matrix is the expected value of the outer product of the score function $\ell'(\theta)=\partial\ell(\theta)/\partial\theta$ and is given as

$\mathcal{I}(\theta)=\mathbb{E}\left[\ell'(\theta)\,\ell'(\theta)^\top\right].$

For the multivariate normal distribution Qi ∼ 𝒩3p(μ,Σ), the score function for μ is

$\ell_\mu=\Sigma^{-1}\sum_{i=1}^{n}(Q_i-\mu),$

and hence

$\mathcal{I}_\mu=\mathbb{E}\left[\ell_\mu\,\ell_\mu^\top\right]=n\,I_{3p}\,\Sigma^{-1}.$

Similarly, for Σ, we have the score function

$\ell_\Sigma=\frac{1}{2}\left(\Sigma^{-1}-\Sigma^{-1}S\,\Sigma^{-1}\right),$

and therefore we have

$\mathcal{I}_\Sigma=\frac{n}{2}\,\mathbb{E}\left[\Sigma^{-1}(Q_i-\mu)(Q_i-\mu)^\top\Sigma^{-1}\otimes\Sigma^{-1}(Q_i-\mu)(Q_i-\mu)^\top\Sigma^{-1}\right].$

Since $\mathbb{E}\left[(Q_i-\mu)(Q_i-\mu)^\top\right]=\Sigma$, we can write

$\mathcal{I}_\Sigma=\frac{n}{2}\left(\Sigma^{-1}\otimes\Sigma^{-1}\right).$

For the uniform distribution Rij ∼ 𝒰(Rmin,j, Rmax,j), the Fisher information for the parameter λj (related to the range) is derived as

$\mathcal{I}_\lambda=n\,\mathrm{diag}\left(\frac{1}{\lambda_1^2},\frac{1}{\lambda_2^2},\dots,\frac{1}{\lambda_p^2}\right).$

This expression reflects the dependence of the Fisher information on the individual λj values.

Using the central limit theorem, the score function evaluated at the true parameter values θ is asymptotically normally distributed as

$\sqrt{n}\,\ell'(\theta)\xrightarrow{d}\mathcal{N}_{(9p^2+11p)/2}\left(0,\mathcal{I}(\theta)\right).$

The covariance of the score function is related to the Fisher information matrix by the information matrix equality

$\mathrm{Cov}\left(\ell'(\theta)\right)=\mathcal{I}(\theta).$

The dimension of the parameter vector θ=(μ,Σ,λ) is given by

$\dim(\theta)=3p+\frac{3p(3p+1)}{2}+p=\frac{9p^2+11p}{2}.$

This completes the proof.
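
In practice, Theorem 2 yields approximate standard errors: the asymptotic covariance of μ̂ is Σ/n, which can be estimated by Σ̂/n. A hedged R sketch with simulated data:

```r
# Approximate 95 % Wald intervals for the components of mu_hat, using the
# estimated asymptotic covariance Sigma_hat / n; data are illustrative.
set.seed(1)
n <- 100
Q <- cbind(q1 = rnorm(n, 5), m = rnorm(n, 10), q3 = rnorm(n, 15))
mu_hat    <- colMeans(Q)
Sigma_hat <- cov(Q)
se_mu <- sqrt(diag(Sigma_hat) / n)
ci <- cbind(lower = mu_hat - 1.96 * se_mu,
            upper = mu_hat + 1.96 * se_mu)
```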

3 Bayesian estimation for boxplot-valued data

Sadeghkhani and Sadeghkhani (2024) studied the Bayesian inference of symbolic interval-valued data by introducing noninformative priors, including the Jeffreys prior, into the multivariate setting. In this section, we use informative priors for boxplot-valued data, which allows us to update our prior knowledge based on the observed data. We derive the posterior distributions and find the Bayes estimators under the squared error loss (SEL) criterion.

3.1 Likelihood and priors

Consider the model where Q1,Q2,,Qn are independent and identically distributed (iid) from 𝒩3p(μ,Σ) with additional parameters Rminj, Rmaxj, and λj for j=1,2,,p included in the likelihood function L(μ,Σ,{Rminj},{Rmaxj},{λj}Q).

We assume the following priors for the parameters:

(19) $\mu\mid\Sigma\sim\mathcal{N}_{3p}\left(\mu_0,\ \Sigma/\kappa\right),$
(20) $\Sigma\sim\mathcal{W}^{-1}_{3p}\left(\nu_0,\ \Psi_0\right),$

where μ0 is the prior mean, κ is the scaling factor, ν0 is the degrees of freedom, and Ψ0 is the scale matrix of the inverse Wishart distribution. Additionally, we have

(21) $R_{\min j}\sim\mathcal{U}(a_j,b_j),\quad R_{\max j}\sim\mathcal{U}(c_j,d_j),\quad \lambda_j\sim\mathcal{B}(\alpha_j,\beta_j),$

where aj,bj,cj, and dj are the bounds of the uniform distributions and αj and βj are the parameters of the Beta distribution for each j.

Next, we find the posterior distributions of the parameters in the following lemma.

Lemma 1 (posterior distributions). Under the given priors in Eqs. (19), (20), and (21) and the observed data Q1,Q2,,Qn, we have the following.

  • i.

    The posterior distribution of Σ is given by

    (22) $\Sigma\mid Q\sim\mathcal{W}^{-1}\left(\nu_0+n,\ \Psi_0+S_n\right),$

    where

    $S_n=S+\frac{n\kappa}{n+\kappa}(\bar{Q}-\mu_0)(\bar{Q}-\mu_0)^\top$

    and

    $S=\sum_{i=1}^{n}(Q_i-\bar{Q})(Q_i-\bar{Q})^\top,\quad \bar{Q}=\frac{1}{n}\sum_{i=1}^{n}Q_i.$
  • ii.

    The posterior distribution of μ is

    (23) $\mu\mid Q\sim T_{3p}\left(\mu_n,\ \frac{\Psi_0+S_n}{\kappa(n+\kappa)}\left(1+\frac{1}{n}\right),\ \nu_0+n-p+1\right)$

    with

    $\mu_n=\frac{n\bar{Q}+\kappa\mu_0}{n+\kappa},$

    where Tp(m,A,ν) represents a multivariate Student's t distribution with mean vector m, variance matrix A, and ν degrees of freedom.

  • iii.

    The posterior distribution of Rminj is given by

    $R_{\min j}\mid Q\sim\mathcal{U}\left(\max(a_j,R_{\min j,\mathrm{obs}}),\ b_j\right),$

    where Rminj,obs is the minimum observed Rminj.

  • iv.

    The posterior distribution of Rmaxj is given by

    $R_{\max j}\mid Q\sim\mathcal{U}\left(c_j,\ \min(d_j,R_{\max j,\mathrm{obs}})\right),$

    where Rmaxj,obs is the maximum observed Rmaxj.

  • v.

    The posterior distribution of λj is given by

    $\lambda_j\mid Q\sim\mathcal{B}(\alpha_j+n,\ \beta_j+n).$

Proof. According to Bayes' rule, the posterior distributions are derived as follows:

  • i.

    The posterior of Σ is given by

    $\pi(\Sigma\mid Q)\propto L(\mu,\Sigma,\{R_{\min j}\},\{R_{\max j}\},\{\lambda_j\}\mid Q)\,\pi(\Sigma)$
    $\propto|\Sigma|^{-n/2}\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(Q_i-\mu)^\top\Sigma^{-1}(Q_i-\mu)\right]\times|\Sigma|^{-(\nu_0+3p+1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}\left(\Psi_0\Sigma^{-1}\right)\right]$
    $\propto|\Sigma|^{-(\nu_0+n+3p+1)/2}\exp\left[-\frac{1}{2}\mathrm{tr}\left((\Psi_0+S_n)\Sigma^{-1}\right)\right],$

    which is the kernel of an inverse Wishart distribution:

    $\Sigma\mid Q\sim\mathcal{W}^{-1}(\nu_0+n,\ \Psi_0+S_n).$
  • ii.

    The posterior of μ is given by

    $\pi(\mu\mid Q,\Sigma)\propto L(\mu,\Sigma\mid Q)\,\pi(\mu\mid\Sigma)$
    $\propto\exp\left[-\frac{1}{2}\sum_{i=1}^{n}(Q_i-\mu)^\top\Sigma^{-1}(Q_i-\mu)\right]\times\exp\left[-\frac{1}{2}(\mu-\mu_0)^\top\kappa\Sigma^{-1}(\mu-\mu_0)\right]$
    $\propto\exp\left[-\frac{1}{2}(\mu-\mu_n)^\top(n+\kappa)\Sigma^{-1}(\mu-\mu_n)\right],$

    which results in a multivariate normal distribution for μ given Σ, i.e.,

    $\mu\mid Q,\Sigma\sim\mathcal{N}_{3p}\left(\mu_n,\ \frac{\Sigma}{n+\kappa}\right).$

    By integrating out Σ, we have

    $\mu\mid Q\sim T_{3p}\left(\mu_n,\ \frac{\Psi_0+S_n}{\kappa(n+\kappa)}\left(1+\frac{1}{n}\right),\ \nu_0+n-p+1\right).$
  • iii.

    For Rminj, combining the likelihood with the uniform prior results in the posterior distribution

    $R_{\min j}\mid Q\sim\mathcal{U}\left(\max(a_j,R_{\min j,\mathrm{obs}}),\ b_j\right).$
  • iv.

    For Rmaxj, combining the likelihood with the uniform prior results in the posterior distribution

    $R_{\max j}\mid Q\sim\mathcal{U}\left(c_j,\ \min(d_j,R_{\max j,\mathrm{obs}})\right).$
  • v.

    For λj, combining the likelihood with the Beta prior results in the posterior distribution

    $\lambda_j\mid Q\sim\mathcal{B}(\alpha_j+n,\ \beta_j+n).$

Next, Theorem 3 provides the Bayes estimator of unknown parameters of a boxplot-valued random variable.

Theorem 3 (Bayes estimators under SEL). Under the assumptions of Lemma 1, the Bayes estimators under SEL, which are the means of the marginal posteriors, are obtained as follows:

  • i.

    The Bayes estimator of μ is

    (24) $\hat{\mu}_{\mathrm{Bayes}}=\frac{n\bar{Q}+\kappa\mu_0}{n+\kappa}.$
  • ii.

    The Bayes estimator of Σ is

    $\hat{\Sigma}_{\mathrm{Bayes}}=\frac{\Psi_0+S_n}{\nu_0+n-3p-1},$

    where

    $S_n=S+\frac{n\kappa}{n+\kappa}(\bar{Q}-\mu_0)(\bar{Q}-\mu_0)^\top$

    and

    $S=\sum_{i=1}^{n}(Q_i-\bar{Q})(Q_i-\bar{Q})^\top,\quad \bar{Q}=\frac{1}{n}\sum_{i=1}^{n}Q_i.$
  • iii.

    The Bayes estimators of Rminj and Rmaxj are

    $\hat{R}_{\min j,\mathrm{Bayes}}=\frac{\max(a_j,R_{\min j,\mathrm{obs}})+b_j}{2},\quad \hat{R}_{\max j,\mathrm{Bayes}}=\frac{c_j+\min(d_j,R_{\max j,\mathrm{obs}})}{2},$

    where Rminj,obs and Rmaxj,obs are the minimum observed Rminj and maximum observed Rmaxj, respectively.

  • iv.

    The Bayes estimator of λj is

    $\hat{\lambda}_{\mathrm{Bayes}}=\frac{\alpha_j+n}{\alpha_j+\beta_j+2n}.$

Proof. The Bayes estimators are the means of the posterior distributions given in Lemma 1, and this completes the proof.
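
Because the posteriors are conjugate, the Bayes estimators of Theorem 3 can be computed directly, with no MCMC. A minimal R sketch for p = 1 (so 3p = 3), with illustrative data and hyperparameter values:

```r
# Closed-form Bayes estimators under SEL (Theorem 3) for p = 1.
set.seed(3)
n <- 100
Q <- cbind(rnorm(n, 5), rnorm(n, 10), rnorm(n, 15))   # quantile triples
mu0 <- rep(0, 3); kappa <- 0.001; nu0 <- 3; Psi0 <- diag(3)
alpha <- 1; beta <- 1                                  # Beta prior for lambda

Qbar <- colMeans(Q)
S    <- cov(Q) * (n - 1)                               # centered SSCP matrix
Sn   <- S + (n * kappa / (n + kappa)) * tcrossprod(Qbar - mu0)

mu_bayes     <- (n * Qbar + kappa * mu0) / (n + kappa)  # Eq. (24)
Sigma_bayes  <- (Psi0 + Sn) / (nu0 + n - 3 - 1)         # nu0 + n - 3p - 1
lambda_bayes <- (alpha + n) / (alpha + beta + 2 * n)
```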

3.2 Simulation study in the Bayesian setup

Similar to Sect. 2.2, we conduct a simulation study to demonstrate the Bayesian estimation procedure. The simulation is carried out with the same setup and true values as described there, i.e., for p=3 variables with a sample size of n=100 and the true parameters specified in Eqs. (15)–(17).

Here, we use noninformative priors to ensure that the data predominantly influence the parameter estimates. In this simulation, a normal prior distribution with mean vector 0 and covariance matrix 10²Ip is used for μ. An inverse Wishart prior distribution is employed for Σ, with scale matrix S0=Ip and degrees of freedom ν=p+1. For each λj, j=1,2,3, a uniform 𝒰(0,1) prior (i.e., B(αj=1, βj=1)) is used, which covers a broad range of values without imposing strong constraints.

The results of the Bayesian estimation, including the MSEs for each parameter, are summarized in Table 2.

Table 2. Bayesian estimates and MSEs.


In order to compare the efficiency of the proposed ML and Bayesian estimators for different sample sizes n, we use the same simulation settings as those in Sects. 2.2 and 3.2. Figure 1 illustrates the relative efficiency (RE), defined as the MSE of the ML estimator over the MSE of the Bayesian estimator for the parameters μ and Σ. It can be seen that, for smaller sample sizes n, the efficiency of the Bayesian estimators is generally better (RE > 1). As n increases, thanks to the consistency of the ML estimators, the ML estimators tend to outperform the Bayesian estimators (RE < 1).

Figure 1. Relative efficiency of Bayesian and ML estimators in the simulation study (Sects. 2.2 and 3.2).

4 Posterior predictive density estimator

Given the posterior distributions of the parameters μ, Σ, λj, Rminj, and Rmaxj derived in Lemma 1, we can obtain the posterior predictive density for a new observation Q*. Theorem 4 finds the posterior predictive density estimator for a future or new random variable Q*.

Theorem 4 (posterior predictive density estimator). The posterior predictive density for a new random variable Q* given the observable Q1,,Qn is as follows:

  • i.

    When λj is known, the posterior predictive density estimator is given by

    (25) $Q^*\mid Q_1,\dots,Q_n\sim T_{3p}\left(\frac{\kappa\mu_0+n\bar{Q}}{\kappa+n},\ \frac{(\kappa+n+1)\left(\Psi_0+S_n/\lambda_j\right)}{(\kappa+n)(\nu_0+n-3p+1)},\ \nu_0+n-3p+1\right).$
  • ii.

    When λj is unknown, the posterior predictive density is a mixture of multivariate Student's t distributions as given by

    (26) $p(Q^*\mid Q_1,\dots,Q_n)=\int T_{3p}\left(\frac{\kappa\mu_0+n\bar{Q}}{\kappa+n},\ \frac{(\kappa+n+1)\left(\Psi_0+\frac{1}{\lambda_j}S_n\right)}{(\kappa+n)(\nu_0+n-3p+1)},\ \nu_0+n-3p+1\right)\times p(\lambda_j\mid Q_1,\dots,Q_n)\,\mathrm{d}\lambda_j.$

Proof. To derive the posterior predictive density, we need to integrate the parameters μ, Σ, and λj from the joint posterior distribution:

  • i.

    When λj is known, from Lemma 1, the posterior distributions of μ and Σ are given by

    $\mu\mid\Sigma,Q\sim\mathcal{N}_{3p}\left(\frac{\kappa\mu_0+n\bar{Q}}{\kappa+n},\ \frac{\Sigma}{\kappa+n}\right),\quad \Sigma\mid Q\sim\mathcal{W}^{-1}_{3p}\left(\nu_0+n,\ \Psi_0+S_n\right),$

    and therefore the posterior predictive density is given by

    $p(Q^*\mid Q_1,\dots,Q_n)=\iint p(Q^*\mid\mu,\Sigma)\,p(\mu,\Sigma\mid Q_1,\dots,Q_n)\,\mathrm{d}\mu\,\mathrm{d}\Sigma.$

    Since p(μ,ΣQ1,,Qn) follows a normal inverse Wishart distribution, the resulting predictive density is a multivariate Student's t distribution:

    $Q^*\mid Q_1,\dots,Q_n\sim T_{3p}\left(\frac{\kappa\mu_0+n\bar{Q}}{\kappa+n},\ \frac{(\kappa+n+1)(\Psi_0+S_n)}{(\kappa+n)(\nu_0+n-3p+1)},\ \nu_0+n-3p+1\right).$
  • ii.

    When λj is unknown, the posterior distributions of μ, Σ, and λj are

    $\mu\mid\Sigma,\lambda_j,Q\sim\mathcal{N}_{3p}\left(\frac{\kappa\mu_0+n\bar{Q}}{\kappa+n},\ \frac{\Sigma}{\lambda_j(\kappa+n)}\right),\quad \Sigma\mid\lambda_j,Q\sim\mathcal{W}^{-1}_{3p}\left(\nu_0+n,\ \Psi_0+\frac{1}{\lambda_j}S_n\right),\quad \lambda_j\mid Q\sim p(\lambda_j\mid Q_1,\dots,Q_n).$

    Then, the posterior predictive density is given by

    $p(Q^*\mid Q_1,\dots,Q_n)=\iiint p(Q^*\mid\mu,\Sigma,\lambda_j)\,p(\mu,\Sigma,\lambda_j\mid Q_1,\dots,Q_n)\,\mathrm{d}\mu\,\mathrm{d}\Sigma\,\mathrm{d}\lambda_j,$

    and this completes the proof.
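
When λ is known (and absorbed into Sn), draws from the posterior predictive can also be generated by composition: sample Σ | Q, then μ | Σ, Q, then Q* | μ, Σ. A sketch assuming the MCMCpack and MASS packages; wrapping the last three lines in replicate() yields a Monte Carlo sample from the predictive density.

```r
# One composition draw from the posterior predictive of Theorem 4.
library(MCMCpack)  # riwish(): inverse Wishart sampler
library(MASS)      # mvrnorm()
set.seed(9)
n <- 100
Q <- cbind(rnorm(n, 5), rnorm(n, 10), rnorm(n, 15))   # simulated triples
mu0 <- rep(0, 3); kappa <- 0.001; nu0 <- 3; Psi0 <- diag(3)

Qbar <- colMeans(Q)
Sn   <- cov(Q) * (n - 1) + (n * kappa / (n + kappa)) * tcrossprod(Qbar - mu0)

Sigma_draw <- riwish(nu0 + n, Psi0 + Sn)                      # Sigma | Q
mu_draw    <- mvrnorm(1, (kappa * mu0 + n * Qbar) / (kappa + n),
                      Sigma_draw / (kappa + n))               # mu | Sigma, Q
Q_star     <- mvrnorm(1, mu_draw, Sigma_draw)                 # Q* | mu, Sigma
```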

Remark 1. In addition to the posterior predictive density derived in Theorem 4, two plugin density estimators can be considered, in which λ is treated as known or is replaced by its ML or Bayesian estimate.

  • i.

    ML plugin density estimator: by plugging in the ML estimators for μ, Σ, and λ from Theorem 1, the density estimator is given by

    $p(Q^*\mid\hat{\mu},\hat{\Sigma},\hat{\lambda})=\mathcal{N}_{3p}(\hat{\mu},\hat{\Sigma}).$
  • ii.

    Bayesian plugin density estimator: by plugging in the Bayesian estimators for μ, Σ, and λ from Theorem 3, the density estimator is

    $p(Q^*\mid\hat{\mu}_{\mathrm{Bayes}},\hat{\Sigma}_{\mathrm{Bayes}},\hat{\lambda}_{\mathrm{Bayes}})=\mathcal{N}_{3p}(\hat{\mu}_{\mathrm{Bayes}},\hat{\Sigma}_{\mathrm{Bayes}}).$

    Note that, in both cases, λ influences the range but not the normal distribution part.
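
Evaluating a plugin estimator then amounts to a multivariate normal density evaluation at a new quantile triple. A sketch assuming the mvtnorm package, with a hypothetical new observation Q*:

```r
# ML plugin density N_3(mu_hat, Sigma_hat) evaluated at a new triple.
library(mvtnorm)   # dmvnorm()
set.seed(1)
n <- 100
Q <- cbind(rnorm(n, 5), rnorm(n, 10), rnorm(n, 15))  # simulated triples
mu_hat    <- colMeans(Q)
Sigma_hat <- cov(Q)
Q_star <- c(5.2, 9.8, 15.1)                          # hypothetical new point
dmvnorm(Q_star, mean = mu_hat, sigma = Sigma_hat)    # plugin density value
```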

4.1 Comparison of density estimators using the expected Kullback–Leibler (KL) loss

The KL loss function measures the difference between two probability distributions. For a true density p(Q) and an estimated density q(Q), the KL loss is defined as

(27) $D_{\mathrm{KL}}(p\,\|\,q)=\int p(Q)\log\frac{p(Q)}{q(Q)}\,\mathrm{d}Q.$

The expected KL loss, also known as the KL risk function, for a density estimator q is the expectation of KL loss over the distribution of the data and is given as

(28) $R_{\mathrm{KL}}(q)=\mathbb{E}\left[D_{\mathrm{KL}}(p\,\|\,q)\right].$

To evaluate the performance of different predictive density estimators, we compare the KL risk performance of three methods in estimating the future density, i.e., Bayesian predictive, Bayesian plugin, and ML plugin estimators. The next lemma helps to find the KL risk functions.

Lemma 2. The KL loss between two multivariate normal distributions 𝒩p(μ1,Σ1) and 𝒩p(μ2,Σ2) is given by

$\mathrm{KL}\left(\mathcal{N}_p(\mu_1,\Sigma_1)\,\|\,\mathcal{N}_p(\mu_2,\Sigma_2)\right)=\frac{1}{2}\left[\mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right)+(\mu_2-\mu_1)^\top\Sigma_2^{-1}(\mu_2-\mu_1)-p+\log\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\right].$

The KL divergence between a multivariate normal distribution and a multivariate Student's t distribution is given by

$\mathrm{KL}\left(\mathcal{N}_p(\mu_1,\Sigma_1)\,\|\,T_p(\mu_2,\Sigma_2,\nu)\right)=\frac{1}{2}\left[\mathrm{tr}\left(\Sigma_1^{-1}\Sigma_2\right)+(\mu_2-\mu_1)^\top\Sigma_1^{-1}(\mu_2-\mu_1)-p+\log\frac{\det(\Sigma_2)}{\det(\Sigma_1)}\right]-\frac{p}{2}\cdot\frac{\nu-p+1}{\nu},$

where tr(⋅) denotes the trace of a given matrix.

Proof. The KL loss function between two multivariate normal densities is straightforward and therefore omitted. Given the density function of the multivariate normal distribution as

$p(Q)=\frac{1}{(2\pi)^{p/2}|\Sigma_1|^{1/2}}\exp\left[-\frac{1}{2}(Q-\mu_1)^\top\Sigma_1^{-1}(Q-\mu_1)\right]$

and the density function of the multivariate Student's t distribution as

$q(Q)=\frac{\Gamma\left(\frac{\nu+p}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)(\nu\pi)^{p/2}|\Sigma_2|^{1/2}}\left[1+\frac{1}{\nu}(Q-\mu_2)^\top\Sigma_2^{-1}(Q-\mu_2)\right]^{-\frac{\nu+p}{2}},$

substituting these into the KL loss function in Eq. (27) and simplifying them gives the result.
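
The first formula of Lemma 2 translates directly into code. A small R helper computing the closed-form KL loss between two multivariate normal distributions, followed by a toy call:

```r
# KL(N_p(mu1, Sigma1) || N_p(mu2, Sigma2)), first formula of Lemma 2.
kl_mvn <- function(mu1, Sigma1, mu2, Sigma2) {
  p      <- length(mu1)
  S2i    <- solve(Sigma2)
  d      <- mu2 - mu1
  logdet <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)
  0.5 * (sum(diag(S2i %*% Sigma1)) +        # tr(Sigma2^{-1} Sigma1)
         drop(t(d) %*% S2i %*% d) - p +     # Mahalanobis term minus p
         logdet(Sigma2) - logdet(Sigma1))   # log determinant ratio
}
kl_mvn(c(0, 0), diag(2), c(1, 0), 2 * diag(2))  # toy example, approx. 0.44
```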

4.1.1 KL risk function comparison simulation

We conducted a simulation study to evaluate the KL risk for different sample sizes using three methods: Bayesian predictive, Bayesian plugin, and ML plugin estimators. The true parameters and hyperparameters used in the simulation are as follows. The true parameters are defined as

$\mu=\begin{pmatrix}1\\2\\3\end{pmatrix},\quad \Sigma=\begin{pmatrix}1&0.5&0.3\\0.5&2&0.4\\0.3&0.4&3\end{pmatrix},$

and the prior parameters are defined as

$\mu_0=\begin{pmatrix}0\\0\\0\end{pmatrix},\quad \kappa_0=0.0001,\quad \nu_0=3,\quad \Psi_0=\begin{pmatrix}1&0&0\\0&1&0\\0&0&1\end{pmatrix}.$

Table 3. KL risk comparison of different density estimators with different sample sizes.


As seen from Table 3 and Fig. 2, the KL risks for the Bayesian predictive and plugin density estimators are quite close. Both risk functions are fairly small, and the difference between them decreases as the sample size n increases. Moreover, the KL risk for the ML plugin density estimator is smaller than that for the other two estimators, and it decreases as n increases. This suggests that the ML plugin density estimator performs better when estimating the new density. However, it is important to note that both Bayesian estimators were based on noninformative priors by choosing a very small value for κ0.

Figure 2 visualizes this comparison of KL risk across the different sample sizes.

Figure 2. KL risk function comparison of different density estimators for different sample sizes.

5 Applications in climatology

In this section, we apply the methodologies discussed – specifically the point estimation methods (both ML and Bayesian) and density estimations – to real climatological data. This approach demonstrates the practical utility and effectiveness of these statistical techniques in analyzing and interpreting complex boxplot-valued environmental data. The data are sourced from the Berkeley Earth Surface Temperature Study, which aggregates 1.6 billion temperature reports from 16 pre-existing archives affiliated with the Lawrence Berkeley National Laboratory (data can be found in the Berkeley Earth Surface Temperature Study at https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data, last access: 19 July 2024).

The section is divided into two subsections, each focusing on a distinct aspect of climatological analysis. First, we examine the monthly average temperatures in the United States from January 2000 to September 2013. In the second subsection, we analyze summer average temperatures in European countries, with data going back to the 18th century. Both datasets are extensive, and aggregating them into boxplots allows for computational efficiency by representing large amounts of information in a concise form.

5.1 Analysis of monthly average temperatures in the United States

5.1.1 Point estimators

The dataset used for this analysis includes monthly average temperatures from various cities across the United States, spanning from January 2000 to September 2013. These extensive data provide a comprehensive view of temperature trends and variations over a significant period. Here, each Xi represents the monthly average temperature summaries across multiple cities for month i within this period, capturing the variability of average temperatures in the form of symbolic data.

For each month, the first quartile (q1), median, and third quartile (q3) of the average temperatures across cities were calculated. Thus, each Qi=(q1i,mi,q3i) represents a vector of three summary statistics for month i, where the dimension of the data is defined as p=1, corresponding to the single temperature variable. These quartiles are then used to estimate the mean vector and covariance matrix using both ML and Bayesian methods.
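
The construction of these symbolic triples can be sketched in R as follows; the data frame temps, with columns month and avg_temp, is a hypothetical stand-in for the Berkeley Earth file (the actual Kaggle file uses its own column names).

```r
# Build symbolic triples Q_i = (q1, m, q3) per month from raw city-level
# average temperatures; `temps` is a simulated stand-in.
set.seed(2)
temps <- data.frame(
  month    = rep(sprintf("2000-%02d", 1:12), each = 50),  # 12 months x 50 cities
  avg_temp = rnorm(12 * 50, mean = 12, sd = 8)
)
Q <- t(sapply(split(temps$avg_temp, temps$month),
              quantile, probs = c(0.25, 0.50, 0.75)))
colnames(Q) <- c("q1", "m", "q3")
head(Q)
```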

The total number of monthly observations, n=165, spans from January 2000 to September 2013, resulting in 165 vectors of symbolic data, each representing monthly summary statistics of average temperatures. The hyperparameters for the Bayesian estimation are as follows: κ=0.001, μ0=(0,0,0), ν0=3, Ψ0=I3×3, a=0.3, b=0.4, c=1.8, and d=2.0.

Table 4 summarizes the results from both the ML and Bayesian estimations.

Table 4. Comparison of ML and Bayesian estimates for monthly average temperatures in the United States.


5.1.2 Density estimators

In this section, we explore three methods for estimating predictive densities: posterior predictive density, the ML method, and the Bayesian plugin method. These methods provide different approaches to predicting future monthly average temperatures in the United States.

The ML and Bayesian plugin density estimators, using the ML and Bayesian estimators for μ, Σ, and λ from Table 4, can be obtained easily and are denoted by p(Q*μ^,Σ^,λ^) and p(Q*μ^Bayes,Σ^Bayes,λ^Bayes), respectively. On the other hand, the posterior predictive density for a new random variable Q* given the observations Q1,,Qn when λ is unknown is a mixture of multivariate Student's t distributions, as given in Eq. (26). If λ has already been estimated using ML or Bayesian methods, the posterior predictive density simplifies to a Student's t distribution as shown in Eq. (25).

Figure 3. Boxplots of summer average temperatures in European countries.

5.2 Analysis of summer average temperatures in European countries

In this subsection, we extend our analysis to examine summer average temperatures across various European countries, utilizing historical climate data that provide comprehensive temperature records dating back to the 18th century. The dataset includes average monthly temperature data measured from numerous weather stations situated in major cities across Europe, allowing for an in-depth analysis of long-term climate trends and variations.

The analysis focuses on temperature records from 41 European countries, covering the three summer months June, July, and August. This extensive temporal and spatial coverage enables us to explore regional differences and similarities in temperature trends, contributing to a broader understanding of global climatic patterns.

Table 5. ML and Bayesian estimates for summer average temperatures in European countries.


The dataset was filtered to include only European countries, with the average temperatures for the summer months extracted for further analysis. Boxplots were then generated to visualize the distribution of average temperatures across countries during the summer season. Each boxplot represents the distribution of average temperatures for a specific country across the three summer months.
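
A figure in this style can be produced with ggplot2 (cf. Wickham and Wickham, 2016); in the sketch below, eu is a hypothetical stand-in for the filtered European subset.

```r
# One boxplot of summer average temperatures per country (cf. Fig. 3).
library(ggplot2)
set.seed(5)
eu <- data.frame(
  country  = rep(c("France", "Italy", "Norway"), each = 90),
  avg_temp = rnorm(270, mean = rep(c(19, 22, 13), each = 90), sd = 2)
)
ggplot(eu, aes(x = country, y = avg_temp)) +
  geom_boxplot() +
  labs(x = NULL, y = "Summer average temperature (°C)")
```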

Figure 3 presents these boxplots, providing a clear illustration of the temperature variations and trends across Europe. This comparative study highlights the regional climatic patterns, offering valuable insights into the broader context of global climate change.

To further understand the differences between the ML and Bayesian approaches in estimating summer average temperatures in European countries, we present a comparative analysis of the estimators for the mean vectors and covariance matrices across the three summer months: June, July, and August. Table 5 summarizes the results.

Table 5 presents a side-by-side comparison of the ML and Bayesian estimators for the mean vectors and covariance matrices of summer average temperatures in European countries. The results show a close alignment between the ML and Bayesian estimates for the mean vectors across the three summer months. However, the covariance matrices exhibit some differences, particularly in the off-diagonal elements, which suggest variations in how each method captures the relationships between temperature readings across different months.

For instance, the Bayesian estimators tend to produce more consistent estimates across the covariance matrices, potentially reflecting the smoothing effect of prior information. The parameters Rmin and Rmax for each month also indicate that the range of temperature variations is slightly narrower in the Bayesian framework, further highlighting the influence of the priors.

6 Discussion

In this article, we have introduced a novel method for estimating the parameters of boxplot-valued random variables from both Bayesian and frequentist perspectives. Working with boxplot-valued random variables enables us to efficiently summarize large datasets, as boxplots concisely represent extensive amounts of information. Our method offers a significant computational advantage, as we derived closed-form solutions for both ML and Bayesian estimators. These closed-form expressions allow for direct computation of parameter estimates without the need for iterative optimization procedures or Markov chain Monte Carlo (MCMC)-type methods, which are commonly required in traditional Bayesian approaches. By avoiding these computationally intensive steps, our method reduces the overall complexity and computational load, making it particularly suitable for large datasets.

Additionally, we have explored the density estimation of future random boxplots, providing a comprehensive tool for investigating and forecasting these distributions. Our approach was evaluated through simulations and demonstrated its accuracy and computational efficiency when compared with existing methods. Furthermore, we applied the proposed techniques to real environmental data to illustrate their practical utility and effectiveness in real-world scenarios.

Boxplot-valued data analysis is particularly useful in contexts where data are collected in large volumes and need to be summarized effectively without losing significant information. By adopting both Bayesian and frequentist methods, we provide a comprehensive framework that leverages the strengths of each approach. The Bayesian methods offer a probabilistic interpretation and incorporate prior information, which is valuable when data are scarce or expensive to obtain. On the other hand, the frequentist methods provide consistency and robustness when large amounts of data are available.

Furthermore, our study on the density estimation of future random boxplots enhances predictive modeling capabilities, which is crucial for applications such as environmental monitoring and climate forecasting. By accurately estimating the distribution of future data, decision-makers can make more informed predictions and plans.

Through extensive simulations, we validated the performance of our proposed methods, showcasing their accuracy and reliability. The application to real environmental data further underscores the practical relevance and adaptability of our techniques. Moreover, although the multivariate normal model theoretically allows for the possibility of overlapping quartiles, our empirical studies – including both simulations and real-world applications – consistently resulted in ordered quartiles (q1i ≤ mi ≤ q3i). This empirical consistency indicates that, within the context of our data and analysis, the model effectively preserves the natural ordering of quartiles, thereby mitigating the theoretical limitation in practical scenarios. Overall, this work contributes to the growing field of symbolic data analysis by offering efficient and effective tools for handling boxplot-valued data, thereby broadening the scope and applicability of statistical methodologies in various domains.

Code availability

The R code used for the analysis in this study is not publicly available but can be shared upon reasonable request. Due to institutional policies and ongoing related research, we have opted not to deposit the code in a public repository at this time. However, researchers interested in accessing the code may contact the corresponding author at asadeghkhani@ncat.edu.

Data availability

The data used in this study are publicly available and can be accessed from the Kaggle repository at https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data (Berkeley Earth2017). The dataset was obtained from this source and used in accordance with its terms of use. Researchers interested in the dataset can directly access it via the provided link.

Author contributions

AbS and AlS contributed equally to the development of the statistical methodology, study design, and data analysis. AbS prepared the initial manuscript draft, and both authors participated in its review and revision.

Competing interests

The contact author has declared that neither of the authors has any competing interests.

Disclaimer

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

Acknowledgements

The authors sincerely thank the editors and the two anonymous reviewers for their constructive feedback, which significantly improved the quality of this paper.

Review statement

This paper was edited by Reetam Majumder and reviewed by two anonymous referees.

References

Arroyo, J., Maté, C., and Roque, A. M.-S.: Hierarchical clustering for boxplot variables, in: Data Science and Classification, edited by: Batagelj, V., Bock, H.-H., Ferligoj, A., and Žiberna, A., Springer, 59–66, https://doi.org/10.1007/3-540-34416-0_7, 2006.

Benjamini, Y.: Opening the box of a boxplot, Am. Stat., 42, 257–262, 1988.

Berkeley Earth: Climate Change: Earth Surface Temperature Data, Kaggle [data set], https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data (last access: 19 July 2024), 2017.

Billard, L. and Diday, E.: Regression analysis for interval-valued data, in: Data Analysis, Classification, and Related Methods, edited by: Kiers, H. A. L., Rasson, J. P., Groenen, P. J. F., and Schader, M., Springer, 369–374, https://doi.org/10.1007/978-3-642-59789-3_58, 2000.

Billard, L. and Diday, E.: From the statistics of data to the statistics of knowledge: symbolic data analysis, J. Am. Stat. Assoc., 98, 470–487, 2003.

Chambers, J. M.: Graphical Methods for Data Analysis, Chapman and Hall/CRC, https://doi.org/10.1201/9781351072304, 2018.

Diday, E.: The symbolic approach in clustering and related methods of data analysis: the basic choices, in: Classification and Related Methods of Data Analysis, Proceedings of the First Conference of the International Federation of Classification Societies (IFCS-87), Technical University of Aachen, 673–684, North Holland, NII Article ID 10011477669, 1988.

Diday, E.: Probabilist, possibilist and belief objects for knowledge analysis, Ann. Oper. Res., 55, 225–276, 1995.

Diday, E. and Noirhomme-Fraiture, M.: Symbolic Data Analysis and the SODAS Software, John Wiley & Sons, ISBN 978-0470018835, 2008.

Douzal-Chouakria, A., Billard, L., and Diday, E.: Principal component analysis for interval-valued observations, Stat. Anal. Data Min., 4, 229–246, 2011.

Émilion, R.: Différentiation des capacités et des intégrales de Choquet, C.R. Acad. Sci. I-Math., 324, 389–392, 1997.

Ferguson, T. S.: A Course in Large Sample Theory, Routledge, ISBN 0412043718, 2017.

Le-Rademacher, J. and Billard, L.: Likelihood functions and some maximum likelihood estimators for symbolic data, J. Stat. Plan. Infer., 141, 1593–1602, 2011.

Neto, E. d. A. L. and De Carvalho, F. D. A.: Centre and range method for fitting a linear regression model to symbolic interval data, Comput. Stat. Data An., 52, 1500–1515, 2008.

Neto, E. d. A. L. and De Carvalho, F. D. A.: Constrained linear regression models for symbolic interval-valued variables, Comput. Stat. Data An., 54, 333–347, 2010.

Reyes, D. M., de Souza, R. M., and de Oliveira, A. L.: A three-stage approach for modeling multiple time series applied to symbolic quartile data, Expert Syst. Appl., 187, 115884, https://doi.org/10.1016/j.eswa.2021.115884, 2022.

Reyes, D. M., Souza, L. C., de Souza, R. M., and de Oliveira, A. L.: Parametrized linear regression for boxplot-multivalued data applied to the Brazilian Electric Sector, Inform. Sci., 652, 119758, https://doi.org/10.1016/j.ins.2023.119758, 2024.

Sadeghkhani, A. and Sadeghkhani, A.: Multivariate Interval-Valued Models in Frequentist and Bayesian Schemes, arXiv [preprint], https://doi.org/10.48550/arXiv.2405.06635, 2024.

Samadi, S. Y., Billard, L., Guo, J. H., and Xu, W.: MLE for the parameters of bivariate interval-valued model, Adv. Data Anal. Classif., 18, 827–850, https://doi.org/10.1007/s11634-023-00546-6, 2024.

Tukey, J.: Exploratory Data Analysis, Springer, https://doi.org/10.1007/978-3-031-20719-8_2, 1977.

Wickham, H. and Wickham, H.: Getting started with ggplot2, in: ggplot2: Elegant Graphics for Data Analysis, Springer, 11–31, https://doi.org/10.1007/978-3-319-24277-4_2, 2016.

Xiong, T., Li, C., Bao, Y., Hu, Z., and Zhang, L.: A combination method for interval forecasting of agricultural commodity futures prices, Knowl.-Based Syst., 77, 92–102, 2015.

Xu, M. and Qin, Z.: A bivariate Bayesian method for interval-valued regression models, Knowl.-Based Syst., 235, 107396, https://doi.org/10.1016/j.knosys.2021.107396, 2022.

Zhao, Q., Wang, H., and Wang, S.: Robust regression for interval-valued data based on midpoints and log-ranges, Adv. Data Anal. Classif., 17, 583–621, 2023.

Short summary
This paper presents a novel study on boxplot-valued data in climatological applications. Our methodologies are applied to the Berkeley Earth Surface Temperature Study. We validate our approaches through comprehensive simulations, comparing Bayesian and frequentist estimators for efficiency and accuracy. The results provide robust insights into climatic trends, particularly summer average temperatures across European countries.