Gaussian process (GP) regression is a flexible modeling technique used to predict outputs and to capture uncertainty in the predictions. However, GP regression becomes computationally intensive when the spatial training dataset has a large number of observations. To address this challenge, we introduce a scalable GP algorithm, termed MuyGPs, which incorporates nearest-neighbor and leave-one-out cross-validation during training. This approach enables the evaluation of large spatial datasets with state-of-the-art accuracy and speed in certain spatial problems. Despite these advantages, conventional quadratic loss functions used in the MuyGPs optimization, such as root mean squared error (RMSE), are highly influenced by outliers. We explore the behavior of MuyGPs in cases involving outlying observations and, subsequently, develop a robust approach to handle and mitigate their impact. Specifically, we introduce a novel leave-one-out loss function based on the pseudo-Huber function (LOOPH) that effectively accounts for outliers in large spatial datasets within the MuyGPs framework. Our simulation study shows that the LOOPH loss method maintains accuracy despite outlying observations, establishing MuyGPs as a powerful tool for mitigating the impact of unusual observations in the large data regime. In the analysis of US ozone data, MuyGPs provides accurate predictions and uncertainty quantification, demonstrating its utility in managing data anomalies. Through these efforts, we advance the understanding of GP regression in spatial contexts.

Gaussian process (GP) regression is widely known to be a powerful and versatile framework for modeling non-linear relationships in various fields. Particularly, in spatial data analysis, GP regression's capability to effectively account for the correlation among all data points makes it an attractive choice for interpolating highly non-linear targets

Fundamentally, in spatial data analysis, GP estimation aims to learn the covariance model hyperparameters (denoted as

It is imperative to reinforce the MuyGPs algorithm against potential outliers that could compromise the accuracy of the GP predictions and associated uncertainties. Outliers are present in numerous domains such as environmental data, where variables such as air quality can exhibit outlying values. Researchers such as

To address concerns in both hyperparameter estimation and prediction from spatial datasets with outliers, we propose a refinement to the MuyGPs optimization algorithm. First, we introduce a novel contribution called the leave-one-out pseudo-Huber (LOOPH) loss function. The LOOPH loss function combines ideas popularized by

In this section, we outline in detail the methodology behind the robust approach applied to the MuyGPs algorithm. At the core of our methodology lies the integration of the variance-regularizing pseudo-Huber loss function, pioneered by

Consider a spatial GP regression of the form:

The methodology behind MuyGPs is derived from the union of two concepts:

The MuyGPs training process then minimizes a loss function, such as the mean squared error (MSE), the cross-entropy loss, or the leave-one-out likelihood (LOOL) loss, over a randomly sampled batch of training points,

For a randomly selected training batch

In the context of GP regression, addressing outliers is pivotal for ensuring model robustness and reliable predictions. To tackle this challenge, we turn to the pseudo-Huber loss function, which has garnered recognition for its effectiveness in reducing the impact of outliers

The pseudo-Huber loss function for different boundary-scale values. Each curve represents a different boundary scale, demonstrating the transition from quadratic behavior for small residuals to linear behavior for large residuals.
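The quadratic-to-linear transition described in this caption can be checked numerically. The following is a minimal sketch of the standard pseudo-Huber function, delta^2 (sqrt(1 + (r / delta)^2) - 1), written with NumPy. The function name `pseudo_huber`, the parameter name `boundary_scale`, and the example values are illustrative assumptions, and the sketch is not the LOOPH loss itself.

```python
import numpy as np


def pseudo_huber(residual, boundary_scale=1.0):
    """Standard pseudo-Huber loss: delta^2 * (sqrt(1 + (r / delta)^2) - 1).

    `boundary_scale` (delta) sets where the loss switches from roughly
    quadratic (|r| << delta) to roughly linear (|r| >> delta) growth.
    """
    r = np.asarray(residual, dtype=float)
    delta = float(boundary_scale)
    return delta**2 * (np.sqrt(1.0 + (r / delta) ** 2) - 1.0)


# Illustrative residuals spanning both regimes (values are arbitrary).
residuals = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
for delta in (0.5, 1.0, 3.0):
    print(f"boundary scale {delta}:", pseudo_huber(residuals, delta))
```

For residuals much smaller than the boundary scale the loss is close to r^2 / 2, and for much larger residuals it grows like delta |r|, which is why a single extreme observation contributes far less than it would under a quadratic loss.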

To enhance the loss function's sensitivity to variance-affecting parameters, we introduce a novel method, the LOOPH. This method scales and regularizes the pseudo-Huber loss, ensuring that it reacts more strongly to parameters influencing variance. The formulation of the LOOPH is as follows:

Plots of the LOOPH function

Our comprehensive set of visualizations in Fig. 2 includes heatmaps that depict the loss surface across a range of

Comparison of loss functions for different values of

To compare LOOL and LOOPH loss functions, we tested them with different

To enhance the robustness and predictive reliability of the MuyGPs algorithm, we consider an innovative strategy that incorporates down-sampling of nearest neighbors and repeated evaluations to derive a central value of the distribution such as the median. This technique aims to strengthen the stability and accuracy of the MuyGPs training and prediction process, particularly when confronted with outliers and other perturbations. In our analysis, we assume that we know the true

Down-sampling algorithm.

Begin by selecting the

Down-sample the nearest-neighbor points and compute the objective function using Bayesian optimization to obtain predictions for the

Fix the robust

Estimate the

Down-sample the nearest neighbors again to predict the response of the test data at a fixed number of iterations. Each iteration results in a distribution of predictions. Compute the median of these predictions to obtain a central value, contributing to robustness and predictive reliability, especially when dealing with outliers and other sources of variability.
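The final prediction step of the algorithm, repeated down-sampling of the nearest neighbors followed by a median over the resulting predictions, can be sketched as follows. This is only an illustration under stated assumptions: it uses plain NumPy rather than the MuyGPs implementation, a squared-exponential kernel with fixed hyperparameters (`length_scale`, `noise`) in place of the hyperparameters estimated by Bayesian optimization, and hypothetical names and default values (`nn_count`, `sub_count`, `n_iter`) throughout.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between the rows of a and b (assumed kernel)."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)


def local_gp_mean(x_nn, y_nn, x_star, length_scale=0.5, noise=1e-3):
    """GP posterior mean at x_star conditioned only on the given neighbors."""
    k_nn = rbf_kernel(x_nn, x_nn, length_scale) + noise * np.eye(len(x_nn))
    k_cross = rbf_kernel(x_star[None, :], x_nn, length_scale)  # shape (1, m)
    return (k_cross @ np.linalg.solve(k_nn, y_nn)).item()


def downsampled_median_predict(x_train, y_train, x_test,
                               nn_count=30, sub_count=15, n_iter=25, seed=0):
    """Median over n_iter predictions, each using a random neighbor subset."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_iter, len(x_test)))
    for j, x_star in enumerate(x_test):
        # Brute-force search for the nn_count nearest training points.
        dists = np.linalg.norm(x_train - x_star, axis=1)
        nn_idx = np.argsort(dists)[:nn_count]
        for it in range(n_iter):
            # Down-sample the neighbor set without replacement.
            sub_idx = rng.choice(nn_idx, size=sub_count, replace=False)
            preds[it, j] = local_gp_mean(x_train[sub_idx], y_train[sub_idx], x_star)
    # The median across iterations is the central value used as the prediction.
    return np.median(preds, axis=0)


# Small synthetic demonstration (data and settings are arbitrary).
rng = np.random.default_rng(42)
x_train = rng.uniform(0.0, 1.0, size=(400, 2))
y_train = np.sin(2 * np.pi * x_train[:, 0]) + np.cos(2 * np.pi * x_train[:, 1])
x_test = rng.uniform(0.2, 0.8, size=(5, 2))
print(downsampled_median_predict(x_train, y_train, x_test))
```

Because each iteration conditions on a different random subset of neighbors, an outlying neighbor enters only some of the predictions, and the median limits its influence on the reported value.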

The validation of the described robust approach will be presented in the subsequent section, where we delve into the numerical results and performance analysis. We will then test a hybrid method, where we only use the down-sample strategy for

To assess the effectiveness of the proposed LOOPH method and batch sub-sampling technique, we conducted a series of experiments using a simulated dataset and a real dataset. Throughout these experiments, we followed the structure below:

We fitted MuyGPs models using the following three methods:

regular sampling method, which is the traditional MuyGPs implementation;

hybrid method, which involves the down-sampling of nearest-neighbor indices only for

down-sampling method (see Algorithm 1).

For each MuyGPs model, we applied the LOOL (Eq. 6) and LOOPH (Eq. 8) loss functions.

We fitted three additional models that employ the negative log likelihood (NLL) as the underlying loss function. The models are as follows:

conventional GP, a conventional GP model fit using the Fields R package

LAGP, a local approximate GP regression model for large spatial datasets;

Student's

For all of the models, we used pure (non-outlying) and outlying data for comparison and varied three

The simulation study employs a simple two-dimensional curve generated from

In order to investigate the robustness of our model to outliers, we introduce anomalous data points into the training set. This is achieved by randomly selecting a subset of training data indices and multiplying the corresponding target values by a specified factor. Specifically, we randomly choose
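The injection procedure, selecting a random subset of training indices and multiplying the corresponding targets by a factor, can be sketched in a few lines of NumPy. The function name `inject_outliers` and the values of `fraction` and `factor` used below are hypothetical and chosen only for illustration; they are not the settings of the simulation study.

```python
import numpy as np


def inject_outliers(targets, fraction, factor, seed=0):
    """Return a copy of `targets` with a random subset scaled by `factor`.

    Also returns the sorted indices of the modified entries so the
    injected points can be identified later.
    """
    rng = np.random.default_rng(seed)
    y = np.array(targets, dtype=float, copy=True)
    n_out = int(round(fraction * y.shape[0]))
    idx = rng.choice(y.shape[0], size=n_out, replace=False)
    y[idx] *= factor  # multiply the selected targets by the outlier factor
    return y, np.sort(idx)


# Illustrative usage with arbitrary settings.
y_clean = np.linspace(1.0, 2.0, 100)
y_outlying, outlier_idx = inject_outliers(y_clean, fraction=0.1, factor=5.0)
print(len(outlier_idx), "targets modified")
```

Returning the modified indices keeps the clean and contaminated training sets directly comparable, since the same points can be located in both.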

Comparison of the spatial training data after injecting outliers. The first box plot represents the data with no outliers, while the second box plot shows the data after injecting

Analysis of outlier effects across three models. Panels

The plots in Fig. 6 provide insight into our findings regarding outlier effects. In the left column, we analyze the residuals computed from three models. Notably, the model trained on data with outliers exhibits considerably larger residuals, which could potentially impact the validity of our inferences. In the middle column, we examine the size of the

The simulation results are all based on

Results for model evaluation using RMSE, MDV, and median CI size metrics for non-outlying data. The

We next summarize the results of our simulation study in Table 2, demonstrating our models' effectiveness in capturing underlying data patterns and their robustness in handling outliers.

Results for model evaluation using RMSE, MDV, and median CI size metrics for data with 10 % outliers in it. The

The outcomes presented in Tables 1 and 2 reveal several noteworthy insights. Recall that the traditional implementation of the MuyGPs method appears here as regular sampling with the LOOL loss function. The other MuyGPs rows correspond to methods that we propose, which use either a new loss function or a novel use of the data to account for outliers; the conventional GP methods and one existing robust method are included for comparison. Our models are highly accurate on clean data for both loss functions. This is evident in the low RMSE, especially when

In this subsection, we analyze the US air quality data from various locations within Los Angeles (LA), CA, in 1988. We considered the region's historical ozone levels, which have been notably high due to its status as a large metropolitan area. Throughout the 1980s and 1990s, LA recorded ozone levels exceeding 200 parts per billion (ppb). Although this dataset does not contain significant outliers, it is still critical to use a robust approach for accurate environmental analysis to account for potential future outliers that could be caused by climate change. Ozone levels are typically influenced by numerous factors, including weather patterns, emissions from various sources, and chemical reactions in the atmosphere. Our robust modeling approach ensures that the analysis remains reliable even when data variability is high or when there are subtle anomalies that traditional methods might overlook. By applying our robust techniques, we can better account for the complex nature of this dataset and improve the reliability of predictions and interventions aimed at mitigating air pollution.

We collected meteorological data from the National Climatic Data Center (NCDC), which provides

Panel

We followed a modeling approach similar to that of our simulation study, using temperature and wind speed as the variables in our feature matrix and daily ozone concentration as our target variable. The feature variables were normalized using min–max scaling to transform the values to a range of

Results for MuyGPs models evaluation using RMSE, MDV, and Median CI size. Bold values represent the best accuracy and UQ statistics for each metric setting.

Results for MuyGPs models evaluation using RMSE, MDV, and Median CI size for Ozone data with 10 % outliers generated in it. Bold values represent the best accuracy and UQ statistics for each metric setting.

The findings in Tables 3 and 4 provide a comprehensive view of the performance evaluation metrics for the different batch sampling methods. They show how the models respond to various conditions, including the presence of outliers and the choice of loss function (LOOL or LOOPH). Generally, these results highlight the trade-offs between accuracy and robustness in GP modeling. Even without injected outliers, our LOOPH loss method demonstrates improved uncertainty quantification, where smaller variances better represent the true uncertainty in the data. Further, the hybrid sampling method with the LOOPH loss function emerges as the most favorable approach for the ozone dataset with

In this study, we investigated the behavior and robustness of GP regression models, focusing on a scalable GP algorithm called MuyGPs, when confronted with outlier-affected spatial datasets. We proposed a novel leave-one-out pseudo-Huber (LOOPH) loss method and a down-sampling strategy to enhance the algorithm's robustness and improve its prediction capability. Our numerical studies, conducted on both simulated and real-world datasets, provided valuable insights into the capabilities of MuyGPs in handling outliers and improving the reliability of GP regression models.

The simulation experiments revealed that MuyGPs, when featuring the LOOPH loss method, maintains low RMSE, small MDV, and accurate confidence intervals even in the presence of extreme observations. Additionally, the down-sampling approach further improved the model's robustness and predictive capabilities, especially when dealing with outlier-affected data, highlighting its potential as a powerful tool for mitigating the adverse effects of unusual observations.

Analyzing real-world US ozone data from LA in 1988, we observed that MuyGPs using the LOOPH loss method provides accurate predictions and uncertainty quantification, even when outliers are present. The down-sampling strategy reinforced the algorithm's robustness, making it an attractive choice for applications involving large spatial datasets with potential outliers.

Our study underscores the importance of considering the impact of outliers when employing GP regression models and highlights the potential of the MuyGPs algorithm, especially when featuring the proposed LOOPH loss method and down-sampling techniques. These tools offer practitioners a means of maintaining predictive accuracy and reliable uncertainty quantification even in challenging and large spatial data scenarios. Overall, this work contributes to advancing the understanding of GP regression in the spatial context and offers practical solutions to enhance its applicability in the presence of outliers in the large spatial data regime.

Below we illustrate all the metrics computed to evaluate different GP models during the simulation study.

Results for model evaluation using RMSE, CRPS, MAD, MDV, median CI size, and coverage. The

Results for model evaluation using RMSE, CRPS, MAD, MDV, median CI size, and coverage for data with

Here we report the results obtained from fitting different MuyGPs models for the US ozone data.

Results for MuyGPs models evaluation using RMSE, MDV, and Median CI size. Bold values represent the best accuracy and UQ statistics for each metric setting.

Results for MuyGPs models evaluation using RMSE, MDV, and Median CI size for Ozone data with 10 % outliers generated in it. Bold values represent the best accuracy and UQ statistics for each metric setting.

All code used to produce the results in this paper can be accessed through

The maximum daily 8 h average ozone data can be accessed through the US EPA's Air Data website at

JM conducted all simulations and data analyses and authored the paper. AM and BWP contributed to the paper by providing editorial support and guidance throughout the project and are the creators of the MuyGPs algorithm. BWP integrated the developed methods into the MuyGPs algorithm.

The contact author has declared that none of the authors has any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims made in the text, published maps, institutional affiliations, or any other geographical representation in this paper. While Copernicus Publications makes every effort to include appropriate place names, the final responsibility lies with the authors.

This work was performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under contract no. DE-AC52-07NA27344 with IM release no. LLNL-JRNL-860785 and was supported by the LLNL Lab Directed Research and Development (LDRD) program under project no. 22-ERD-028. We extend our sincere gratitude to Lawrence Livermore National Laboratory for their invaluable support and resources throughout the course of this research. Their commitment to scientific excellence and dedication to advancing knowledge have been instrumental in the success of this study. We are deeply appreciative of their contributions to our work.

This research has been supported by the LLNL-LDRD program (project no. 22-ERD-028).

This paper was edited by Likun Zhang and reviewed by two anonymous referees.