Scientific records of temperature and precipitation have been kept for several hundred years, but for many areas only a shorter record exists. To understand climate change, rigorous statistical reconstructions of the paleoclimate using proxy data are needed. Paleoclimate proxy data are often sparse, noisy, indirect measurements of the climate process of interest, making each proxy uniquely challenging to model statistically. We reconstruct spatially explicit temperature surfaces from sparse and noisy measurements recorded at historical United States military forts and other observer stations from 1820 to 1894. One common method for reconstructing the paleoclimate from proxy data is principal component regression (PCR), in which one learns a statistical relationship between the proxy data and a set of climate observations that serve as patterns for potential reconstruction scenarios. We explore PCR in a Bayesian hierarchical framework, extending classical PCR in several ways. First, we model the latent principal components probabilistically, accounting for measurement error in the observational data. Next, we extend our method to better accommodate outliers that occur in the proxy data. Finally, we explore alternatives to truncating the lower-order principal components using different regularization techniques. A fundamental challenge in paleoclimate reconstruction efforts is the lack of out-of-sample data for predictive validation. Cross-validation offers a potential remedy, but it is computationally expensive and can be sensitive to outliers in sparse data scenarios. To overcome these limitations, we test our methods in a simulation study, applying proper scoring rules, including a computationally efficient log-score approximation to leave-one-out cross-validation, to validate model performance.
The result of our analysis is a spatially explicit reconstruction of spatio-temporal temperature from a very sparse historical record.

There is a need for accurate estimates of paleoclimate, especially
temperature and precipitation, to better understand how climate has changed
in the past. Scientific measurements of temperature and precipitation have
been recorded for several hundred years, though in many locations only for a
much shorter time. Because of long-standing interest in weather, there is a
vast number of anecdotal, nonscientific weather records. However, many
reconstructions of paleoclimate using compiled historical records are not
amenable to direct statistical analysis because they consist of imprecise
measurements of weather reported in letters, newspapers, books, and other
documents

Historical observer weather data are often unreliable, sparse both temporally
and spatially, and noisy because these data were recorded before widespread
adoption of scientific measurement standards. As a result, historical
observer weather data have not been widely used for rigorous statistical
reconstructions of climate because these challenges make it difficult to
create generic statistical approaches for analysis. Historical observer
climate data may be recorded at hourly, daily, or monthly timescales, and the
current-era analog data used to train statistical models can also vary in
temporal resolution. Therefore, there is often a change of temporal support
between the historical observer and current-era analog data that must be
accounted for

Another complication is that the true target one wishes to predict (the historical, unobserved climate) is never available to evaluate model predictive performance. Moreover, the historical observer data are often of unknown or of varying reliability and are typically sparse, sometimes involving only a few locations per year. The consequences of such data characteristics for evaluating model performance are underexplored; hence, we explore methods to validate historical observer-era model predictions under these sparse data scenarios.

We used spatially and temporally sparse historical observer measurements of temperature recorded at United States (US) military forts and other historical observer stations to reconstruct spatially explicit maps of mean mid-day July temperature, leveraging modern spatially explicit current-era analog data to impute missing spatial structure. We perform the reconstruction within a model framework that accounts for uncertainty in the current-era data products and in parameter estimation, and we properly evaluate predictive skill. We test eight model specifications using a simulation study, generate predictions for mid-day July temperature at approximately 20 000 locations for each year in 1820–1894 with associated uncertainties, and evaluate model performance using a computationally efficient approximation to leave-one-out cross-validation.

We used two datasets we refer to as the

Protocols varied across the observer stations through space and time, leading
to many irregularities in the historical observer data. Temperature
measurements were obtained by a variety of methods: some records report daily
minimum and maximum temperatures, others report hourly measurements, and
sometimes there are days or weeks with missing measurements. In addition, the
number and locations of the observer stations change through time, ranging
from 1 to 234 locations per year; this variation is due to historical
events, including the Civil War and the westward expansion of the US in the
late 19th century. Most years have only a few observations and, in general,
the number of observer locations per year increases through time. Therefore,
the model must align the temporal and spatial scales of the two data sources
to reconstruct continuous temperature fields across the Upper Midwest. An
example of 4 years of historical data is shown in Fig.

Because the historical observer data are spatially sparse, traditional
spatial statistical methods, such as Kriging, are not applicable, as these
methods require larger sample sizes to produce reasonable predictive
surfaces. Thus, we used the current-era analog data to provide spatial
structure for the reconstruction. For the current-era analog data, we used
the Parameter-elevation Relationships on Independent Slopes Model (PRISM)
monthly mean mid-day temperature surfaces created by interpolation of the US
Historical Climate Network (USHCN) data over the period 1895–2010

Four years of the historical observer temperature data

To enable statistical learning about climate in the historical observer
period, we aligned the two data sources to common spatial and temporal
scales. We assigned each historical observer station to the closest grid cell
in the current-era analog data, thus accounting for any potential spatial
misalignment. Because the grid we aligned to is very fine scale
(800 m
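The nearest-grid-cell assignment described above can be sketched with a k-d tree lookup. This is a minimal illustration using hypothetical grid-cell centers and station coordinates, not the actual PRISM grid:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical PRISM-like grid-cell centers (lon, lat) and two hypothetical
# observer station coordinates; real cells are on a much finer 800 m grid.
grid_centers = np.array([[-93.00, 44.00], [-92.99, 44.00],
                         [-93.00, 44.01], [-92.99, 44.01]])
stations = np.array([[-93.002, 44.001], [-92.991, 44.012]])

# Assign each station to the nearest grid cell; on a fine grid the spatial
# displacement introduced by this snap is at most half a cell diagonal.
tree = cKDTree(grid_centers)
_, cell_index = tree.query(stations)
print(cell_index)  # -> [0 3]
```

Each station then inherits the current-era analog series of its matched cell, which is what makes the two data sources comparable location by location.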

We define our models using the following notation. Scalars are denoted by
lowercase letters, vectors are bold lowercase letters, and matrices are bold
uppercase letters. Fixed values, like data, are generally represented by
Latin letters and parameters are written in Greek letters. Using this
notation, the linear mixed model for estimating daily historical observer
mean mid-day July temperature is

To facilitate parameter estimation in the presence of sparse data, the
calibration model borrows strength among days, sites, and years within the
historical observer data for the month of July, reducing the influence of
measurement error and improving prediction of the mid-day diurnal temperature
curve. By borrowing strength, the calibration model produced a mean mid-day
estimate that has less variability than the raw historical observer data. We
fit the calibration model to the historical data using

After aligning the two data sources to a common temporal scale, we
constructed a modeling framework to perform our
reconstruction. One method commonly used for the reconstruction of
paleoclimate is principal component regression (PCR), often called
empirical orthogonal function (EOF) regression in the paleoclimate
literature

To build our spatio-temporal predictive model, we used traditional principal
component regression (PCR) as well as probabilistic principal component
regression (pPCR), which assumes the empirical principal components are a
noisy measure of the true, latent principal components
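As a point of reference for the extensions that follow, classical PCR/EOF reconstruction can be sketched as follows: spatial patterns (EOFs) are extracted from a dense analog matrix, scores are estimated by least squares from a handful of observed sites, and the full surface is rebuilt from those scores. All quantities below are synthetic stand-ins, not the PRISM or fort records:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical current-era analog: n_years x n_loc temperature matrix whose
# dominant mode of variability is a smooth spatial pattern.
n_years, n_loc, k = 60, 150, 3
grid = np.linspace(0, 1, n_loc)
analog = (20 + 4 * rng.standard_normal((n_years, 1)) * np.sin(np.pi * grid)
          + rng.standard_normal((n_years, n_loc)))

# EOFs: leading right singular vectors of the centered analog matrix.
mean_field = analog.mean(axis=0)
_, _, vt = np.linalg.svd(analog - mean_field, full_matrices=False)
eofs = vt[:k]                       # k x n_loc spatial patterns

# A sparse "historical" year: 20 noisy observations of an unseen surface.
truth = mean_field + 3.0 * np.sin(np.pi * grid)
sites = rng.choice(n_loc, size=20, replace=False)
y_obs = truth[sites] + 0.2 * rng.standard_normal(20)

# Classical PCR: least-squares scores on the EOFs at the observed sites,
# then projection back onto the full grid.
scores, *_ = np.linalg.lstsq(eofs[:, sites].T, y_obs - mean_field[sites],
                             rcond=None)
recon = mean_field + scores @ eofs
print(np.abs(recon - truth).mean())  # mean absolute reconstruction error
```

The pPCR extension replaces the fixed `eofs` matrix with latent patterns estimated jointly with the regression, so uncertainty in the empirical EOFs propagates into the reconstruction.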

We introduce traditional PCR in Sect.

A common statistical approach for reconstruction of the historical climate
using current-era analog data is to regress the partially observed historical
observer data onto the current-era analog observations. For a given
reconstruction year, define the regression model

In Eq. (

Prior research suggests that truncation of the trailing principal
components is not always appropriate because the higher-frequency components
are often important predictors

An alternative to variable selection methods like SSVS is penalized
regression
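For intuition, the classical (non-Bayesian) analog of this idea can be sketched with an L1-penalized fit: the penalty shrinks unneeded coefficients to exactly zero, selecting components by importance rather than by order, so a trailing, higher-frequency component can survive where truncation would discard it. The PC scores and coefficients below are simulated for illustration only:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Hypothetical scores on 30 principal components; only a few components,
# including one trailing (higher-frequency) component, actually predict y.
n, p = 80, 30
pcs = rng.standard_normal((n, p))        # stand-in for PC score columns
beta = np.zeros(p)
beta[[0, 1, 17]] = [2.0, -1.5, 1.0]      # a trailing component matters too
y = pcs @ beta + 0.1 * rng.standard_normal(n)

# L1 penalty zeroes out unneeded components instead of truncating by order.
fit = Lasso(alpha=0.05).fit(pcs, y)
kept = np.flatnonzero(fit.coef_)
print(kept)
```

In the Bayesian setting the same behavior arises from a Laplace (LASSO) prior on the regression coefficients, with the penalty strength estimated rather than fixed.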

The

PCR assumes the data

After accounting for the measurement uncertainty in our predictor matrix

The historical observer data were collected using non-standard methods; thus,
there is likely more variability in the data than can be explained by
assuming a Gaussian error distribution. We propose extending
Eq. (
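The practical effect of a heavy-tailed error model can be illustrated by comparing location estimates under Gaussian and Student's t likelihoods on data containing a gross outlier. This is a toy sketch; the degrees of freedom and scale are illustrative choices, not the values used in our models:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)

# Hypothetical mid-day temperatures with one gross outlier, as non-standard
# instruments or transcription errors can produce.
y = np.append(rng.normal(24.0, 1.0, 30), 60.0)

# Gaussian likelihood: the MLE is the sample mean, dragged by the outlier.
gauss_loc = y.mean()

# Student's t likelihood with few degrees of freedom bounds the influence
# of any single observation, down-weighting the outlier.
nu = 3.0
neg_loglik = lambda mu: -stats.t.logpdf(y, df=nu, loc=mu, scale=1.0).sum()
t_loc = optimize.minimize_scalar(neg_loglik, bounds=(0.0, 50.0),
                                 method="bounded").x
print(round(gauss_loc, 2), round(t_loc, 2))
```

The t-based estimate stays near the bulk of the data while the Gaussian estimate is pulled toward the outlier, which is the behavior we want from the robust error model.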

The latent principal components

For each of the eight candidate models, we fit four parallel chains with
random initial conditions, running 20 000 iterations per chain and
discarding the first 10 000 iterations as burn-in. Fitting all eight models
and the associated post-processing took approximately 18 h on a 2014
dual-core 2.6 GHz MacBook Pro with 8 GB RAM. We thinned our chains every 10
iterations to reduce post-processing time, resulting in a total of 4000
samples, and we evaluated model convergence using the
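For reference, a basic (non-split) version of the Gelman-Rubin potential scale reduction factor, a standard convergence diagnostic for parallel chains, can be computed as follows. This is a sketch of the idea, not necessarily the exact diagnostic implementation used here:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an (m chains x n draws) array."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)           # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * w + b / n        # pooled variance estimate
    return float(np.sqrt(var_plus / w))       # ~1 when chains agree

rng = np.random.default_rng(4)
mixed = rng.normal(0.0, 1.0, (4, 1000))                  # well-mixed chains
stuck = mixed + np.array([[0.0], [0.0], [0.0], [3.0]])   # one chain off target
print(round(gelman_rubin(mixed), 3), round(gelman_rubin(stuck), 3))
```

Values near 1 indicate the chains have mixed; a chain stuck away from the target inflates the between-chain variance and pushes the statistic well above 1.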

To evaluate model performance, we apply scoring rules to the estimated
posterior predictive distributions. A highly desirable property of a scoring
rule is propriety

The use of MSPE as a scoring rule implies an

An alternative to MSPE is the CRPS scoring rule. CRPS is proper, utilizes the
full posterior predictive distribution, and allows for a direct comparison of
point predictions and probabilistic predictions

We can estimate the CRPS after obtaining posterior samples
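With posterior predictive draws in hand, the CRPS for a single observation can be estimated by the standard Monte Carlo identity CRPS ≈ E|X − y| − (1/2) E|X − X′|, where X and X′ are independent draws from the predictive distribution. The draws below are simulated for illustration:

```python
import numpy as np

def crps_from_samples(samples, y):
    """Monte Carlo CRPS: E|X - y| - 0.5 E|X - X'| over predictive draws X."""
    samples = np.asarray(samples)
    term1 = np.abs(samples - y).mean()
    term2 = np.abs(samples[:, None] - samples[None, :]).mean()
    return term1 - 0.5 * term2

rng = np.random.default_rng(5)
draws = rng.normal(22.0, 1.0, 2000)   # hypothetical posterior predictive draws
crps = crps_from_samples(draws, 22.0)
print(round(crps, 3))
```

For a unit-width Gaussian predictive distribution evaluated at its mean, the closed-form CRPS is about 0.234, so the Monte Carlo estimate should land close to that value; this gives a quick sanity check on the estimator.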

An alternative is to use the approximate leave-one-out cross-validation
method (LOO;

We use the deviance scale and set
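The importance-sampling idea behind approximate LOO can be sketched for a toy Gaussian-mean model: each posterior draw is reweighted by 1/p(y_i | θ) so that observation i is scored as if it had been held out. PSIS additionally stabilizes these weights; the raw version is shown here for clarity, and all quantities are simulated:

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(6)

# Hypothetical data and posterior draws for a Gaussian-mean model
# (flat prior, known unit standard deviation).
y = rng.normal(20.0, 1.0, 25)
n, s = len(y), 4000
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(n), s)

# Pointwise log-likelihoods: s draws x n observations.
log_lik = stats.norm.logpdf(y[None, :], loc=mu_draws[:, None], scale=1.0)

# Importance-sampling LOO: p(y_i | y_-i) ~= 1 / mean_s(1 / p(y_i | theta_s)),
# computed stably on the log scale via logsumexp.
elpd_i = -(logsumexp(-log_lik, axis=0) - np.log(s))
print(round(elpd_i.sum(), 1))   # approximate expected log predictive density
```

Summing the pointwise terms gives the LOO log score; multiplying by minus two puts it on the deviance scale used in the tables.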

Simulation study showing observed noisy historical observer data

With paleoclimate data, it is difficult to verify the predictive ability of models using cross-validation. With only a handful of observations in the historical observer data available for each year, cross-validation techniques could be highly biased due to the effects of unusual observations in small sample sizes. This is important because we expect noisy and potentially outlying observations in the historical observer data due to the data collection procedures. Additionally, the high dimensionality of the field we aim to reconstruct and the use of computationally intensive MCMC estimation make cross-validation costly. Instead, we conducted a simulation study to explore the different models for the historical observer station data and evaluate model performance using the scoring rules above. Although we do not simulate from the model that is used for estimation, the simulated data represent a reasonable approximation to mid-day July temperature, providing an environment for model testing and exploration of empirical performance.

We simulate mid-day July temperature in one spatial dimension (we extend to
two dimensions using the real data), allowing for faster computation and
easier graphical exploration of the spatio-temporal process. We simulate

We include a spatially correlated random effect

To create observations that match the temporal irregularities and spatial
clustering behavior in the historical observer data, we sample the
one-dimensional spatial field using weighted probabilities that generate
clustered observations in space, storing the simulated temperature
observations at the
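A one-dimensional version of this simulation design can be sketched as follows: a latent temperature transect drawn from a Gaussian process with exponential covariance, then sampled sparsely at sites clustered around a few "fort" neighborhoods. All parameter values are illustrative choices, not those of our actual study:

```python
import numpy as np

rng = np.random.default_rng(7)

# Latent mid-day July temperature on a 1-D transect: a Gaussian process
# with exponential covariance around a hypothetical mean of 22 degrees.
n_loc, phi, sigma2 = 200, 0.1, 2.0
x = np.linspace(0, 1, n_loc)
cov = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / phi)
chol = np.linalg.cholesky(cov + 1e-8 * np.eye(n_loc))
truth = 22.0 + chol @ rng.standard_normal(n_loc)

# Clustered, sparse sampling: weight selection probabilities toward a few
# "fort" neighborhoods so observed sites bunch together in space.
centers = rng.choice(n_loc, size=3, replace=False)
weights = np.exp(-0.5 * ((np.arange(n_loc)[:, None] - centers) / 8.0) ** 2)
weights = weights.sum(axis=1)
sites = rng.choice(n_loc, size=10, replace=False, p=weights / weights.sum())
y_obs = truth[sites] + 0.5 * rng.standard_normal(10)  # measurement noise
print(sites.shape, y_obs.shape)
```

Repeating the draw year by year, with the number and placement of sites varying, mimics the temporal irregularity of the fort record while keeping the true surface available for scoring.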

Simulation experiment scores. Smaller values indicate better model performance.

We compare the performance of each model specification using MSPE, CRPS, and
LOO scoring rules in our simulation where the best model is the one with the
smallest score. We fit the PCR and pPCR models using SSVS and LASSO
regularization with both the Gaussian and robust Student's

Predicted versus simulated temperatures for the robust PCR and robust pPCR
models using LASSO regularization are shown in Fig.

Simulation truth plotted against predicted temperature for the robust PCR LASSO model on the left and the robust pPCR LASSO model on the right. Predictions for each simulated year are given different colors, and the clustering of colors represents annual-scale changes in the mean temperature surface. Results are shown for the LASSO model; the SSVS model performs similarly.

Historical observer reconstruction scores. Smaller values indicate better model performance.

After exploring the model framework using a simulation study, we applied our
models to the historical observer data. We fit the eight models to the data
and present the results from LOO in Table

With the outlier removed, the best predictive models are robust PCR and
robust pPCR due to having the smallest LOO scores (see Table

Historical observer station data LOO Pareto shape estimates with

Reconstruction of mean mid-day July temperature using the robust PCR
model for 4 years. Figures show posterior predictive mean

Reconstruction of the mean mid-day July temperature using the robust
probabilistic PCR model for 4 years. Figures show posterior predictive mean

Posterior predictions of a time series of mid-day July temperature
with associated 95 % credible intervals at Champaign, Illinois; Detroit,
Michigan; Madison, Wisconsin; and Minneapolis, Minnesota

To visualize our results, we plot reconstructions of 4 years of the
historical temperature surfaces using the robust PCR model
(Fig.

By using the spatial structure in the current-era analog data, we generated
temperature predictions at unobserved locations with corresponding
uncertainties. We chose four locations (Champaign, Illinois; Detroit,
Michigan; Madison, Wisconsin; and Minneapolis, Minnesota) and show the time
series of temperature predictions in Fig.

There are many challenges inherent in modeling paleoclimate data. Due to the lack of direct measurements of climate, paleoclimate reconstructions must rely on sparse, noisy proxies of climate. The nuances of paleoclimate data often require specialized modeling techniques and careful investigation into modeling assumptions and performance. In addition, care is needed to properly validate paleoclimate reconstruction skill. In summary, we extended principal component regression methods, applied regularization techniques to choose important principal components, developed robust models to account for the presence of outliers, and explored the use of a probabilistic principal component model to account for measurement uncertainty in the spatially rich current-era analog data. By rigorously evaluating the predictive skill of our models, we were able to explore our extensions of PCR for climate reconstruction, laying the groundwork for future developments with more complex climate data than PRISM temperature surfaces. The models presented in this paper would be good candidates for modeling climate variables that are strongly non-stationary and non-Gaussian (e.g., wind speed or precipitation), but these extensions are the subject of ongoing research.

Within our modeling framework, we presented a simulation study for evaluating paleoclimate reconstructions using proper scoring rules. By using proper scoring rules and exploring model performance in a simulation framework, we have stronger support for the quality of the reconstruction. We presented three statistical scoring rules and explored their strengths and weaknesses. MSPE is a commonly used and easy to understand scoring rule, but is not proper in general and only uses a point prediction, ignoring the probabilistic inference that is gained by using Bayesian techniques. The CRPS is proper and allows for direct comparison of point predictions and probabilistic predictions, but requires out-of-sample validation data or computationally expensive cross-validation. The use of MSPE and CRPS scoring rules allowed for exploration of the empirical properties of the computationally efficient LOO approximation to leave-one-out cross-validation. Our use of LOO to score the historical observer period model predictions not only enabled us to perform model selection, but also aided in diagnosing an outlying observation and refining model fit.

The methods presented in this paper could be applied to other historical datasets at different locations around the world, extending the spatially explicit empirical record of climate further back in time while rigorously accounting for uncertainties. The methods we presented could also be extended to model temperature and precipitation for each month of the year by including a seasonal component in the calibration model and by modeling dynamics at appropriate timescales. Many datasets could be used within this framework as the current-era analog, including modern satellite data. The different current-era analog datasets present a trade-off between the number of records available as analogs and the quality of the data. If rare climate processes occur occasionally, a longer record of climate analogs would be preferred. If the climate processes are relatively stable in time but highly variable in space, a shorter and more precise modern dataset that is not model interpolated might be preferred. In addition, use of highly precise current-era analog data could reduce or eliminate the need to account for measurement error in the current-era analog data.

Ultimately, our temperature reconstructions extend the climatological record in the Upper Midwestern US further into the past. These temperature reconstructions, with their associated uncertainties, can be used to gain better understanding of the influences of climate on the biological and ecological processes observed in the region. By backcasting mean mid-day July temperature with our models, we gain the potential to better understand how climate has changed, and this knowledge could be used to improve future climate reconstructions. Many of the techniques and methods we used – modeling principal components with a probabilistic model, hierarchical pooling to borrow strength among years with sparse and dense data, model selection and regularization, and proper model evaluation – can be adapted and used in future climate reconstruction problems.

The historical observer data are available from

The authors declare that they have no conflict of interest.

This paper was greatly improved based on the comments from the two anonymous
reviewers and the associate editor. The authors also thank Ben Bird, Kristin
Broms, Brian Brost, Franny Buderman, Trevor Hefley, and Henry Scharf for
their conversations and input on this work. This research is based upon work
carried out by the PalEON Project (paleonproject.org) with support from the
National Science Foundation MacroSystems Biology program under grant no.
DEB-1241856. Any use of trade, firm, or product names is for descriptive
purposes only and does not imply endorsement by the U.S. Government. Code and
data found in this paper can be accessed at