Instrumental climate records of the last centuries suffer from multiple breaks due to relocations and changes in measurement techniques. These breaks are detected by relative homogenization algorithms using the difference time series between a candidate and a reference. Modern multiple changepoint methods use a decomposition approach where the segmentation explaining most variance defines the breakpoints, while a stop criterion restricts the number of breaks. In this study a pairwise multiple breakpoint algorithm consisting of these two components is tested with simulated data for a range of signal-to-noise ratios (SNRs) found in monthly temperature station datasets. The results for low SNRs obtained by this algorithm do not differ much from random segmentations; simply increasing the stop criterion to reduce the number of breaks is shown not to be helpful. This can be understood by considering that, in the case of multiple breakpoints, even a random segmentation can explain about half of the break variance. We derive analytical equations for the explained noise and break variance for random and optimal segmentations. From these we conclude that reliable break detection at low but realistic SNRs needs a new approach. The problem is relevant because the uncertainty of the trends of individual stations is shown to be climatologically significant also for these small SNRs. An important side result is a new method to determine the break variance and the number of breaks in a difference time series by studying the explained variance for random break positions. We further discuss the change from monthly to annual scale, which increases the SNR by more than a factor of 3.

Relocations of climate stations or changes in measurement techniques and procedures are known to cause breaks in climate records. Such breaks occur at a frequency of about one per 15 to 20 years and the break sizes are assumed to follow a normal distribution (Menne and Williams Jr., 2005) with a standard deviation of about 0.8 K (Auer et al., 2007; Menne et al., 2009; Brunetti et al., 2006; Caussinus and Mestre, 2004; Della Marta et al., 2004; Venema et al., 2012). It is obvious that a few such breaks have the potential to introduce large errors into the station temperature trends observed during the last century.
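These assumed break statistics can be sketched in a few lines of Python. This is a hedged illustration, not code from the study; the break rate and jump-size standard deviation are the figures quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_break_signal(n_years=100, break_rate=1 / 17, break_sd=0.8):
    """Step-function inhomogeneity signal: in each year a break occurs
    with probability break_rate (about one per 15-20 years) and the
    jump sizes are drawn from N(0, break_sd**2), sd ~ 0.8 K."""
    signal = np.zeros(n_years)
    level = 0.0
    for t in range(n_years):
        if rng.random() < break_rate:
            level += rng.normal(0.0, break_sd)  # a new break
        signal[t] = level
    return signal

sig = simulate_break_signal()
```

A single large jump in such a signal directly biases the century-scale trend of the affected station, which is the problem quantified in Sect. 3.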

Numerous homogenization algorithms exist that aim to detect and correct these breaks. Benchmarking studies (Venema et al., 2012; Williams et al., 2012) analyze the quality of homogenized datasets and as such consider whole homogenization algorithms, whereas in this study we concentrate on the detection only. The overall performance is investigated with simulated data that model as accurately as possible both the natural variability and the statistical properties of the hidden breaks. Venema et al. (2012) presented the results of the COST Action HOME, which tested the skill of the most commonly used methods and state-of-the-art algorithms by a prescribed fixed assessment procedure. Nearly all of them were relative methods that use the difference either to a composite reference or to a neighboring station to reduce the natural variability that would otherwise conceal the breaks.

The concrete implementation of the various methods differs, but the principal design of the methods is similar and comprises either two or three steps. The first step is the detection of breaks. Here, the difference time series is decomposed into subsegments with maximally different means. For pairwise methods, an intermediate step, the so-called attribution, follows, where detected breaks of the difference time series are assigned to one of the involved stations. Finally, the break sizes for each station and break are determined by a comparison with the neighbors. Although a simultaneous correction of all breaks is more accurate (Domonkos et al., 2013), most algorithms analyzed by Venema et al. (2012) correct break by break beginning today and moving backward in time, such as PHA (Menne et al., 2009), iCraddock (Craddock, 1979; Brunetti et al., 2006), AnClim (Štépanek et al., 2009), and the various SNHT variants (Alexandersson and Moberg, 1997) that participated.

HOME recommended five homogenization methods: ACMANT (Domonkos, 2011), PRODIGE (Caussinus and Mestre, 2004), MASH (Szentimrey, 2007, 2008), PHA, and Craddock. These methods have in common that they have been designed to take the inhomogeneity of the reference into account, either by using a pairwise approach (PRODIGE, PHA, Craddock) or by carefully selecting the series for the composite reference (ACMANT, MASH). Furthermore, most of these methods explicitly use a multiple breakpoint approach (ACMANT, PRODIGE, MASH).

As mentioned above, we focus in this study on the break detection of such modern multiple breakpoint methods (Caussinus and Mestre, 2004; Hawkins, 2001; Lu et al., 2010; Picard et al., 2005, 2011). While Lindau and Venema (2016) concentrated on the errors in the positions of the breaks, here we analyze the deviation of the estimated inhomogeneity signal from the true one. We consider the difference time series between two neighboring stations as raw information consisting of two components: the break and the noise series. The climate signal is cancelled out, because it is assumed to be the signal both stations have in common, whereas any local deviation from the climate is treated as noise. The main task of homogenization algorithms is to filter out the break part, which can be considered as a signal. Obviously, the task becomes more difficult for low signal-to-noise ratios (SNRs).

The number of breaks is normally determined by considering the likelihood of falsely detecting a break in white noise (Lindau and Venema, 2013). The key idea of this paper is that both the break and noise variance need to be considered. We will show that about half of the break variance is explainable even by random break positions. If the noise is large, the total maximum variance is often attained when the break positions are set in a way that most of the noise is explained. The large amount of additionally (but just randomly) explained break variance suggests erroneously that significant breaks have been found. The algorithm correctly detects that the series contains inhomogeneities, but the errors in the positions can be large. In this paper, we use a basic detection algorithm, consisting only of the two main components: optimal segmentation and a stop criterion. We test the performance of this multiple breakpoint algorithm concerning the detection and its ability to stop the search using simulated data with the same SNR as is typically found in observed temperature data.

The paper is structured in the following way. Section 2 presents the used
observations, their processing, and the method to determine the best
neighbor in order to build pairs for the difference time series. In Sect. 3 we show that breaks in climate series are indeed a relevant problem in the
real world or at least for the analyzed German climate stations. Section 4
describes the applied break search method. In Sect. 5 we distinguish
between break and noise variance, and derive four formulae, which describe
the behavior of both variance parts (noise and breaks) for two scenarios:
for optimum and arbitrary segmentations. These findings are used in Sect. 6
to estimate the range of SNRs found in real data. In Sect. 7 we use this
range and derive theoretically why we expect that the break search method
must fail for low SNRs. In Sect. 8 we generate simulated data with
realistic SNRs to follow the process of finding the optimum segmentation. A
skill measure to assess the quality of segmentation is presented. For
realistic SNRs of

This study consists mainly of general considerations about the segmentation approach used in homogenization algorithms. Mostly, we will confirm our findings by simulated data with known statistical properties. However, in order to use the correct settings for these properties, we also analyze real observations.

For this purpose, we use data from German climate stations (Kaspar et al., 2013) provided by the DWD (Deutscher Wetterdienst), which report the classical weather parameters, e.g., air temperature, air pressure, and humidity, three times a day. These data are aggregated to monthly resolution. The analysis is restricted to the period 1951 to 2000. This period is expected to have relatively few inhomogeneities and has a high station density. Before 1951, the spatial data density was much lower and nowadays many stations are closing due to funding cuts. In this way, our database comprises 1050 stations with 297 705 average monthly temperature observations in total.

First, normalized monthly anomalies are calculated for each station by
subtracting the mean and dividing by the standard deviation of the monthly
mean temperature (both for the period 1961 to 1990). In this way the
variability of the annual cycle is almost completely removed, which would
otherwise dominate. Using the obtained normalized anomalies we build the
difference time series against the best neighbor station, which needs to be
within a range of 100 km and have at least 200 overlapping monthly
observations. As mentioned above, difference time series are necessary to
cancel out the large natural variability, which would otherwise dominate. We
use two criteria to select the best neighbor.
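The normalization step just described can be sketched as follows. This is a minimal pandas illustration, not the authors' code; the station series are hypothetical, and the neighbor-selection criteria themselves are not reproduced here:

```python
import numpy as np
import pandas as pd

def normalized_anomalies(monthly):
    """Normalize a monthly temperature series month by month: subtract
    the mean and divide by the standard deviation of the 1961-1990 base
    period (per calendar month, which removes the annual cycle in both
    mean and variance, as the text requires)."""
    base = monthly.loc["1961":"1990"]
    clim_mean = base.groupby(base.index.month).mean()
    clim_sd = base.groupby(base.index.month).std()
    months = monthly.index.month
    return (monthly - clim_mean.reindex(months).to_numpy()) \
        / clim_sd.reindex(months).to_numpy()

# difference series against a neighbor (hypothetical Series a and b):
# d = (normalized_anomalies(a) - normalized_anomalies(b)).dropna()
```

Per-calendar-month normalization is an assumption here, but it matches the stated goal that the annual cycle is "almost completely removed".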

Providing reliable secular trends of meteorological parameters is one of the major tasks in climatology (Lindau, 2006). In the following we analyze the DWD station data to show that there are indeed problems in determining the long-term temperature trend, which are obviously caused by inhomogeneities. This section thus serves as motivation that homogenization actually matters. Trends are calculated using linear least squares regression. To make it easier to appraise the climatological relevance of the trends, the anomalies are not normalized by the standard deviation when computing trends.
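The trend computation, together with the standard error under the independence assumption used below, can be sketched generically (this is ordinary least squares, not the authors' code):

```python
import numpy as np

def trend_and_stderr(t_years, y):
    """OLS slope (per year) and its standard error under the
    assumption of independent residuals; multiply by 100 to obtain
    K per century."""
    t = np.asarray(t_years, float)
    y = np.asarray(y, float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (slope * t + intercept)
    s2 = resid @ resid / (len(y) - 2)               # residual variance
    se = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))  # slope standard error
    return slope, se
```

If inhomogeneities make the residuals serially correlated, this standard error is much too small, which is exactly the point made for Fig. 1.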

We start with calculating the trends of the difference time series between two neighboring stations. For the trend analysis we tighten the requirements and take into account only pairs with a fully covered baseline period 1961 to 1990. This reduces the database to 171 642 observations at 316 station pairs.

It is important that the difference between two neighboring stations is considered here. Due to their proximity the climate signal is expected to be similar at both stations and is nearly entirely cancelled out in the difference time series. Therefore, usually we can assume that the difference series consists of noise plus inhomogeneities, so that we can attribute any significant deviation of the trend from zero and any serial correlation of the difference data to the inhomogeneities. This assumption is the basis for relative homogeneity tests. If the assumption that the climate signal is largely cancelled out does not hold true, any true trend remaining in the difference time series will be treated as a gradual inhomogeneity which will then be corrected by mistake. Thus, significant trends in the difference time series of neighboring stations indicate either the presence of inhomogeneities or large problems in applying relative homogenization at all. For Germany the former is likely the case.

Figure 1 shows the difference time series of monthly temperature anomalies
between Aachen and Essen for the period 1950 to 2000. The 2-year running mean
shows a distinct deviation from zero in the late 1970s, which already
provides some indication of an inhomogeneity. The most striking feature,
however, is the strong linear trend of 0.564 K per century, which is not
expected for homogeneous difference time series. Assuming independence, i.e.,
no inhomogeneities, the standard error in the trend estimation is as low as
0.074 K per century.

Difference time series of the monthly temperature anomaly between
stations Aachen and Essen. The two stations are separated by 92 km and their
correlation coefficient for monthly means is 0.989. The thick line denotes
the 2-year running mean; the thin line is the linear trend. The error
variance of the trend assuming (probably erroneously) independent data is
calculated by

We repeated the procedure for 316 station pairs in Germany (Fig. 2). The
short vertical lines give the (much too low) uncertainty assuming temporal
independence of data. The horizontal line in the middle denotes the mean over
all stations. The upper and lower lines show the actually obtained standard
deviation of all station pair trends. The mean trend is near zero, as
expected for differences. The sign is not relevant here, as simply exchanging
the order within the pairs would reverse it. The interesting feature is the
standard deviation of the difference trends, representing the true
uncertainty of the trends. Averaged over all the station pairs in the German
climate network, the trend in the difference series is
0.628 K per century.

Trends for the difference time series of 316 station pairs (from 1050 stations) in Germany. The standard deviation is 0.628 K per century, much higher than expected for homogeneous independent data (short vertical lines).

As both stations of a pair can contribute to the inhomogeneities, the standard
deviation contributed by one station equals the standard deviation of the
difference series divided by the square root of 2. Thus, the trend error
caused by inhomogeneities in one station data series is about
0.4 K per century.
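The arithmetic of this step is simply:

```python
import math

# if both stations of a pair contribute independent inhomogeneity
# trends, the per-station contribution is the pair value over sqrt(2)
pair_trend_sd = 0.628             # K per century, std over the 316 pairs
single_station = pair_trend_sd / math.sqrt(2)
print(round(single_station, 2))   # 0.44
```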

In order to identify the obviously existing inhomogeneities, we use the
following break search method. We split the considered difference time series
(of length

However, the maximum external variance

The difference time series of a climate station pair can be regarded as two superimposed signals: first, the pure break signal, modeled as a step function; and second, short-term variance produced by weather variability and random observation errors, which both lead to random differences between the stations. The break signal is the signal to be detected. Assuming it to be a step function is a good approximation even in the case of gradual inhomogeneities, which in the presence of noise can be modeled well by several steps; see the PRODIGE contributions in Venema et al. (2012). The noise stems from measurement errors and differences in the local weather. The weather noise can be autocorrelated, for example in regions influenced by El Niño, but in most cases these correlations are low and do not influence the result much. The HOME benchmarking study used both independent and autocorrelated simulated data; the autocorrelated data were more difficult to homogenize, but not much. For simplicity we thus apply the common assumption that the noise is independent. For both the break and the noise part we will give formulae that describe the external variance (i.e., the variance explained by the segmentation). This is done twice, for random and for optimum segmentations, so that we finally obtain four formulae.
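A minimal simulation of such a two-component difference series, and of how the explained (external) variance of a test segmentation splits into a break part and a noise part, might look as follows (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def segment_means(x, breakpoints):
    """Replace every segment of the tested segmentation by its mean."""
    edges = [0] + sorted(breakpoints) + [len(x)]
    fit = np.empty(len(x))
    for a, b in zip(edges[:-1], edges[1:]):
        fit[a:b] = x[a:b].mean()
    return fit

# simulated difference series: step-function breaks plus white noise
n = 100
pos = np.sort(rng.choice(np.arange(1, n), size=7, replace=False))
edges = np.concatenate(([0], pos, [n]))
breaks = np.empty(n)
for a, b in zip(edges[:-1], edges[1:]):
    breaks[a:b] = rng.normal(0.0, 1.0)  # break sd 1
noise = rng.normal(0.0, 2.0, n)         # noise sd 2, i.e. SNR = 0.5
series = breaks + noise

# for any test segmentation, evaluate the external variance on the
# full series and on each component separately (they differ by a
# cross term, usually small but nonzero)
test_breaks = [20, 50, 80]
ext_total = segment_means(series, test_breaks).var()
ext_break = segment_means(breaks, test_breaks).var()
ext_noise = segment_means(noise, test_breaks).var()
```

Evaluating the same segmentation on the break-only and noise-only components is how the four cases (random/optimum segmentation of break/noise part) can be followed separately in simulations.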

Lindau and Venema (2013) discussed the noise part of the variance in detail.
Assuming Gaussian white noise with the variance

External variance as a function of the tested number of breaks for
random (0) and optimum segmentations (

However, in break search algorithms the optimum segmentation (e.g., derived
by optimal partitioning) is relevant rather than the mean behavior. It is
obvious that the best of a huge number of segmentations is able to explain
more than an average or random segmentation. Lindau and Venema (2013) found
that the external variance of the optimum segmentation grows with
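The optimum segmentation of this kind is typically computed by dynamic programming (optimal partitioning). The following is a minimal sketch, not the authors' implementation: it places k breaks so that the within-segment sum of squares is minimal, which is equivalent to maximizing the external variance:

```python
import numpy as np

def optimal_partitioning(x, k_breaks):
    """Dynamic programme over prefix sums: cost[s, j] is the minimal
    residual sum of squares when x[:j] is cut into s segments."""
    x = np.asarray(x, float)
    n = len(x)
    c1 = np.concatenate(([0.0], np.cumsum(x)))
    c2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(i, j):  # residual sum of squares of segment x[i:j]
        s = c1[j] - c1[i]
        return (c2[j] - c2[i]) - s * s / (j - i)

    k = k_breaks + 1  # number of segments
    cost = np.full((k + 1, n + 1), np.inf)
    prev = np.zeros((k + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for seg in range(1, k + 1):
        for j in range(seg, n + 1):
            best, arg = np.inf, seg - 1
            for i in range(seg - 1, j):
                c = cost[seg - 1, i] + sse(i, j)
                if c < best:
                    best, arg = c, i
            cost[seg, j], prev[seg, j] = best, arg
    # back-track the break positions
    bps, j = [], n
    for seg in range(k, 0, -1):
        j = prev[seg, j]
        if seg > 1:
            bps.append(j)
    return sorted(bps), cost[k, n]
```

The cost is O(k n^2), which is unproblematic for series of length 100.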

So far we have discussed the behavior of the noise part of the time series.
The next important question is how the signal or break part behaves. For pure
breaks without noise we assume a step function with a constant value between
two adjacent breaks (Fig. 4). Tested segment averages are the weighted means
of such (few) constant periods. This is a similar situation to random noise,
only that fewer independent data are underlying. Obviously, the number of
breaks

An example of a noise-free time series containing seven true breaks
(thin step line) is tested with three random breaks denoted by the thick
vertical lines. The resulting averages for each test segment are given by
thick horizontal lines. Their variation defines the explained variance. At
the bottom of each test segment the number of independent values is given.
The total number of independent values is equal to 11, or, more generally,

Empirical estimates of the external variance as a function of the
normalized number of tested breaks for random (0) and optimum (

To check our assumption, we used simulated data of length

A further difference
occurs for the random segmentations of the break signal. Please compare the
two lower curves in Figs. 3 and 5. In contrast to the formula found for the
noise part

The results of the simulations are given as crosses and circles in Fig. 5, confirming the validity of Eqs. (7) and (8), both of which are shown as curves for comparison.

In the following we provide an interpretation of Eq. (8) and an explanation
why it differs from Eq. (5). We start with Eq. (5), which states that in case
of pure noise the external variance decreases with increasing

Let us now use this finding to interpret Eq. (8). In a time series containing
only breaks and no noise, there are originally

With Eqs. (5) to (8), we are able to describe the growth of four types of
external variance as a function of the tested break number

The break variance

We take an observed difference time series and test how much variance is
explained by randomly inserted breaks. This is performed by calculating the
external variance

Estimation of break variance (0.226) and number (3) for the station pair Ellwangen-Rindelbach minus Crailsheim-Alexandersreut.

The technical details we used to fit Eq. (10) to the data are the following.
We search for the minimum in

The above-described procedure, so far shown in Fig. 6 for only one station
pair, is now applied to 443 station pairs in Germany (Fig. 7). In some cases
the algorithm yields negative values for the normalized break variance. For
stations without any inhomogeneity the correct retrieval would be zero break
variance. Small random fluctuations of the noise variance can cause a
spurious small value for the estimated break variance which may randomly be
either positive or negative. Thus, these results are indeed unphysical, but
difficult to avoid in statistical approaches with error-affected output.
Omitting or setting them simply to zero would bias the result, so that we
included them without correction when means over all stations are calculated.
On average about six breaks are detected. Please note that we analyzed the
difference time series of two stations, so that the breaks of both stations
add up and twice the single-station number arises here. For a single
station, about three breaks in 50 years is the correct number, which is in
good agreement with the break frequency
(one per 15 to 20 years) found by Menne and Williams (2005). The second
target parameter, the break variance fraction, is given on the ordinate of
Fig. 7 and is equal to about 0.2 when averaged over all station pairs. Thus,
the mean ratio of break and noise variance can be estimated to
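The arithmetic behind this estimate is straightforward, assuming the SNR is defined as the ratio of break to noise standard deviations:

```python
# a break-variance fraction of about 0.2 of the total variance implies
# a break/noise variance ratio of 0.2 / 0.8 = 0.25 and hence an SNR
# (ratio of standard deviations) of sqrt(0.25) = 0.5
frac = 0.2
snr = (frac / (1.0 - frac)) ** 0.5
print(snr)  # 0.5
```

This value of 0.5 is the monthly-scale SNR quoted for the German network in the conclusions.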

Estimation of the break variance and the number of breaks for monthly mean temperature. The estimation is given for 443 different German station pairs. The vertical line denotes the mean break number (6.1) found for all stations, and the horizontal lines mark zero variance and the average explained normalized variance (0.22), respectively.

In statistical homogenization a range of SNRs will occur. For many climate
parameters the SNR can be expected to be even smaller than

With Eqs. (5) to (8) we have a tool to retrace the process of break detection
in a theoretical manner. Key parameters are the signal-to-noise ratio

Applying a reasonable break search algorithm for an increasing number of

However, break search algorithms always contain a stop criterion, which may
in this case reject any segmentation at all, in this way preventing these
wrong solutions. The argument of the Caussinus–Lyazrhi stop criterion given
in Eq. (4) becomes zero for
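The exact form of Eq. (4) is not reproduced in this excerpt; the sketch below uses a commonly cited form of the Caussinus-Lyazrhi criterion (Caussinus and Mestre, 2004), which should be read as an assumption rather than a verbatim restatement. The no-break solution scores exactly zero, so a k-break segmentation is accepted only if its criterion value is negative:

```python
import math

def caussinus_lyazrhi(ext_var, tot_var, k, n):
    """Commonly cited form of the Caussinus-Lyazrhi criterion for k
    breaks in a series of length n: log of the unexplained variance
    fraction plus a penalty growing with k."""
    return math.log(1.0 - ext_var / tot_var) + 2.0 * k * math.log(n) / (n - 1)
```

A segmentation that explains much variance with few breaks drives the first term strongly negative; many breaks explaining little variance are rejected by the penalty term.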

So far, we considered only the two extreme cases. On the one hand the
completely wrong solution, which decomposes the noise optimally and gains
some extra break variance just by chance; and on the other hand the
completely correct solution, where it is vice versa. We showed that the false
combination explains on average more variance and is therefore preferred in
the discussed case where the SNR is as small as

For this purpose, we created an ensemble with 1000 simulated time series of
length 100 with seven breaks at random positions. Both noise and break
variance are Gaussian, with a magnitude of unity for the breaks and 9 times
larger for the noise, so that the signal-to-noise ratio is

A second feature in Fig. 9a needs to be discussed. The total explained variance (4) is larger than the sum of the explained break and noise variance (3). As the best segmentation (in terms of explained total variance) is always chosen, solutions dominate in which the (external) break and noise variances are slightly correlated. In order to explain a maximum of variance it is advantageous to cut the time series in such a way that both break and noise variance are high, so that correlated segmentations are preferred. These correlations further enhance the average total explained variance. This effect is, however, apparently not strong enough to exceed the stop criterion, given by the upper thin line in Fig. 9a.

However, so far we have considered only the means over 1000 realizations. The individual solutions vary, so that the threshold is often exceeded, at least for low break numbers. In Fig. 9b we show only curve (4), the total explained variance, but add as whiskers the 1st and 9th deciles to give an impression of the variability of the solutions.

In Sect. 6 we found that the SNR is less than

In multiple breakpoint methods the maximum explained variance determines the position of the breaks. In this section we want to make the case that this variance may not be a good measure at low SNRs, while it works well for large SNRs.

Illustration of the skill measure M2, the mean squared deviation
between estimated and true break signals.

Consider the difference time series of two neighboring stations. One part
consists of the inhomogeneities that we want to detect. This time series
component is the signal. Figure 10a shows a simulated example of such a time
series. We inserted seven breaks with a standard normal distribution at
random positions. In reality, the detection of the breaks is hampered by
superimposed noise, which is caused by observation errors and different
weather at the two stations. To simulate this, we added random noise (Fig. 10b) with a standard deviation of 2, which corresponds to a SNR of

M1: the total variance explained by test breaks in the noisy data;

M2: the mean squared deviation between the estimated and true signals.
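The two measures can be sketched for a given test segmentation as follows. This is a hedged illustration; both signals are taken as anomalies, so that the zero-break solution yields M2 equal to the true signal variance, matching the behavior described for Fig. 11f:

```python
import numpy as np

def skill_measures(series, true_signal, breakpoints):
    """M1: external variance of the noisy series explained by the test
    segmentation. M2: mean squared deviation between the estimated and
    the true break signal (both as anomalies)."""
    series = np.asarray(series, float)
    true_signal = np.asarray(true_signal, float)
    edges = [0] + sorted(breakpoints) + [len(series)]
    est = np.empty(len(series))
    for a, b in zip(edges[:-1], edges[1:]):
        est[a:b] = series[a:b].mean()  # estimated break signal
    m1 = est.var()
    m2 = np.mean(((est - est.mean()) - (true_signal - true_signal.mean())) ** 2)
    return m1, m2
```

M1 is what the search maximizes; M2 is the true skill, which is only available for simulated data where the inserted signal is known.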

As a first approach, we take one simulated time series (length 100 including
seven true breaks at random positions with SNR

Mean squared signal deviation (M2) against explained variance of the
noisy data (M1). These two measures are given for random segmentations of
simulated time series of length 100 with SNR

In Fig. 11b we repeat the exercise for 100 instead of only 1 time series, so as not to depend on a single time series with possibly extraordinary features. The solution cloud shows again that the correlation between M1 and M2 is low. Now we have 100 circles and 100 crosses for the maximum explained variance and the truly best solutions, respectively. For better visibility the rest of the cloud is omitted in Fig. 11c. It shows that the circles are generally located higher than the crosses, indicating their lower skill.

In the next step, we increase the number from 100 to 1000 series. Thus, we now create 1000 time series and test each of them with 1000 random break combinations, always consisting of seven breaks (Fig. 11d). Here, only the circles are shown, i.e., the normally proposed solutions, determined by the maximum explained variance. The mean of the explained variance over all 1000 of these maxima is 1.546. The corresponding true skill is defined by the position on the ordinate, which is on average 0.881. We can conclude that for a simplistic search (best of 1000 random trials) the explained variance is higher than the originally inserted one (1.546 vs. 1.0), and that the error index (0.881) does not differ substantially from the signal variance actually contained in the series (1.0), which would indicate no skill at all.

In the next step we use optimal partitioning (for now without any stop criterion) to find the maximum explained variance (Fig. 11e), instead of choosing the best of 1000 random trials. The explained variance increases, as this method is of course more powerful; it is now as high as 2.093. However, the mean signal deviation also increases, from 0.881 (Fig. 11d) to 1.278. Such a value, larger than 1, indicates that this is worse than doing nothing. In some sense, we may conclude that the simplistic search (best of 1000) performs better than the sophisticated technique of optimal partitioning. This result can be explained as follows. Optimal partitioning indeed provides the optimum result for the maximum variance, but this parameter is only loosely coupled to the true skill. Due to the presence of noise, the estimated signal has a much too large variance and is at the same time only weakly correlated with the true signal, so that the deviation between the two signals, M2, increases further. The underlying reason for this worsening is that, up to now, we did not include the normally used stop criterion. Instead we searched for the best solution with seven breaks, corresponding to the true number hidden in the time series.

Consequently, we finally added the stop criterion given in Eq. (4). Thus, we
finally applied the full search method described in Sect. 4, which consists
of two steps: first, the maximum external variance is determined by optimal
partitioning for all possible break numbers; then a stop criterion is used to
determine the correct break number (Fig. 11f). Thanks to the stop criterion,
only in 9.8 % of the cases is the final error higher than the original one
(M2

In Fig. 11f the mean signal deviation attains, at 0.716, a value again below the no-performance threshold of 1. Also, the mean explained variance decreases to 0.952. This is due to the introduced stop criterion, which enables the algorithm to produce solutions with fewer breaks. The zero solutions without breaks are concentrated at the point (0.0; 1.0), because no variance is explained and the signal deviation is as large as the signal variance itself. However, at 0.716 the mean skill is still poor.

To decide whether this skill is better than random, we use again the
simulated data discussed above. As in Fig. 11f, we calculate for each of the
1000 random time series the mean squared deviation between the true and
estimated signals for the search method. But now this search skill is
compared to the same measure obtained for a random decomposition (of the same
time series) by seven breaks. The resulting 1000 data pairs of search skill
with its counterpart from a random decomposition are given in Fig. 12a. Zero
break solutions (only produced by the actual search method) can be identified
as a marked horizontal line at

We thus showed that a signal-to-noise ratio of

Thus, for SNR

So far we have considered SNRs of

To study the relationship between SNR and break detection skill, we repeated
the calculations for different SNRs between 0.3 and 1.9. The residual errors
M2 for both random segmentation (circles) and search method (crosses) are
given in Fig. 13b. As a guide for the eye, two thick connecting lines are
drawn. The standard deviations of the two skills resulting from 1000
repetitions are given by vertical whiskers. At SNR

However, the SNR is not a fixed characteristic for a given dataset. It
depends on the temporal resolution in which the data are used. Aggregating
monthly data to yearly resolution, e.g., will increase the SNR but decrease
the sample size. Both SNR and sample size affect the detection power (see the
next paragraph for the latter effect). The reason is that the effect on the
break variance part remains small, whereas the noise part decreases rapidly
by averaging over 12 months: under the (reasonable) assumption that the noise
part of the difference time series is weakly correlated, the variance is
reduced by a factor of 12. To estimate the reduction of break variance, we
can use Eq. (8), setting
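The scaling argument can be summarized numerically (illustrative arithmetic only):

```python
import math

# noise: averaging 12 weakly correlated monthly values divides the
# noise variance by ~12, i.e. the noise standard deviation by sqrt(12)
noise_sd_factor = math.sqrt(12)    # ~3.46

# breaks: the step function survives annual averaging almost unchanged
# (only the year containing each break is smoothed), so the break sd
# shrinks only slightly; the net SNR gain is therefore somewhat below
# 3.46 but still a factor of about 3 (monthly 0.5 -> annual ~1.5)
print(round(noise_sd_factor, 2))   # 3.46
```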

These considerations suggest that SNR and series length are mutually
dependent. Averaging monthly data to yearly resolution reduces the length
formally by a factor of 12, while increasing vice versa the SNR by a factor
of about

To study this conjecture we investigate the influence of series length on the
detection skill by repeating the calculations for varying lengths from 100 to
1200, while holding the SNR constant. Figure 13c shows the result for
SNR

For a given SNR there is a minimum series length for a reliable analysis. Periodically redoing the changepoint analysis can improve the results, as the series length usually grows each year.

Multiple breakpoint homogenization algorithms identify breaks by searching for the optimum segmentation, explaining the maximum of variance. In order to assess the performance of this procedure, we decomposed the total variance of a difference time series into two parts: the break and the noise variance. Additionally, we distinguished between the optimal and arbitrary segmentations. In Eqs. (5) to (8) we give formulae for all four cases, describing how the explained variance grows when more and more breaks are assumed.

With this concept it is possible to determine the SNR of a time series in
advance without having to apply a statistical homogenization algorithm. For
the monthly mean temperature of German climate stations, which are
characterized by a high station density, we found a mean SNR of 0.5, which
corresponds to a SNR on annual scale of about 1.5. Even for such small
inhomogeneities, the inhomogeneity-induced spurious trend differences between
neighboring stations are strong and homogenization is important. The
signal-to-noise ratio in earlier periods and non-industrialized countries
will often be lower due to sparser networks. Also, other statistical
properties than the mean (Zhang et al., 2011) will in general have lower
correlations and more noise. For the monthly standard deviation of
temperature, we found a SNR of

For SNRs below 1, the multiple breakpoint search algorithm fails under
typical conditions assuming seven breaks within a time series of length 100.
The reason is that random segmentations are able to explain a considerable
fraction of the break variance (Eq. 8). If the tested number of breaks is
comparable to the true one, they explain about one-half. Consequently, the
estimated breaks are not set according to the true breaks, but to positions
where a maximum of noise is explained. In this way, the explained noise part is
increased by a factor of 5. Thus, if the noise is large, systematic noise
explanation is unfortunately the best variance maximizing strategy. At the
same time the signal part is small and its explained variance decreases in
return only by a factor of 2, i.e., from 1 for the optimum to

Considering the time series of inhomogeneities as a signal that shall be
detected, we define the mean squared difference between the true and
estimated signals as a skill measure. In the case of simulated data, this
measure can be compared to the explained variance, which is used normally to
select the best segmentation. While for high SNR the two measures are well
correlated, their correlation is weak for a SNR of

Simply strengthening the stop criterion to suppress the majority of the wrong solutions is shown not to be helpful. The presented new method to estimate the break variance and the number of breaks might be useful for a future, better stop criterion.

If the SNR becomes larger than 0.5, the situation improves rapidly and above a SNR of 1 break detection performs reasonably. The SNR of the HOME benchmark dataset was on average 1.18 for monthly data (corresponding to about 3.5 for annual data), but the break sizes were found to be too high in the validation of the benchmark (Venema et al., 2012). Our study confirms that a good performance of the tested homogenization method can be expected under such circumstances. For lower SNRs, as we found them in German climate stations on monthly resolution, the results may differ.

In future, the joint influence of break and noise variance on other break detection methods should be studied. One would expect that also other methods would need to take the SNR into account. However, the validation study of several objective homogenization methods by Domonkos (2013) shows that while the multiple breakpoint detection method of PRODIGE is best for high SNR, for low SNR many single breakpoint detection methods are obviously more robust and perform better.

Furthermore, the influence of the signal-to-noise ratio on full homogenization methods, including also the data correction, should be tested. The benchmarking study of HOME has shown that there is no strong relationship between detection scores and climatologically important validation measures, such as trend error or root mean square error. For example, PRODIGE was here among the best methods, but performed only averagely with respect to the detection scores. Thus, the consequences for climatologically important error measures are not trivially obvious.

This study finds that SNR and series length are connected. For sample sizes of 100, it is important to achieve a SNR above one. It would thus be worthwhile to develop methods that reduce the noise level of the difference time series. And of course, in case of low SNRs the use of metadata on the station history will be particularly valuable.

Finally, this study shows that future validation studies should use a (realistic) range of SNRs. The International Surface Temperature Initiative aims at computing the uncertainties remaining in homogenized data and will perform a benchmarking mimicking the global observational network, which therefore includes a realistic range of SNRs (Willett et al., 2014; Thorne et al., 2011).

For legal reasons we cannot publish the exact observations
used in this study, but they are available on request. A newer dataset with
partly different stations is freely available in the Climate Data Centre of
the Deutscher Wetterdienst (

The supplement related to this article is available online at:

The authors declare that they have no conflict of interest.

The work was funded by the Deutsche Forschungsgemeinschaft by grant DFG LI 2830/1-1.

Edited by: Francis Zwiers
Reviewed by: Michele Rienzner and four anonymous referees