Air mass classification has become an important area in synoptic climatology,
simplifying the complexity of the atmosphere by dividing it into discrete,
thermodynamically similar patterns. However, the constant growth of
atmospheric databases in both size and complexity calls for the development
of new adaptive classifications. Here, we propose a robust unsupervised and
supervised classification methodology of a large thermodynamic dataset, on a
global scale and over several years, into discrete air mass groups
homogeneous in both temperature and humidity that also provides underlying
probability laws. Temperature and humidity at different pressure levels are
aggregated into a set of cumulative distribution function (CDF) values
rather than being used as classical numerical values. The method is based on a Gaussian mixture model
and uses the expectation–maximization (EM) algorithm to estimate the
parameters of the mixture. Spatially gridded thermodynamic profiles come from
ECMWF reanalyses spanning the period 2000–2009. Different aspects are
investigated, such as the sensitivity of the classification process to both
temporal and spatial samplings of the training dataset. Comparisons of the
classifications made either by the EM algorithm or by the widely used

Contemporary synoptic climatology can be seen as a methodological perspective of climatology that creates and/or uses a classification of atmospheric variables at nearly any spatial or temporal scale to either simplify the climate system into a manageable set of discrete states or to gain a better understanding of how atmospheric variability impacts any climate-related outcome (Lee and Sheridan, 2015). A good overview of synoptic climatology as well as examples of studies can be found in Yarnal (1993), Yarnal et al. (2001), Barry and Perry (2001), Huth et al. (2008), Philipp et al. (2010) or Sheridan and Lee (2013).

Among the various classifications, air masses classically refer to large volumes of air which are fairly horizontally uniform with respect to temperature and humidity at any given altitude. Their thermodynamic features are related to the condition of the sea, land or ice beneath them (Crowe, 1971). Such a definition varies somewhat from the traditional description of Bergeron (1930), which defines continental/maritime, polar/tropical (cP, cT, mP, mT) air masses according to the surface properties of their source regions (Bergeron, 1930; Willett, 1933). Air masses are therefore characterized essentially by their thermodynamic character through various temperature and humidity variables (Kalkstein et al., 1996) defined at several lower- and mid-tropospheric pressure levels, possibly also including surface variables or even vertical profiles (Huth et al., 2008). However, additional variables, such as dynamic ones (e.g. sea level pressure, wind field), could be added to characterize missing dynamic behaviours, as in Živković (1995).

This paper focuses on the classification of a large dataset of “atmospheric columns” homogeneous in both temperature and humidity on a global spatial grid and at different times. Such entities are closely related to the notion of air mass and, as such, will be referred to by this term from now on. They are of great importance, in particular in inverse problems where climate variables as well as the three-dimensional structure of the atmosphere are estimated from satellite data via inverse radiative transfer models (e.g. Chédin and Scott, 1985; Scott et al., 1999). Such models need to be initialized with a priori information such as thermodynamic profiles and surface variables. For that purpose, real profiles and variables are usually considered, in particular those coming from radiosonde reports, as in the Thermodynamic Initial Guess Retrieval (TIGR) dataset (Chédin et al., 1985; Chevallier et al., 2000). With the technological advance of satellite instruments, such databases need to be enhanced in both sampling and air mass classification. Furthermore, adding prior information directly to the data leads to Bayesian statistics, in which knowledge of the distribution of each statistical variable is essential, hence the probabilistic point of view adopted here.

Over the past years, the use of symbolic data has gained popularity (Bock and Diday, 2000; Diday, 2001; Floriana and Diday, 2003; Billard and Diday, 2012), since, like principal component analysis (PCA), they turn large databases into summarized data of manageable size while keeping useful information through a process commonly called “data aggregation” or “data compression”. This process is often a necessary pre-processing step before any classification procedure. Symbolic data change the way in which the description of the data is viewed, since they refer to data which do not only contain values, as classical data do, but also have a structure and include internal variations. This is the case with distributions, considered by Schweizer (1984) as “the numbers of the future”, such as probability density functions (PDFs) or cumulative distribution functions (CDFs).

In the present work, CDFs are used in combination with a probabilistic
classification method based on a Gaussian mixture model that has been used
for example in Vrac et al. (2007), Rust et al. (2010) or Carreau and
Vrac (2011). Such a model relies on the assumption that observations from a
given dataset come from several sub-populations and that the overall
population can therefore be modelled as a Gaussian finite mixture model, or
in other words a mixture of weighted PDFs, each one corresponding to a given
sub-population (Fraley and Raftery, 2002). The main problem consists then in
estimating the parameters of the mixture so that the model best fits the
data, which can be done through two different approaches: (i) the
“estimation approach” focuses on the estimation of the mixture model
parameters usually using maximum likelihood estimation techniques. The most
efficient algorithms rely on the iterative expectation–maximization (EM)
algorithm (Dempster et al., 1977; McLachlan and Krishnan, 2008). An optional
partition can then be obtained by applying the maximum a posteriori
principle; (ii) the “clustering approach” focuses directly on grouping the
entities for classification into a number of clusters such that each cluster
can be seen as a sub-population with a given PDF. In that case, the
algorithms generally used rely on the so-called dynamical clustering
algorithms (Diday et al., 1974), applicable to a multivariate Gaussian
mixture (Symons, 1981; Celeux et al., 1989). These algorithms were also
combined with the EM algorithm to develop a variant of the latter, the
classification EM (CEM) algorithm (Celeux and Govaert, 1992). Another
clustering approach is the widely used

In Vrac (2002) and Vrac et al. (2005, 2011), a mixture of copulas was proposed, providing not only a partition of the atmosphere into air mass classes, but also a probabilistic model describing the classes as well as the dependencies between and among temperature and humidity through the so-called copula functions (Schweizer and Sklar, 1983; Nelsen, 1999; Diday and Vrac, 2005). The thermodynamic vertical profiles characterizing each atmospheric situation were first aggregated into four CDF values (two each for temperature and specific humidity). The classification method applied to these four statistical variables was then an extension of the mixture model problem to these distribution-valued data, so that multidimensional copulas could be used. The algorithm chosen for this purpose, initialized by a partition based on seven zonal clusters, was a dynamical clustering method (clustering approach). As a first step, the results were validated only on a limited dataset of 1 winter day and for short-range projections not exceeding 1 month.

Principle of the methodology used in this paper.

Setting aside copulas, this paper aims at consolidating the results of
Vrac (2002) and Vrac et al. (2005, 2011) by proposing a robust air mass
classification method based on a Gaussian mixture model and the EM algorithm
(estimation approach). This method is able to deal with a much larger dataset
covering a decade and provides longer-range projections without having to
use arbitrary a priori information as an initialization strategy in the
absence of a prior reference atmospheric column type classification. The
larger spatio-temporal coverage of the dataset considered here enabled us to
consider more statistical variables for a better description of the
troposphere, while studying more thoroughly different aspects that had not
previously been taken into account, such as sensitivity to both temporal
and spatial samplings of the training dataset, consistency with the number of
clusters or comparison with a

The article is organized as follows. Section 2 presents the data to classify, the pre-processing data compression step, the classification method and the choice of the optimal configuration of the model. Section 3 discusses different aspects regarding the quality of the classifications. Section 4 focuses on the interest of the a posteriori probabilities provided by the EM algorithm, and introduces an interpretation of the classes based on a decision tree. Section 5 concludes with a discussion.

Figure 1 outlines the methodology used in this paper. The goal is first to build air mass clusters from atmospheric situations characterized by temperature and humidity probabilistic data only, along with probabilities of the situations belonging to each cluster. The clusters have to be as coherent as possible in terms of temperature and humidity (e.g. hot and wet, or cold and dry air masses) and as different as possible from each other. To achieve this, a representative training dataset of thermodynamic profiles is first turned into a manageable set of CDF values. The latter are used as input into the EM algorithm to obtain a partition of the atmospheric situations as well as a probabilistic description of each group (cluster) of the partition, without having any prior reference classification (a process called unsupervised classification, clustering or cluster analysis). The information contained in this probabilistic description can then be used in a second step to identify to which of the existing groups (classes) any new atmospheric situation belongs, on the basis of the training dataset whose group assignment is already known (a process called supervised classification or simply classification when explicitly compared to clustering).

The atmospheric dataset used in this study is based on ERA-Interim global
atmospheric reanalyses from the European Centre for Medium-Range Weather
Forecasts (ECMWF) covering the period from 1 January 2000 to 31 December 2009
(Dee et al., 2011). The different daily products (e.g. surface pressure,
surface temperature, temperature and specific humidity profiles) are
available on a 0.75

Total column water vapour in precipitable centimetres

Unless explicitly mentioned, the set of observed data corresponding to 00:00
and 12:00 UTC of the 15th day of January, April, July and October from 2005
to 2009 will be used in this article and referred to as “training dataset”.
Each synoptic hour gathers 241

Humidity can be measured by various variables. Here, specific humidity, dew
point temperature as well as total column water vapour will be considered.
They correspond to the ratio of the mass of water vapour to the mass of dry
air plus water vapour (expressed in kg kg

In order to have data homogeneous with temperature, specific humidity from the ERA-Interim reanalyses has been converted into dew point temperature, using a relation that can be deduced from Buck (1981), so that each atmospheric situation is characterized by temperature and dew point temperature profiles only. The first 38 temperature values (from the surface to 67 hPa) and the first 30 dew point temperature values (from the surface to 230 hPa) of each profile are kept in order to capture the thermodynamic properties of the troposphere. Figure 2 illustrates these thermodynamic variables by representing the total column water vapour and the average temperature between the 47th and 37th sigma levels (800 and 425 hPa over sea, respectively). The atmospheric situations associated with elevations above sea level higher than 1 km have been discarded, which corresponds here to 13 % of the situations. This pre-filtering puts aside the question of whether the atmospheric situations corresponding to high elevations require specific handling.
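This conversion step can be sketched as follows, assuming one common form of Buck's (1981) saturation vapour pressure formula over water (temperatures in °C, pressures in hPa); the function names are ours:

```python
import math

def vapour_pressure(q, p):
    """Water vapour partial pressure e (hPa) from specific humidity
    q (kg/kg) and total pressure p (hPa), with eps = Rd/Rv = 0.622."""
    return q * p / (0.622 + 0.378 * q)

def dew_point(q, p):
    """Dew point temperature (deg C) obtained by inverting Buck's (1981)
    saturation vapour pressure formula over water:
        e_s(T) = 6.1121 * exp(17.502 T / (240.97 + T))."""
    e = vapour_pressure(q, p)
    x = math.log(e / 6.1121)
    return 240.97 * x / (17.502 - x)

def saturation_q(t_c, p):
    """Specific humidity of saturated air at temperature t_c (deg C)."""
    e_s = 6.1121 * math.exp(17.502 * t_c / (240.97 + t_c))
    return 0.622 * e_s / (p - 0.378 * e_s)

# Saturated air should have a dew point equal to its temperature:
q_sat = saturation_q(20.0, 1000.0)
print(round(dew_point(q_sat, 1000.0), 2))  # → 20.0
```

The round trip on saturated air provides a quick sanity check that the inversion is consistent with the forward formula.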

From numerical data to CDF values. The parts of the thermodynamic
profiles (temperature ones here) of given atmospheric situations (four
typical examples here, denoted by A, B, C and D), shown in

The approach proposed here consists in working with CDFs instead of using
classical numerical values of temperature and dew point temperature at
different pressure levels. A cumulative distribution function

Figure 3 illustrates the transformation process from thermodynamic profiles
to CDFs for four example temperature profiles (the method remains the same
with dew point temperature profiles). The selected part of the profiles is
converted into a PDF first and then into a CDF from which a discrete set of
key CDF values is selected as being the most representative information of
the temperature (or dew point temperature) vertical profiles (data
aggregation step). Here, since the general shape of temperature and humidity
distributions is not known a priori, the conversion into PDF is performed
using the non-parametric Rosenblatt–Parzen kernel density estimation (KDE)
method (Rosenblatt, 1956; Parzen, 1962), which can be seen as a weighted sum
of

A priori knowledge is often essential to choose the relevant variables (here, the CDF values) able to distinguish the intrinsic groups in a given dataset. Such a problem is known as “feature selection” or “variable selection”. Some numerical techniques exist in the literature to help make a choice (e.g. Diday and Vrac, 2005; Vrac et al., 2011; Pudil et al., 1994; Raftery and Dean, 2006), but they prove not really relevant for the datasets studied here. Examining the diversity of the profiles leads us to subjectively select CDF values at several temperatures from 200 to 290 K, and likewise for dew point temperature. A set of 10 values within this interval, one every 10 K, seems a good compromise between keeping enough information and reducing the number of dimensions as much as possible.
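The aggregation of a profile into these key CDF values can be sketched as follows, assuming SciPy's Gaussian kernel density estimator as a stand-in for the Rosenblatt–Parzen estimator (function and variable names are ours):

```python
import numpy as np
from scipy.stats import gaussian_kde

def profile_to_cdf_values(profile_k, grid=np.arange(200.0, 291.0, 10.0)):
    """Aggregate one temperature (or dew point temperature) profile,
    given as an array of values in kelvin, into CDF values F(t)
    evaluated at the key temperatures 200, 210, ..., 290 K."""
    kde = gaussian_kde(profile_k)          # Rosenblatt-Parzen PDF estimate
    # F(t) = integral of the estimated PDF from -infinity to t
    return np.array([kde.integrate_box_1d(-np.inf, t) for t in grid])

# Example: a crude tropospheric temperature profile (surface to upper levels)
profile = np.linspace(290.0, 220.0, 38)
cdf_values = profile_to_cdf_values(profile)
print(np.round(cdf_values, 2))  # 10 non-decreasing values in [0, 1]
```

Each atmospheric situation thus yields 10 temperature and 10 dew point temperature CDF values, i.e. a 20-dimensional feature vector.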

Finally, the database used as input data into the clustering algorithms (EM
or

Model-based clustering relies on the assumption that a given observed dataset
contains several sub-populations and that the overall population can
therefore be modelled as a finite mixture model. Usually, and as done here, a
Gaussian mixture model is used, that is, a mixture of weighted Gaussian PDFs,
each one associated with a given cluster related to a sub-population. Let

The main problem consists then in estimating the mixture model parameters

Result of a 20-dimensional multivariate seven-cluster unsupervised
classification process via EM-VII (isotropic dispersion which can vary
between the clusters) projected in a 2-dimensional subspace, that is, for the
variables

The EM algorithm alternates between the E-step and the M-step. At each
iteration, the parameters

In order not to use arbitrary a priori information, the initialization
strategy used in this paper consists in repeating

As in Banfield and Raftery (1993) and Celeux and Govaert (1995), each covariance matrix

Among the three covariance matrix models which do not lead here to poor
structures in terms of air mass patterns (too much zonal structure or a
preponderance of one cluster over the others), only two models, known as
hyperspherical models, will be studied in this paper: the model denoted VII
assumes an isotropic dispersion which can vary between the clusters, whereas
EII differs from the previous model only by constraining the dispersion to be
equal between the clusters. The expressions of the
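In the nomenclature of Banfield and Raftery (1993) and Celeux and Govaert (1995), these two hyperspherical models correspond, to the best of our reading, to the parameterizations

```latex
% VII: isotropic clusters whose dispersions \lambda_k may differ
\Sigma_k = \lambda_k \, \mathbf{I}, \qquad k = 1, \dots, K,
% EII: isotropic clusters constrained to a common dispersion \lambda
\Sigma_k = \lambda \, \mathbf{I}, \qquad k = 1, \dots, K,
```

where $\mathbf{I}$ is the identity matrix and $K$ the number of clusters.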

One of the most difficult problems in unsupervised classification is the determination of the optimal number of clusters (unless already known), which must be fixed before performing the clustering process. When the number of clusters is not apparent from prior knowledge, many methods have been established over the years to help make a suitable choice. Several criteria have been tested (e.g. Akaike, 1973; Schwarz, 1978; Raftery, 1995; Hardy, 1996, 2006; Gordon, 1999; Biernacki et al., 2000), including the approximate weight of evidence (AWE) adopted in Vrac et al. (2005, 2011). Some of them are based on maximizing the log-likelihood to which a penalization term is added, depending on the number of independent parameters to estimate for the model selected (the covariance matrix model, here), but all of them lead to similar results. In the end, none of the criteria is able to distinguish which covariance matrix model or number of clusters suits the present data, so that their choice remains subjective here.
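As an example of such a penalized criterion, the Bayesian information criterion (Schwarz, 1978) can be written

```latex
\mathrm{BIC} = -2 \ln \hat{L} + \nu \ln n,
```

where $\hat{L}$ is the maximized likelihood of the mixture, $\nu$ the number of independent parameters to estimate (which depends on both the covariance matrix model and the number of clusters) and $n$ the number of atmospheric situations; with this sign convention, the model with the lowest value is preferred.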

Average and associated standard deviation profiles for temperature (left) and dew point temperature (right) for each cluster obtained with EM-VII. The black line indicates the upper pressure level kept for computing the cumulative distribution functions.

Main features of the air mass clusters. The percentages have been computed after discarding the high relief atmospheric situations, so that each sum of a given row equals 100 %.

Following Vrac et al. (2005, 2011), the number of clusters is fixed to seven. This choice was motivated by the fact that an odd number of clusters is a priori expected to take into account the natural difference in the mid-latitude and polar air masses between summer and winter (hence, at least four clusters) while favouring a kind of symmetry around the Equator with more than one air mass for the tropics (hence, at least three additional classes).

In this section, unsupervised classification is applied to the training dataset with no spatial sampling. The situations associated with elevations above sea level higher than 1 km will be referred to as air mass 0 from now on to indicate that they have been discarded (Sect. 2.1). The seven resulting clusters are ordered from the average hottest surface temperature to the average coldest one, that is, globally from a tropical air mass (1) to a polar one (7), and can be thermodynamically correlated with the maps shown in Fig. 2. The features of each cluster can be represented for example by their mean and standard deviation profiles for temperature and dew point temperature shown in Fig. 5, and also by the total column water vapour mean and mid-tropospheric layer average temperature (here, 800–320 hPa) listed in Table 1. This table also contains the percentage of situations per cluster for the whole period on which the clustering has been performed, after discarding the high relief situations (Sect. 2.1). Figure 6 shows partitions resulting from EM-VII (a) and from EM-EII (b). These results are discussed in the following sections.

Seven-cluster unsupervised classification with the EM-VII model

The clusters shown in Fig. 6a present relevant thermodynamic homogeneous areas: three tropical/sub-tropical hot air masses which are distinguished essentially by humidity, that is, very wet (1), wet (2) and relatively wet (3) ones; one temperate air mass mixing warm to cool, relatively wet to dry atmospheric situations (4); and three sub-polar/polar air masses corresponding respectively to a relatively cold and dry air mass including northern summer situations (5), a cold and dry one (6) and finally a winter frigid, very dry one (7).

As confirmed by Fig. 5, polar air masses are characterized, as expected, by a
higher variability in temperature and humidity, while air mass 4 acts as a
transition cluster between tropical/sub-tropical air masses and
sub-polar/polar ones. As for air mass 3, it is associated with a strong dew
point vertical gradient in the middle troposphere reflecting areas of dry air
subsidence in both hemispheres between 20 and 35

Comparing these air mass maps to Fig. 2 shows some similarities, particularly
regarding the distribution of the total column water vapour regardless of the
amount of humidity. Tropical situations are precisely depicted, as shown for
instance by humid incursion of air masses 1 and 2 into the drier air mass 3,
spiralling clockwise towards the centre of a depression in the southern
Pacific Ocean between

Percentage of observations per cluster on the whole band of latitude
and per 10

The hottest and wettest air mass cluster 1, in particular, closely follows
the Intertropical Convergence Zone (ITCZ). The latter consists of hot, very
wet air masses converging under the trade winds, involving very high
lower-tropospheric temperatures as well as convective systems producing
large-scale thunderstorms when the surface is also wet (oceans, tropical
forests). The slight seasonal shift of the ITCZ location is then visible,
moving towards the Tropic of Cancer in northern summer and towards the
Tropic of Capricorn in northern winter, since the belt of maximum
temperatures migrates as the Earth orbits the Sun. This is also illustrated
in Fig. 7, which shows the percentages of observations per cluster, or
corresponding to high relief (white colour), for the whole band of latitude
(top bar) or per 10

As in Vrac et al. (2005), the discrimination between the air masses as well as their features can also be illustrated by plotting the temperature (and dew point temperature) PDFs representing the distribution of the thermodynamic variable at a given sigma pressure level for each air mass cluster (not shown). This reveals, for example, the behaviour of the first two tropical air masses seen in Figs. 5 and 6a (i.e. overlapping temperature PDFs corresponding to air masses 1 and 2, with well-distinguished dew point temperature PDFs). It also shows that the result of the clustering procedure is mainly due to the mid- and lower troposphere and, to a lesser extent, to the tropopause, since the discrimination between clusters decreases at higher altitudes. This explains the lower temperature variabilities in the lower and mid-troposphere and the higher temperature variabilities around the tropopause observed in Fig. 5.

The second possible kind of classification leading to relevant air masses is
obtained with EM-EII, whose partitions are nearly identical to those obtained
with the widely used

As shown in Fig. 6, the resulting partitions coming from the two covariance
matrix models (EM-EII and EM-VII) are quite different. An arrow diagram (not
shown) close to those described in Huth (1996) indicates that EM-EII air mass

The choice between these two models will be made after studying the sensitivity of the clustering to the choice of the spatio-temporal sampling of the dataset on which the clustering is applied.

A good quality clustering should be relatively insensitive to changes in sample size or spatial and temporal sampling. As mentioned in Sect. 2.1, each synoptic hour (UTC) gathers different local hours spatially distributed over the Earth. Therefore, sensitivity to the temporal sampling is expected to be lower than its spatial counterpart. The former is then studied before the latter.

Depending on the choice of the temporal sampling from which the atmospheric situations are selected for the clustering process, air masses may be significantly different. In order to know how many years, how many months a year, and so on, should be used, a sensitivity study must be performed. Studies within the period 2000 to 2009 show that the resulting partitions are similar as soon as 4 months representative of each season are used. However, 2003 is an exception, for which the features of air masses 3 and 4 are significantly different from those corresponding to the other years (not shown). More generally, in order to avoid possible singular partitions due to specific thermodynamic features over the years, the training dataset will contain 2 synoptic hours, 1 day, 4 months and 5 years, hence the choice of the training dataset mentioned in Sect. 2.1.

The temporal sampling being adopted, the sensitivity of the clustering to the
choice of the spatial sampling is now studied. The latter is characterized
not only by its longitude/latitude spatial sampling step, but also by its
starting grid point whose choice may also alter the resulting partition. A
spatial sampling step

Clustering sensitivity to spatial sampling step and starting grid
point through misclassification rates (%) for EM-EII (

To evaluate the impact of decreasing the spatial sampling, misclassification
rates
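A misclassification rate between two partitions of the same situations can be computed after optimally matching the cluster labels; a sketch using the Hungarian algorithm from SciPy (helper names are ours):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misclassification_rate(labels_a, labels_b):
    """Fraction of situations whose cluster differs between two
    partitions, after relabelling the clusters of partition B so as
    to maximize the agreement with partition A."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    K = max(labels_a.max(), labels_b.max()) + 1
    # Contingency table: situations in cluster i of A and cluster j of B
    c = np.zeros((K, K), dtype=int)
    np.add.at(c, (labels_a, labels_b), 1)
    row, col = linear_sum_assignment(-c)   # maximize total agreement
    return 1.0 - c[row, col].sum() / labels_a.size

a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 0]   # same partition up to relabelling, one mismatch
print(round(misclassification_rate(a, b), 3))  # → 0.167
```

This makes the rate invariant to the arbitrary numbering of the clusters produced by independent runs of the algorithm.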

It should be noted that the choice of the sampling in both time and space based on the results found in the present Sect. 3.2 does not change the results presented in Sect. 2.3.3 relating to the choice of the covariance matrix model and of the number of clusters.

From now on, EM-VII will be used since it relies on better physical assumptions (Sect. 2.3.3), depicts tropical air masses more accurately (Sect. 3.1) and is less sensitive to the choice of the spatial sampling (Sect. 3.2.2) than EM-EII.

Since no criterion was able to help us select the optimal number of air mass clusters, it has been subjectively fixed to seven in this paper (Sect. 2.3.3). However, such a choice may seem rather arbitrary, especially as air mass 4 acts in fact as a coarse transition cluster between tropical/sub-tropical and sub-polar/polar (Sect. 3.1.2). We now focus on the evolution of the classification with the number of clusters.

Dealing with eight clusters involves the separation of the previous air mass
4 associated with the seven-cluster partition, denoted 4

For an easier comparison of partitions for two different numbers of clusters, an arrow diagram can be used such as the one shown in Fig. 10. This figure reflects the fact that the classification is rather consistent with the number of clusters, meaning that a successive increase in the number of clusters (caused by a change in this pre-set parameter) leads to the division of a rather small set of clusters while not changing the other ones, alternating tropical/temperate air masses and polar ones at each successive increase.

Unsupervised classification with EM-VII into 8

Arrow diagram illustrating the correspondence between the

Geographical zoom on 15 July 2006 (00:00 UTC). Supervised
classification map

Decision tree obtained from the seven-cluster partition built with the training database and EM-VII.

In particular, the classification is stable from 7 to 8 and from 12 to 13
clusters, which indicates at least two suitable ranges for the number of
clusters to consider as priorities. Moving from 7/8 to 12/13 is
straightforward. For example, transition air masses 4

As expected, if the clustering is found to be rather consistent with the number of clusters, the mixture model can hardly reach the perfect consistency provided, by definition, by hierarchical clustering (Huth, 1996; Huth et al., 2008). Even if some numbers of clusters can be chosen as priorities based on the previous figure, confirming our initial choice (seven clusters), the quality of the classification is such that the choice of the number of clusters mainly depends on the intended objective.

In the following sections, the supervised classification of the atmospheric
situations corresponding to a given synoptic hour is obtained by using the
mixture model parameters estimated via unsupervised classification of the
training dataset with a spatial sampling step of 3.75

Figure 11a shows examples of supervised classification maps for the 15th day of January and July at 00:00 UTC for 1 year outside the 5-year time period of the training dataset, that is, 2002, for which the difference between the supervised classification and the unsupervised classification (using in that case the period 2000–2004 instead of 2005–2009 for the training dataset) is also shown. It is important to notice that the air mass patterns are similar to those resulting from unsupervised classification on the same day, which is expected since the air masses retain their essential features as long as they are not sensitive to the spatial and temporal richness of the training dataset. According to Fig. 11b, misclassified situations are mainly located in the narrow regions between the air mass clusters. While this property is verified at the temporal scales studied in this paper, it may not hold for studies over longer periods spanning several decades.
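Given the mixture parameters estimated on the training dataset, this supervised step amounts to a maximum a posteriori rule; a minimal sketch for a spherical-covariance mixture (function and parameter names are ours), which also yields the error probability discussed below:

```python
import numpy as np

def map_classify(X, pi, mu, var):
    """Assign each new situation (rows of X) to the class with the
    highest posterior probability under a fitted spherical Gaussian
    mixture (proportions pi, means mu, isotropic variances var).
    Also returns the error probability 1 - max_k p(k | x)."""
    n, d = X.shape
    K = len(pi)
    log_p = np.empty((n, K))
    for k in range(K):
        sq = ((X - mu[k]) ** 2).sum(axis=1)
        log_p[:, k] = (np.log(pi[k])
                       - 0.5 * d * np.log(2.0 * np.pi * var[k])
                       - 0.5 * sq / var[k])
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)
    return post.argmax(axis=1), 1.0 - post.max(axis=1)

# Two toy classes centred at (0, 0) and (8, 8), equal weights, unit variance
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [8.0, 8.0]])
var = np.array([1.0, 1.0])
X_new = np.array([[0.1, -0.2], [7.5, 8.3], [4.0, 4.0]])
labels, err = map_classify(X_new, pi, mu, var)
print(labels, np.round(err, 2))  # the midpoint has error probability 0.5
```

Situations far from the class boundaries receive error probabilities close to zero, whereas situations at the transition between classes approach the equiprobable case.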

In contrast to the

Non-zero error probabilities are located at the transition between different
air masses where they take the highest values, but only a few situations are
involved, meaning that air mass classes are rather well separated. Besides, a
plot of the competitiveness index

The boundary regions separating the different air masses, across which the
thermodynamic conditions change rapidly, correspond to meteorological fronts
whose 1 to 3

A plot representing the percentage of atmospheric situations against the number of classes per error probability step (every 0.1, for example) confirms not only that the number of situations associated with a high error probability is low, but also that this number slightly increases with the number of classes, since there are more boundary regions between air masses. Furthermore, it shows that the ranges of 7/8 and 12/13 classes mentioned in Sect. 3.3 (as well as 16/17) are associated with a slight decrease in error probability, although the corresponding value is too low to indicate any optimal number of classes (not shown).

Error probabilities can be used for adding one or several transition classes associated with a low level of confidence or for keeping observations whose classification is considered sufficiently representative of the corresponding class features to be associated with some meteorological phenomena. Probability distributions can also be used to enhance a priori information in remote sensing applications, for example.

Another interesting way to interpret a partition is to build a decision tree,
that is, a supervised classification method in the form of a tree structure
separating a dataset into smaller and smaller subsets through some decision
rules given a partition known a priori. The goal of such a tree is to predict
the value of the variable to be explained (here, the class to which a given
observation belongs) given a subset of input explanatory variables (here, the

Separations between the nodes have been performed via maximal impurity reduction, using the Gini index as the impurity function. This means that the tree tries to build nodes that are as pure as possible, i.e. containing observations from as few clusters of the reference partition as possible. In order to compare the classes obtained from the decision tree to the reference partition, the tree has been pruned to seven terminal nodes. This is done by setting a complexity parameter measuring the “cost” of adding another explanatory variable among the 20 possible ones in the model underlying the decision tree. For more technical details, see Therneau and Atkinson (2015). Pruning the tree implies that we may not use the same set of explanatory variables as previously used in the EM algorithm.
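The Gini impurity and the impurity reduction of a candidate split can be sketched as follows (a generic illustration, independent of the implementation of Therneau and Atkinson, 2015; function names are ours):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2, zero for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def impurity_reduction(x, labels, threshold):
    """Impurity reduction obtained by splitting a node on 'x <= threshold'."""
    x = np.asarray(x)
    labels = np.asarray(labels)
    left, right = labels[x <= threshold], labels[x > threshold]
    n = labels.size
    return gini(labels) - (left.size / n) * gini(left) \
                        - (right.size / n) * gini(right)

# A CDF value that perfectly separates two clusters
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(gini(labels), impurity_reduction(x, labels, 0.5))  # → 0.5 0.5
```

At each node the tree retains the explanatory variable and threshold maximizing this reduction.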

Figure 13 shows the classification tree obtained by considering as a priori
probabilities of belonging to each cluster the mixture proportions of the
reference partition obtained via unsupervised classification with EM-VII on
the training dataset. These mixture proportions are similar to those listed
in the fourth line of Table 1 for EM-VII. In Fig. 13, CDF values

The most striking feature appearing in Fig. 13 is that temperature is used
first to separate the atmospheric situations into two main groups, i.e.
polar and sub-polar air masses on the one hand, and temperate, sub-tropical
and tropical ones on the other hand. Humidity is then used to make the
remaining separations in order to obtain a seven-class partition. This
confirms the findings described in Sect. 3 (high correlation between air
mass clustering and humidity). From the decision tree, temperature variables
could be considered less important than humidity ones, in terms of the
respective number of variables used to build the tree. However, removing the
temperature CDF value used as the first variable, i.e.

Misclassification rates, i.e. one minus the highest proportion of observations belonging to each cluster, are particularly low. For instance, considering the bottom right group as an air mass class implies that the latter would contain here 17 546 observations coming at 93 % from cluster 7 of the reference partition and at 7 % from cluster 6, and would be associated with a misclassification rate of 7 %. These low values indicate that the clustering process used to create the reference partition is efficient and robust and that resulting partitions make sense.

The same study with EM-EII shows that the resulting decision tree alternates temperature and humidity decision rules (not shown): temperature seems to have a higher importance in the partitioning with EM-EII than with EM-VII, which explains the difference between both types of classification.

In this paper, a methodology for unsupervised and supervised classifications of various and large atmospheric databases into distinct air masses has been proposed and applied to thermodynamic profiles (temperature and dew point temperature) from ECMWF reanalyses. These three-dimensional data are gridded in latitude, longitude and vertical layers, homogeneously distributed over the Earth, and span the period 2000–2009. This methodology follows a similar probabilistic point of view considered by Vrac et al. (2005, 2011) through a different approach to the problem of mixture models (estimation approach against clustering one previously). It relies (i) on a probabilistic classification method based on a multivariate Gaussian mixture model whose parameters are estimated via maximum likelihood estimation by the expectation–maximization (EM) algorithm; and (ii) on the use of probabilistic data: classical thermodynamic values at different pressure levels are converted into a set of cumulative distribution function (CDF) values whose number represents the number of statistical variables needed to characterize each atmospheric situation. This data compression step implies a description of the data different from the common ones, giving information on the vertical distribution of the temperature and dew point temperature values regardless of the successive pressure levels.

In Vrac et al. (2005, 2011), (i) a limited set of observations consisting of only 1 winter day was used as a training dataset for classifying new data through projections not exceeding 1 month; (ii) only four statistical variables were used to characterize each atmospheric situation; (iii) an initial partition based on seven subjective zonal clusters homogeneous in temperature and humidity was used. Such choices were not enough to steadily characterize air masses at any time and any location on large temporal scales. To overcome this problem, several updates have been implemented. First, a much larger set of observations has been selected as a training dataset in order to take into account their high variability, that is, 2 synoptic hours of the central day of 4 months representative of each season for a period covering 5 years. Second, each atmospheric situation has been characterized by a substantially higher number of statistical variables for a better thermodynamical description of the profiles: 10 CDF values for temperature, and 10 for dew point temperature. And third, an initialization strategy for EM based on the use of a suitable random initial partition has been adopted to avoid the use of arbitrarily chosen prior information.

Furthermore, 14 models adding different constraints (or not) to the structure of the covariance matrices and thus to the dispersion of the observations have been studied, since dealing with the unconstrained model does not provide representative partitions. Several criteria have been tested as a selection criterion for both the covariance matrix model and the number of clusters. However, no optimal number of clusters emerges from their evolution. Hence, following Vrac et al. (2005, 2011), seven clusters have been subjectively chosen.

If most of the covariance matrix models imply either too much zonal structure
or a preponderance of one air mass class over the other ones, three of them
lead to relevant air mass spatial regions. These three models are
distinguished by a different relative influence of temperature and humidity
on the classification process, as shown by the use of a decision tree for
helping in the interpretation of the resulting clusters. For instance, the
two models EII and VII assume either equal isotropic dispersion between the
clusters (equivalent in fact to the widely used

The proposed method shows low temporal and spatial sensitivity to the choice
of the training dataset. Within the period 2000–2009, we have not only shown
that the partitions are similar as soon as the training dataset contains 1
day of 4 months representative of each season and spans several years, but
also that a 3.75

Our method complies with the five properties introduced by Huth (1996) and
Huth et al. (2008) to assess the quality of a classification: (i) the method
reproduces expected patterns known to exist in the data, as low-pressure
systems or the traditional winter continental polar (cP) air mass; (ii) it
shows little sensitivity in time and space to the choice of the training
dataset, both in terms of observation selection and size; (iii) it shows
neither high equability (clusters tending to be equal in size) nor low
equability (a huge cluster accompanied by small ones, called the snowballing
effect); (iv) it makes a good distinction between clusters since the boundary
regions separating the air masses, associated with high uncertainty, present
narrow extents not exceeding 3

Based on temperature and dew point temperature variables, the proposed classification method is applicable to most atmospheric datasets used by the atmospheric science community, such as radiosonde measurements, meteorological reanalyses or satellite data. Depending on the intended objective, other variables could also be considered, especially dynamic variables to help monitor air mass movement, such as potential vorticity, which is commonly used for weather analysis (e.g. Emanuel, 2008). An important feature of this method consists in providing probabilistic information, which can be used to provide the uncertainties associated with the classes or improving a priori information in many atmospheric applications such as in remote sensing. Finally, through the evolution of the classes and their associated probabilities along several decades, the method could be easily adapted to evaluate general circulation models and study climate variability and potential changes at different spatial and temporal scales.

The temperature and specific humidity profiles as well as the surface
temperatures and pressures used in this study come from ERA-Interim global
atmospheric reanalyses (ECMWF, 2016) and can be downloaded for example from

The 14 covariance matrix models.

Here we give more details on the eigenvalue decomposition of the covariance
matrices described in Banfield and Raftery (1993) and Celeux and Govaert (1995). Each covariance matrix

Such decomposition leads to 14 parsimonious models (column 1 in the table below) depending on whether some assumptions about the structure of the covariance matrices are added or not.

These models can be classified into three families (column 2): the hyperspherical models (isotropic dispersion), the hyperdiagonal models (coordinate axis-aligned orientation) and the hyperellipsoidal models (free orientation).

These models can be simply indicated by three sequential letters (column 3)
corresponding to the three attributes characterizing the dispersion of the
mixture component distributions, that is, the hypervolume, the shape and the
orientation of their isocontour in the multidimensional space, providing an
easy geometric interpretation of the models. Each letter indicates whether
the corresponding attribute is equal (

For illustrative purposes, the typical isocontours of the mixture component distributions are commonly drawn in a two-dimensional subspace, where the hypervolume, shape and orientation features then correspond respectively to the surface, the major and minor axis ratios, and the orientation of the major axis of the elliptic isocontours. In the case of the two hyperspherical models, elliptic isocontours are reduced to circles.