2.3.1 Gaussian mixture models

Gaussian mixture models assume that the patterns origin from a probability density
*p*() (Bishop, 1995). This density is a linear combination of Gaussian functions
*p*(| *j*),

The normalization constant

The density
*p*() is a weighted sum of the local densities
*p*(| *j*),

p() = P(j)p(| j) . |
(2.17) |

To normalize

The goal of the mixture model is to find the unknown parameters , , and the priors *P*(*j*) for each unit *j* such that the likelihood,
*L* = *p*(), to obtain the distribution
{} given the density
*p*() is maximal (Bishop, 1995). To solve this optimization problem it is common to use a variant of the expectation maximization algorithm (Bishop, 1995). It consists of two steps with iterate until convergence is reached. In the expectation step, the soft-assignment
*P*(*j*|) for all *j* and *i* is computed based on a given estimate of the parameters , , and *P*(*j*).
*P*(*j*|) is called posterior probability. It is computed using Bayes' theorem (see appendix A.1),

In the special case of uniform Gaussians that all have the same width and weight

In the maximization step, the Gaussian's parameters , , and *P*(*j*) that maximize the likelihood given all
*P*(*j*|) can be directly computed (Bishop, 1995). The result is that the center is the weighted mean of the set
{},

= , | (2.19) |

and the matrix is the inverse of the weighted covariance matrix ,

The inverse can be computed by extracting all eigenvectors of . Thus, the axes of the mentioned hyper-ellipsoid are the principal components of the local data distribution. The size of this hyper-ellipsoid is given by the eigenvalues from the PCA (the semi-axis length of unit

P(j) = P(j|) . |
(2.21) |

It can be shown that alternating these expectation and maximization steps increases the likelihood *L* in each iteration step (Bishop, 1995). However, local maxima are not avoided. Further, single isolated data points (outliers) can make the algorithm unstable (Archambeau et al., 2003). If just one pattern is assigned to a unit (that is, the other patterns have almost zero
*P*(*j*|)) the variance of the local Gaussian collapses to zero. As an improvement to the local minima problem, annealing schemes (as discussed for the vector quantization, section 2.2) were suggested (Albrecht et al., 2000; Meinicke and Ritter, 2001; Meinicke, 2000). Here, a global variance linked to the width of each Gaussian is gradually reduced during annealing.

The Gaussian mixture model as it is presented here has the disadvantage that all eigenvectors of the local covariance matrix need to be extracted. However, this problem can be overcome if probabilistic PCA is used instead of standard PCA.

2005-03-22