2.2 Vector quantization
Vector quantization describes a pattern set using a reduced number of so-called `code-book' vectors. We assume again that $n$ training patterns $\mathbf{x}_i \in \mathbb{R}^d$ are given. Let $m < n$ be the number of code-book vectors $\mathbf{w}_j \in \mathbb{R}^d$. In the final state, each training pattern is assigned to one code-book vector. The optimal positions of the code-book vectors are usually obtained by finding the minimum of the sum $E$ of squared distances between each code-book vector and its assigned patterns,

$E = \sum_{i=1}^{n} \sum_{j=1}^{m} P(j\,|\,\mathbf{x}_i)\, \|\mathbf{x}_i - \mathbf{w}_j\|^2$ .   (2.12)

$P(j\,|\,\mathbf{x}_i)$ is the probability that $\mathbf{x}_i$ belongs to $\mathbf{w}_j$. For the final state, $P(j\,|\,\mathbf{x}_i) = 1$ if $\mathbf{x}_i$ is assigned to $\mathbf{w}_j$, and $P(j\,|\,\mathbf{x}_i) = 0$ otherwise. A pattern is assigned to the code-book vector that has the smallest Euclidean distance to that pattern. Thus, the code-book vectors induce a Voronoi tessellation of space (figure 2.2). In each of the separated regions, the position of the code-book vector is the center-of-mass of the local pattern distribution (otherwise (2.12) cannot be minimal).
Figure 2.2:
Code-book vectors (crosses) and the resulting Voronoi tessellation (dashed lines are boundaries, equidistant from two respective code-book vectors). Training patterns are drawn as gray dots.
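As a concrete illustration of (2.12) and the hard assignment rule, the following sketch (in Python/NumPy, not part of the original text; the function and variable names are my own) computes, for given patterns and code-book vectors, the nearest-code-book-vector assignments and the resulting error $E$.

import numpy as np

def quantization_error(X, W):
    # Error E of (2.12) under hard assignments.
    # X : (n, d) array of training patterns x_i
    # W : (m, d) array of code-book vectors w_j
    # Squared Euclidean distance between every pattern and every code-book vector
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # shape (n, m)
    # Hard assignment: P(j|x_i) = 1 for the nearest code-book vector, 0 otherwise
    assignment = d2.argmin(axis=1)                            # shape (n,)
    E = d2[np.arange(len(X)), assignment].sum()
    return E, assignment

# Small example with arbitrary data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # n = 200 training patterns in the plane
W = rng.normal(size=(5, 2))     # m = 5 code-book vectors
E, assignment = quantization_error(X, W)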
The difficulty in finding the optimal $\{\mathbf{w}_j\}$ is that $E$ has many local minima. No general solution exists. Instead, various iterative algorithms exist that find approximate solutions. The algorithms can be divided into those that use hard-clustering and those that use soft-clustering. In hard-clustering, $P(j\,|\,\mathbf{x}_i)$ is binary (throughout the iteration), and each $\mathbf{x}_i$ can only be assigned to one code-book vector. In soft-clustering, $P(j\,|\,\mathbf{x}_i)$ can take any value in the interval $[0;1]$.
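To make this distinction concrete, here is a minimal sketch (mine, not from the text) of the two kinds of assignment. The Gibbs/softmax form of the soft assignment, with a `temperature' T, is only one possible choice; it appears, for example, in deterministic annealing.

import numpy as np

def hard_assignments(X, W):
    # P(j|x_i) = 1 for the nearest code-book vector, 0 otherwise
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    P = np.zeros_like(d2)
    P[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return P

def soft_assignments(X, W, T=1.0):
    # P(j|x_i) in [0;1]; here a Gibbs distribution over squared distances
    # with temperature T (an illustrative choice, not the only one)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / T
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

Both functions return an (n, m) matrix whose rows sum to one; in the hard case each row contains a single 1.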
The algorithms can be further divided into on-line and batch versions. On-line versions update code-book vectors in each iteration step based on just one (randomly drawn) training pattern (as for the self-organizing map, section 1.5.5). The update rule is usually written as
$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \varepsilon(t)\, P(j\,|\,\mathbf{x}) \left[ \mathbf{x} - \mathbf{w}_j(t) \right]$ .   (2.13)

The change of $\mathbf{w}_j$ is in the direction of the negative gradient of (2.12). $\varepsilon(t)$ is a learning rate, which can depend on time. In contrast, batch versions use all training patterns for each step. Here, the algorithm alternates between computing all $P(j\,|\,\mathbf{x}_i)$ based on a given distribution $\{\mathbf{w}_j\}$, and optimizing all $\mathbf{w}_j$ given $P(j\,|\,\mathbf{x}_i)$. This algorithm is a variant of the expectation-maximization algorithm (Dempster et al., 1977). The `maximization' step is a minimization of the error $E$ (which maximizes the likelihood, see section 2.3.1).
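The two iteration schemes can be sketched as follows (again my own illustration, with an arbitrary exponential learning-rate schedule; with hard assignments the batch version reduces to k-means, described in the next section).

import numpy as np

def online_step(W, x, t, eps0=0.5, tau=100.0):
    # One on-line step in the sense of (2.13) with a hard assignment:
    # only the winning code-book vector is moved towards the drawn pattern x.
    eps = eps0 * np.exp(-t / tau)               # example time-dependent learning rate
    j = ((W - x) ** 2).sum(axis=1).argmin()     # winner: P(j|x) = 1, all others 0
    W[j] += eps * (x - W[j])
    return W

def batch_vq(X, W, n_iter=50):
    # Batch version: alternate between computing the hard assignments P(j|x_i)
    # for fixed code-book vectors and moving each code-book vector to the
    # center-of-mass of its assigned patterns (which minimizes E for fixed assignments).
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        assignment = d2.argmin(axis=1)
        for j in range(len(W)):
            members = X[assignment == j]
            if len(members) > 0:                # leave empty code-book vectors unchanged
                W[j] = members.mean(axis=0)
    return W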
The following sections describe the hard-clustering algorithm `k-means' and provide more details about soft-clustering algorithms. Two examples are described: `deterministic annealing' and `Neural Gas'.
Heiko Hoffmann
2005-03-22