

2.2 Vector quantization

Vector quantization describes a pattern set using a reduced number of so-called `code-book' vectors. We assume again that $n$ training patterns $\mathbf{x}_i \in \mathbb{R}^d$ are given. Let $m < n$ be the number of code-book vectors $\mathbf{c}_j \in \mathbb{R}^d$. In the final state, each training pattern is assigned to one code-book vector. The optimal positions of the code-book vectors are usually obtained by minimizing the sum $E$ of squared distances between each code-book vector and its assigned patterns,

$\displaystyle E \;=\; \sum_{i,j} P(j\,|\,\mathbf{x}_i)\,\bigl\Vert\,\mathbf{x}_i - \mathbf{c}_j\,\bigr\Vert^2 \; .$   (2.12)

$P(j\,|\,\mathbf{x}_i)$ is the probability that $\mathbf{x}_i$ belongs to $\mathbf{c}_j$. In the final state, $P(j\,|\,\mathbf{x}_i) = 1$ if $\mathbf{x}_i$ is assigned to $\mathbf{c}_j$, and $P(j\,|\,\mathbf{x}_i) = 0$ otherwise. A pattern is assigned to the code-book vector with the smallest Euclidean distance to that pattern. Thus, the code-book vectors induce a Voronoi tessellation of the space (figure 2.2). Within each of the resulting regions, the position of the code-book vector is the center-of-mass of the local pattern distribution (otherwise, (2.12) could not be minimal).
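As an illustration, the following sketch (in Python with NumPy; the function names and the random-initialization example are chosen here and do not appear in the text) computes the hard assignments and the error $E$ of (2.12) for given patterns and code-book vectors.

    import numpy as np

    def hard_assign(X, C):
        """Index of the nearest code-book vector for every pattern (hard P(j|x_i))."""
        # squared Euclidean distances between all patterns and all code-book vectors, shape (n, m)
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    def quantization_error(X, C):
        """Sum of squared distances E from (2.12) under hard assignments."""
        idx = hard_assign(X, C)
        return ((X - C[idx]) ** 2).sum()

    # Example: 200 random 2-d patterns, 5 code-book vectors drawn from the patterns
    X = np.random.rand(200, 2)
    C = X[np.random.choice(len(X), 5, replace=False)].copy()
    print(quantization_error(X, C))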

Figure 2.2: Code-book vectors (crosses) and the resulting Voronoi tessellation (dashed lines mark the boundaries, each equidistant from two neighboring code-book vectors). Training patterns are drawn as gray dots.
\includegraphics[width=11cm]{vectorquant.eps}

The difficulty in finding the optimal set $\{\mathbf{c}_j\}$ is that $E$ has many local minima; no general solution exists. Instead, various iterative algorithms find approximate solutions. These algorithms can be divided into those that use hard-clustering and those that use soft-clustering. In hard-clustering, $P(j\,|\,\mathbf{x}_i)$ is binary throughout the iteration, and each $\mathbf{x}_i$ can only be assigned to one code-book vector. In soft-clustering, $P(j\,|\,\mathbf{x}_i)$ can take any value in the interval $[0,1]$.
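For illustration, one possible soft-assignment rule is sketched below (an exponential, Gibbs-like form, as used for instance in deterministic annealing; the parameter beta and the function name are chosen here and are not taken from the text). For large beta the assignments approach the hard, winner-takes-all case.

    import numpy as np

    def soft_assign(X, C, beta=1.0):
        """Soft assignments P(j|x_i) in [0, 1], summing to one over j for each pattern."""
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)       # shape (n, m)
        w = np.exp(-beta * (dists - dists.min(axis=1, keepdims=True)))   # stabilized exponentials
        return w / w.sum(axis=1, keepdims=True)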

The algorithms can be further divided into on-line and batch versions. On-line versions update the code-book vectors in each iteration step based on just one (randomly drawn) training pattern $\mathbf{x}_i$ (as for the self-organizing map, section 1.5.5). The update rule is usually written as

$\displaystyle \mathbf{c}_j(t+1) \;=\; \mathbf{c}_j(t) \,+\, \varepsilon(t)\,P(j\,|\,\mathbf{x}_i)\,\bigl[\,\mathbf{x}_i - \mathbf{c}_j(t)\,\bigr] \; .$   (2.13)

The change of $\mathbf{c}_j$ is in the direction of the negative gradient of (2.12). $\varepsilon(t)$ is a learning rate, which may depend on time $t$. In contrast, batch versions use all training patterns in each step. Here, the algorithm alternates between computing all $P(j\,|\,\mathbf{x}_i)$ for a given set $\{\mathbf{c}_j\}$ and optimizing all $\mathbf{c}_j$ given the $P(j\,|\,\mathbf{x}_i)$. This algorithm is a variant of the expectation-maximization algorithm (Dempster et al., 1977). The `maximization' step is a minimization of the error $E$ (which maximizes the likelihood, see section 2.3.1).
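A minimal sketch of the two variants, assuming hard assignments (the step functions below are illustrative names, not the algorithms as specified later in the text): an on-line step following (2.13), where only the winning code-book vector is moved, and a batch step that alternates assignment with moving each code-book vector to the center of mass of its region.

    import numpy as np

    def online_step(C, x, eps):
        """One on-line update (2.13) with hard assignments: only the winner moves."""
        j = ((x - C) ** 2).sum(axis=1).argmin()   # nearest code-book vector
        C[j] += eps * (x - C[j])                  # move it towards the pattern
        return C

    def batch_step(X, C):
        """One batch (EM-like) step: assign all patterns, then recompute centers of mass."""
        idx = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(C)):
            members = X[idx == j]
            if len(members) > 0:                  # code-book vectors with empty regions stay put
                C[j] = members.mean(axis=0)
        return C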

The following sections describe the hard-clustering algorithm `k-means' and provide more details about soft-clustering algorithms, of which two examples are presented: `deterministic annealing' and `Neural Gas'.


