
2.2 Vector quantization

Vector quantization describes a pattern set using a reduced number of so-called `code-book' vectors. We assume again that n training patterns $\mathbf{x}_i \in \mathbb{R}^d$ are given. Let m < n be the number of code-book vectors $\mathbf{c}_j \in \mathbb{R}^d$. In the final state, each training pattern is assigned to one code-book vector. The optimal positions of the code-book vectors are usually obtained by minimizing the sum E of squared distances between each code-book vector and its assigned patterns,

$E = \sum_{ij} P(j|\mathbf{x}_i)\, \left\| \mathbf{x}_i - \mathbf{c}_j \right\|^2$ .   (2.12)

$P(j|\mathbf{x}_i)$ is the probability that $\mathbf{x}_i$ belongs to $\mathbf{c}_j$. In the final state, $P(j|\mathbf{x}_i) = 1$ if $\mathbf{x}_i$ is assigned to $\mathbf{c}_j$, and $P(j|\mathbf{x}_i) = 0$ otherwise. A pattern is assigned to the code-book vector that has the smallest Euclidean distance to that pattern. Thus, the code-book vectors induce a Voronoi tessellation of the space (figure 2.2). In each of the separated regions, the position of the code-book vector is the center-of-mass of the local pattern distribution (otherwise, (2.12) cannot be minimal).
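The hard assignment and the error (2.12) can be sketched in a few lines of NumPy; the function name and the vectorized distance computation are illustrative choices, not part of the text:

```python
import numpy as np

def assign_and_error(X, C):
    """Hard-assign each pattern (row of X) to its nearest code-book
    vector (row of C) by Euclidean distance, and return the
    assignments together with the quantization error E of (2.12)."""
    # Pairwise squared distances, shape (n, m), via broadcasting.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    j = d2.argmin(axis=1)                  # index of the winning vector
    E = d2[np.arange(len(X)), j].sum()     # sum of winning distances
    return j, E
```

With the hard, binary P(j|x), the double sum in (2.12) collapses to the sum of each pattern's distance to its single winner, which is what the last two lines compute.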

Figure 2.2: Code-book vectors (crosses) and the resulting Voronoi tessellation (dashed lines are boundaries, equidistant from two respective code-book vectors). Training patterns are drawn as gray dots.

The difficulty in finding the optimal $\{\mathbf{c}_j\}$ is that E has many local minima. No general solution exists; instead, various iterative algorithms find approximate solutions. These algorithms can be divided into hard-clustering and soft-clustering methods. In hard-clustering, $P(j|\mathbf{x}_i)$ is binary throughout the iteration, and each $\mathbf{x}_i$ can be assigned to only one code-book vector. In soft-clustering, $P(j|\mathbf{x}_i)$ can take any value in the interval [0, 1].
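As a sketch of the soft-clustering case, one common choice (the Gibbs form used, for example, by deterministic annealing) assigns probabilities via a softmax over negative squared distances; the inverse-temperature parameter `beta` and the function name are assumptions for illustration:

```python
import numpy as np

def soft_assignments(X, C, beta=1.0):
    """Soft assignment probabilities P(j|x_i) as a Gibbs/softmax over
    negative squared distances.  For beta -> infinity this recovers
    the binary hard assignment to the nearest code-book vector."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # (n, m)
    # Subtract the row minimum before exponentiating, for stability.
    w = np.exp(-beta * (d2 - d2.min(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)                   # rows sum to 1
```

Each row of the returned matrix is a probability distribution over the code-book vectors, so every pattern contributes to every $\mathbf{c}_j$, weighted by proximity.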

The algorithms can be further divided into on-line and batch versions. On-line versions update the code-book vectors at each iteration step based on just one (randomly drawn) training pattern $\mathbf{x}_i$ (as for the self-organizing map, section 1.5.5). The update rule is usually written as

$\mathbf{c}_j(t+1) = \mathbf{c}_j(t) + \varepsilon(t)\, P(j|\mathbf{x}_i)\, \left[ \mathbf{x}_i - \mathbf{c}_j(t) \right]$ .   (2.13)

The change of $\mathbf{c}_j$ is in the direction of the negative gradient of (2.12). $\varepsilon(t)$ is a learning rate, which may depend on time. In contrast, batch versions use all training patterns in each step. Here, the algorithm alternates between computing all $P(j|\mathbf{x}_i)$ for a given set $\{\mathbf{c}_j\}$ and optimizing all $\mathbf{c}_j$ given the $P(j|\mathbf{x}_i)$. This algorithm is a variant of the expectation-maximization algorithm (Dempster et al., 1977). The `maximization' step is a minimization of the error E (which maximizes the likelihood, see section 2.3.1).
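The on-line hard-clustering version of rule (2.13) can be sketched as follows; the exponentially decaying learning-rate schedule and the initialization from given starting vectors `C0` are common choices assumed here, not prescribed by the text:

```python
import numpy as np

def online_vq(X, C0, steps=5000, eps0=0.5, eps_end=0.01, seed=0):
    """On-line hard-clustering vector quantization.  At each step, one
    randomly drawn pattern moves its nearest code-book vector (the
    winner, for which P(j|x) = 1) according to rule (2.13)."""
    rng = np.random.default_rng(seed)
    C = np.array(C0, dtype=float)
    for t in range(steps):
        x = X[rng.integers(len(X))]                    # one random pattern
        j = ((C - x) ** 2).sum(axis=1).argmin()        # winner
        eps = eps0 * (eps_end / eps0) ** (t / steps)   # decaying rate
        C[j] += eps * (x - C[j])                       # update (2.13)
    return C
```

Because only the winner moves, each update is a stochastic step along the negative gradient of (2.12); the decaying $\varepsilon(t)$ lets the vectors settle near the centers-of-mass of their Voronoi regions.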

The following sections describe the hard-clustering algorithm `k-means' and provide more details about soft-clustering algorithms. Two examples are described: `deterministic annealing' and `Neural Gas'.

Heiko Hoffmann