2.2 Vector quantization
Vector quantization describes a pattern set using a reduced number of so-called `code-book' vectors. We assume again that $n$ training patterns $\mathbf{x}_i \in \mathbb{R}^d$ are given. Let $m < n$ be the number of code-book vectors $\mathbf{w}_j \in \mathbb{R}^d$. In the final state, each training pattern is assigned to one code-book vector. The optimal positions of the code-book vectors are usually obtained by finding the minimum of the sum $E$ of squared distances between each code-book vector and its assigned patterns,

$E = \sum_{i=1}^{n} \sum_{j=1}^{m} P(j\,|\,\mathbf{x}_i)\, \|\mathbf{x}_i - \mathbf{w}_j\|^2$ .   (2.12)

$P(j\,|\,\mathbf{x}_i)$ is the probability that $\mathbf{x}_i$ belongs to $\mathbf{w}_j$. For the final state, $P(j\,|\,\mathbf{x}_i) = 1$ if $\mathbf{x}_i$ is assigned to $\mathbf{w}_j$, and $P(j\,|\,\mathbf{x}_i) = 0$ otherwise. A pattern is assigned to the code-book vector that has the smallest Euclidean distance to that pattern. Thus, the code-book vectors induce a Voronoi tessellation of space (figure 2.2). In each of the separated regions, the position of the code-book vector is the center-of-mass of the local pattern distribution (otherwise (2.12) cannot be minimal).
Figure 2.2:
Code-book vectors (crosses) and the resulting Voronoi tessellation (dashed lines are boundaries, equidistant from two respective code-book vectors). Training patterns are drawn as gray dots.
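As a concrete illustration of (2.12) and the hard assignment rule, the following sketch (in Python/NumPy, not part of the original text; the function and variable names are my own) computes, for given patterns and code-book vectors, the nearest-code-book-vector assignments and the resulting error $E$.

import numpy as np

def quantization_error(X, W):
    # Error E of (2.12) under hard assignments.
    # X : (n, d) array of training patterns x_i
    # W : (m, d) array of code-book vectors w_j
    # Squared Euclidean distance between every pattern and every code-book vector
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # shape (n, m)
    # Hard assignment: P(j|x_i) = 1 for the nearest code-book vector, 0 otherwise
    assignment = d2.argmin(axis=1)                            # shape (n,)
    E = d2[np.arange(len(X)), assignment].sum()
    return E, assignment

# Small example with arbitrary data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))   # n = 200 training patterns in the plane
W = rng.normal(size=(5, 2))     # m = 5 code-book vectors
E, assignment = quantization_error(X, W)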
The difficulty in finding the optimal $\{\mathbf{w}_j\}$ is that $E$ has many local minima. No general solution exists. Instead, various iterative algorithms exist that find approximate solutions. The algorithms can be divided into those that use hard-clustering and those that use soft-clustering. In hard-clustering, $P(j\,|\,\mathbf{x}_i)$ is binary (throughout the iteration), and each $\mathbf{x}_i$ can only be assigned to one code-book vector. In soft-clustering, $P(j\,|\,\mathbf{x}_i)$ can take any value in the interval $[0;1]$.
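To make this distinction concrete, here is a minimal sketch (mine, not from the text) of the two kinds of assignment. The Gibbs/softmax form of the soft assignment, with a `temperature' T, is only one possible choice; it appears, for example, in deterministic annealing.

import numpy as np

def hard_assignments(X, W):
    # P(j|x_i) = 1 for the nearest code-book vector, 0 otherwise
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    P = np.zeros_like(d2)
    P[np.arange(len(X)), d2.argmin(axis=1)] = 1.0
    return P

def soft_assignments(X, W, T=1.0):
    # P(j|x_i) in [0;1]; here a Gibbs distribution over squared distances
    # with temperature T (an illustrative choice, not the only one)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / T
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

Both functions return an (n, m) matrix whose rows sum to one; in the hard case each row contains a single 1.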
The algorithms can be further divided into on-line and batch versions. On-line versions update code-book vectors in each iteration step based on just one (randomly drawn) training pattern (as for the self-organizing map, section 1.5.5). The update rule is usually written as
$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \varepsilon(t)\, P(j\,|\,\mathbf{x}) \left[ \mathbf{x} - \mathbf{w}_j(t) \right]$ .   (2.13)

The change of $\mathbf{w}_j$ is in the direction of the negative gradient of (2.12). $\varepsilon(t)$ is a learning rate, which can depend on time. In contrast, batch versions use all training patterns for each step. Here, the algorithm alternates between computing all $P(j\,|\,\mathbf{x}_i)$ based on a given distribution $\{\mathbf{w}_j\}$, and optimizing all $\mathbf{w}_j$ given $P(j\,|\,\mathbf{x}_i)$. This algorithm is a variant of the expectation-maximization algorithm (Dempster et al., 1977). The `maximization' step is a minimization of the error $E$ (which maximizes the likelihood, see section 2.3.1).
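The two iteration schemes can be sketched as follows (again my own illustration, with an arbitrary exponential learning-rate schedule; with hard assignments the batch version reduces to k-means, described in the next section).

import numpy as np

def online_step(W, x, t, eps0=0.5, tau=100.0):
    # One on-line step in the sense of (2.13) with a hard assignment:
    # only the winning code-book vector is moved towards the drawn pattern x.
    eps = eps0 * np.exp(-t / tau)               # example time-dependent learning rate
    j = ((W - x) ** 2).sum(axis=1).argmin()     # winner: P(j|x) = 1, all others 0
    W[j] += eps * (x - W[j])
    return W

def batch_vq(X, W, n_iter=50):
    # Batch version: alternate between computing the hard assignments P(j|x_i)
    # for fixed code-book vectors and moving each code-book vector to the
    # center-of-mass of its assigned patterns (which minimizes E for fixed assignments).
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
        assignment = d2.argmin(axis=1)
        for j in range(len(W)):
            members = X[assignment == j]
            if len(members) > 0:                # leave empty code-book vectors unchanged
                W[j] = members.mean(axis=0)
    return W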
The following sections describe the hard-clustering algorithm `k-means' and provide more details about soft-clustering algorithms. Two examples are described: `deterministic annealing' and `Neural Gas'.
Heiko Hoffmann
2005-03-22