3.2.1 Algorithm

Next: 3.2.2 Alternative distance measure Up: 3.2 Extension of Neural Previous: 3.2 Extension of Neural

3.2.1 Algorithm

In the extension of Neural Gas to NGPCA, a code-book vector is replaced by a hyper-ellipsoid. An ellipsoid has a center $\bf c_{j}^{}$ , and axes given by a local PCA, which extracts q eigenvectors $\bf w^{l}_{j}$ and eigenvalues $\lambda^{l}_{j}$ with l = 1,..., q. The size of the ellipsoid in the direction of the d - q minor components is given by the mean residual variance $\sigma^{2}_{j}$ in these directions. The algorithm has the same annealing scheme as Neural Gas, and it also has the same parameters. Again, at the beginning of each annealing step, a data point is randomly drawn from the training set. After presenting a point $\bf x$ the centers $\bf c_{j}^{}$ are updated as the code-book vectors in Neural Gas,

$\displaystyle \bf c_{j}^{}$ (t + 1) = $\displaystyle \bf c_{j}^{}$ (t) + $\displaystyle \alpha_{j}^{}$ ^. $\displaystyle \left[\vphantom{{\bf x}- {\bf c}_j(t)}\right.$ $\displaystyle \bf x$ - $\displaystyle \bf c_{j}^{}$ (t) $\displaystyle \left.\vphantom{{\bf x}- {\bf c}_j(t)}\right]$ .

(3.1)

The weight $\alpha_{j}^{}$ is defined by $\alpha_{j}^{}$ = $\varepsilon$ ^.exp(- r_j/ $\varrho$ ). The learning rate $\varepsilon$ and the neighborhood range $\varrho$ decrease exponentially during training. r_j is again the rank of the unit j with respect to $\bf x$ . In the following, the unit index j is omitted for simplicity. All equations need to be evaluated for each unit separately.

The ranking of the units cannot depend on the Euclidean distance anymore, since this would ignore the shape of the ellipsoids. Instead, an elliptical error measure is chosen,

E( $\displaystyle \bf x$ ) = $\displaystyle \xi$ ^T $\displaystyle \bf W$ $\displaystyle \Lambda$ ^-1 $\displaystyle \bf W^{T}_{}$ $\displaystyle \xi$ + $\displaystyle {\frac{{1}}{{\sigma^2}}}$ $\displaystyle \left(\vphantom{\boldsymbol{\xi}^T \boldsymbol{\xi}- \boldsymbol{\xi}^T {\bf W}{\bf W}^T \boldsymbol{\xi}}\right.$ $\displaystyle \xi$ ^T $\displaystyle \xi$ - $\displaystyle \xi$ ^T $\displaystyle \bf W$ $\displaystyle \bf W^{T}_{}$ $\displaystyle \xi$ $\displaystyle \left.\vphantom{\boldsymbol{\xi}^T \boldsymbol{\xi}- \boldsymbol{\xi}^T {\bf W}{\bf W}^T \boldsymbol{\xi}}\right)$ + lndet $\displaystyle \Lambda$ + (d - q)ln $\displaystyle \sigma^{2}_{}$ .

(3.2)

This measure is a normalized Mahalanobis distance plus reconstruction error (Hinton et al., 1997). $\xi$ = $\bf x$ - $\bf c$ is the deviation of the vector $\bf x$ from the center of a unit. The matrix $\bf W$ contains in its columns the eigenvectors $\bf w^{l}_{}$ , and the diagonal matrix $\Lambda$ contains the eigenvalues $\lambda^{l}_{}$ . It can be shown (appendix C.1) that this error measure is the same (up to a constant) as the double negative logarithm of the local probability density given in probabilistic PCA. Thus, a unit can be interpreted as a local density p( $\bf x$ ) = (2 $\pi$ )^-d/2exp(- E( $\bf x$ )/2), and the units are ranked in the order of the probabilities of $\bf x$ originating from the single units. However, the computation of the probability is not necessary since the same result is obtained by ordering the units using the error E( $\bf x$ ). The unit resulting in the smallest E( $\bf x$ ) has rank zero, and the one with the largest E( $\bf x$ ) has rank m - 1.

The second term in (3.2) is the reconstruction error divided by $\sigma^{2}_{}$ . $\sigma$ depends on the estimate of the total residual variance v_res, which is updated according to

v_res(t + 1) = v_res(t) + $\displaystyle \alpha$ ^. $\displaystyle \left(\vphantom{\boldsymbol{\xi}^T \boldsymbol{\xi}- \boldsymbol{\xi}^T {\bf W}{\bf W}^T\boldsymbol{\xi}- v_{\mbox{res}}(t)}\right.$ $\displaystyle \xi$ ^T $\displaystyle \xi$ - $\displaystyle \xi$ ^T $\displaystyle \bf W$ $\displaystyle \bf W^{T}_{}$ $\displaystyle \xi$ - v_res(t) $\displaystyle \left.\vphantom{\boldsymbol{\xi}^T \boldsymbol{\xi}- \boldsymbol{\xi}^T {\bf W}{\bf W}^T\boldsymbol{\xi}- v_{\mbox{res}}(t)}\right)$ .

(3.3)

The total residual variance is evenly distributed among all d - q minor dimensions by

$\displaystyle \sigma^{2}_{}$ = $\displaystyle {\frac{{v_{\mbox{res}}}}{{d-q}}}$ .

(3.4)

This equation is the same as for the noise in probabilistic PCA (2.11). To adjust the principal axes and their lengths, we do one step with an on-line PCA method:

{ $\displaystyle \bf W$ (t + 1), $\displaystyle \Lambda$ (t + 1)} = PCA{ $\displaystyle \bf W$ (t), $\displaystyle \Lambda$ (t), $\displaystyle \xi$ (t), $\displaystyle \alpha$ (t)} .

(3.5)

We use a PCA algorithm similar to the robust recursive least square algorithm (RRLSA) (Ouyang et al., 2000). RRLSA is a sequential network of single-neuron principal component analyzers based on deflation of the input vector (Sanger, 1989). While the $\bf w^{l}_{}$ are normalized to unit length, internally the algorithm works with unnormalized $\tilde{{{\bf w}}}^{l}_{}$ , which are updated according to

$\displaystyle \tilde{{{\bf w}}}^{l}_{}$ (t + 1) = $\displaystyle \tilde{{{\bf w}}}^{l}_{}$ (t) + $\displaystyle \alpha$ ^. $\displaystyle \left(\vphantom{\boldsymbol{\xi}^{(l)} y_l - \tilde{{\bf w}}^l(t)}\right.$ $\displaystyle \xi$ ^(l)y_l - $\displaystyle \tilde{{{\bf w}}}^{l}_{}$ (t) $\displaystyle \left.\vphantom{\boldsymbol{\xi}^{(l)} y_l - \tilde{{\bf w}}^l(t)}\right)$ , for l = 1,..., q .

(3.6)

y_l is a component of the vector $\bf y$ = $\bf W^{T}_{}$ $\xi$ . The deflated vector $\xi$ ^(l) is computed by iterating

$\displaystyle \xi$ ^(l+1) = $\displaystyle \xi$ ^(l) - $\displaystyle \bf w^{l}_{}$ y_l starting with $\displaystyle \xi$ ⁽¹⁾ = $\displaystyle \xi$ .

(3.7)

After each step t, the eigenvalue and eigenvector estimates are obtained from

$\displaystyle \lambda^{l}_{}$ = | $\displaystyle \tilde{{{\bf w}}}^{l}_{}$ |, $\displaystyle \bf w^{l}_{}$ = $\displaystyle {\frac{{\tilde{{\bf w}}^l}}{{\Vert\tilde{{\bf w}}^l\Vert}}}$ , for l = 1,..., q .

(3.8)

Since the orthogonality of $\bf W$ is not fully preserved for each step, the algorithm has to be combined with an orthogonalization method, here we used Gram-Schmidt (Möller, 2002). Orthogonality is essential for the computation of the error (3.2). This orthogonalization concludes one annealing step.

The unit centers are initialized by randomly chosen examples from the training set. The eigenvector estimates are initialized with random orthogonal vectors. The eigenvalues $\lambda^{l}_{j}$ and the variances $\sigma^{2}_{j}$ are initially set to one. To avoid zero variance in a direction during the computation of the local PCA, a uniform noise randomly chosen from the interval [- 0.0005;0.0005] is added to each component of a randomly drawn data point (for all experiments).

Next: 3.2.2 Alternative distance measure Up: 3.2 Extension of Neural Previous: 3.2 Extension of Neural

Heiko Hoffmann
2005-03-22