

B.2 Kernel PCA speed-up

The computational load for the projection onto a principal component is high: $n$ evaluations of $k(\mathbf{z}, \mathbf{x}_i)$. In the context of support vector machines, Burges (1996) introduced a speed-up that is also usable for the extraction of the principal components (Schölkopf et al. (1998b) suggested that it could also be used for kernel PCA). The idea is to approximate each vector $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \boldsymbol{\varphi}(\mathbf{x}_i)$ by another vector $\mathbf{w}'$ that uses only a small set of vectors $\mathbf{y}_i$ from the original space, instead of the whole set $\{\mathbf{x}_i\}$,

$\displaystyle \mathbf{w}' = \sum_{i=1}^{m} \beta_i \, \boldsymbol{\varphi}(\mathbf{y}_i)$ .   (B.3)

$m$ is set a priori to a value much smaller than $n$. The set $\{(\mathbf{y}_i, \beta_i) \mid i = 1, \ldots, m\}$ is called the reduced set (Burges, 1996).
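
For illustration, the following minimal NumPy sketch contrasts the projection cost with the full expansion and with the reduced set of (B.3). The names gaussian_kernel, project_full, and project_reduced are hypothetical, and the Gaussian kernel is only an example choice.

import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Example kernel choice: k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def project_full(z, X, alpha, kernel=gaussian_kernel):
    # Projection of phi(z) onto w = sum_i alpha_i phi(x_i): n kernel evaluations.
    return sum(a * kernel(x, z) for a, x in zip(alpha, X))

def project_reduced(z, Y, beta, kernel=gaussian_kernel):
    # Same projection with the reduced set (B.3): only m << n kernel evaluations.
    return sum(b * kernel(y, z) for b, y in zip(beta, Y))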

Here, instead of minimizing $\Vert\mathbf{w} - \mathbf{w}'\Vert^2$ over the whole reduced set to determine it, we use a method introduced by Schölkopf et al. (1998a). Optimizing over the whole reduced set simultaneously is computationally expensive; instead, an iterative method extracts the $\mathbf{y}_i$ one by one. Moreover, the optimization over $\mathbf{y}_i$ and $\beta_i$ is split.

First, starting with $\mathbf{y}_1$, we minimize the distance between $\mathbf{w}$ and its projection onto the span of $\boldsymbol{\varphi}(\mathbf{y}_1)$,

$\displaystyle \min_{\mathbf{y}_1} \left\Vert \frac{\mathbf{w}^T \boldsymbol{\varphi}(\mathbf{y}_1)}{\boldsymbol{\varphi}(\mathbf{y}_1)^T \boldsymbol{\varphi}(\mathbf{y}_1)} \, \boldsymbol{\varphi}(\mathbf{y}_1) - \mathbf{w} \right\Vert^2$ .   (B.4)

Minimizing this distance is equivalent to maximizing

$\displaystyle \frac{\left(\mathbf{w}^T \boldsymbol{\varphi}(\mathbf{y}_1)\right)^2}{\boldsymbol{\varphi}(\mathbf{y}_1)^T \boldsymbol{\varphi}(\mathbf{y}_1)} = \frac{\left(\sum_{i=1}^n \alpha_i \, k(\mathbf{x}_i, \mathbf{y}_1)\right)^2}{k(\mathbf{y}_1, \mathbf{y}_1)}$ .   (B.5)
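
For illustration, the right-hand side of (B.5) can be written directly in terms of the kernel; the following sketch (the function name objective is hypothetical) is the quantity that a general-purpose optimizer would maximize over $\mathbf{y}_1$.

def objective(y, X, alpha, kernel):
    # Right-hand side of (B.5): (sum_i alpha_i k(x_i, y))^2 / k(y, y).
    numerator = sum(a * kernel(x, y) for a, x in zip(alpha, X)) ** 2
    return numerator / kernel(y, y)

# One could hand the negative of this objective to a generic optimizer, e.g.:
# from scipy.optimize import minimize
# y1 = minimize(lambda y: -objective(y, X, alpha, kernel), x0=X[0]).x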

This optimization problem is much less demanding than the one mentioned before: the dimensionality is that of the pattern space. For radial-basis-function kernels, the denominator is constant; thus, only the numerator needs to be maximized.
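
For the Gaussian kernel $k(\mathbf{x}, \mathbf{y}) = \exp(-\Vert\mathbf{x} - \mathbf{y}\Vert^2/(2\sigma^2))$, setting the gradient of the numerator to zero gives the fixed-point condition $\mathbf{y} = \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{y}) \, \mathbf{x}_i \, / \sum_i \alpha_i k(\mathbf{x}_i, \mathbf{y})$. The sketch below iterates this condition; it is only one possible way to carry out the maximization, not necessarily the exact procedure used here (the name fixed_point_y is hypothetical).

import numpy as np

def fixed_point_y(X, alpha, sigma, y0, n_iter=50):
    # Maximize (sum_i alpha_i k(x_i, y))^2 for the Gaussian kernel by iterating
    # y <- sum_i alpha_i k(x_i, y) x_i / sum_i alpha_i k(x_i, y).
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(n_iter):
        w = alpha * np.exp(-np.sum((X - y) ** 2, axis=1) / (2.0 * sigma ** 2))
        s = w.sum()
        if abs(s) < 1e-12:
            # Degenerate case: restart from a random training point.
            y = X[np.random.randint(len(X))].astype(float).copy()
            continue
        y = (w[:, None] * X).sum(axis=0) / s
    return y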

After $\mathbf{y}_1$ is determined, the optimal $\beta_1$ is computed. Generally, if the vectors $\mathbf{y}_i$ are known, the corresponding optimal $\beta_i$ can be obtained analytically by setting the derivatives $\frac{\partial}{\partial \beta_i} \Vert\mathbf{w} - \mathbf{w}'\Vert^2$ to zero (Schölkopf et al., 1998a). The result is

$\displaystyle \boldsymbol{\beta} = \left(\mathbf{K}^{y}\right)^{-1} \mathbf{K}^{yx} \boldsymbol{\alpha}$ .   (B.6)

$\mathbf{K}^{y}$ is the matrix with entries $k(\mathbf{y}_i, \mathbf{y}_j)$, and $\mathbf{K}^{yx}$ the matrix with entries $k(\mathbf{y}_i, \mathbf{x}_j)$.
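
A minimal NumPy sketch of (B.6), assuming Y and X are arrays of reduced-set and training points (the function name compute_beta is hypothetical); solving the linear system avoids forming the inverse explicitly.

import numpy as np

def compute_beta(Y, X, alpha, kernel):
    # Equation (B.6): beta = (K^y)^{-1} K^{yx} alpha, solved as a linear system.
    Ky  = np.array([[kernel(yi, yj) for yj in Y] for yi in Y])   # K^y_{ij}  = k(y_i, y_j)
    Kyx = np.array([[kernel(yi, xj) for xj in X] for yi in Y])   # K^{yx}_{ij} = k(y_i, x_j)
    return np.linalg.solve(Ky, Kyx @ alpha)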

$ \beta_{1}^{}$$ \bf\boldsymbol{\varphi}$($ \bf y_{1}^{}$) alone is not enough to replace w. Therefore, the second point $ \bf y_{2}^{}$ is chosen such that the remaining vector $ \bf w$ - $ \beta_{1}^{}$$ \bf\boldsymbol{\varphi}$($ \bf y_{1}^{}$) is approximated by $ \beta_{2}^{}$$ \bf\boldsymbol{\varphi}$($ \bf y_{2}^{}$). This leads to an iterative scheme, with $ \bf w_{{t+1}}^{}$ = $ \bf w_{t}^{}$ - $ \beta_{t}^{}$$ \bf\boldsymbol{\varphi}$($ \bf y_{t}^{}$), starting with $ \bf w_{1}^{}$ = $ \bf w$. At each step, $ \bf w_{t}^{}$ is approximated by $ \beta_{t}^{}$$ \bf\boldsymbol{\varphi}$($ \bf y_{t}^{}$). That is, $ \bf y_{t}^{}$ is obtained by maximizing (B.5), and then, {$ \beta_{i}^{}$ | i = 1,..., t} are calculated using (B.6). Every iteration step, every addition of a set ($ \bf y_{t}^{}$,$ \beta_{t}^{}$), reduces the distance to the vector $ \bf w$. In this way, the complete reduced set can be obtained.
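
Combining the previous sketches, the iteration for a single vector might look as follows. The helpers fixed_point_y and compute_beta are the hypothetical functions from above, and kernel is assumed to be the same Gaussian kernel with width sigma.

import numpy as np

def reduced_set(X, alpha, sigma, m, kernel):
    # Build {(y_t, beta_t)} for a single vector w = sum_i alpha_i phi(x_i).
    # The residual w_t is kept as a kernel expansion over the points seen so far:
    # w_t = sum_i alpha_i phi(x_i) - sum_{j<t} beta_j phi(y_j).
    Y, beta = [], None
    for t in range(m):
        if Y:
            pts   = np.vstack([X, np.array(Y)])
            coeff = np.concatenate([alpha, -beta])
        else:
            pts, coeff = X, alpha
        # Maximize (B.5) for the current residual (Gaussian fixed-point sketch above).
        y0 = X[np.random.randint(len(X))]
        Y.append(fixed_point_y(pts, coeff, sigma, y0))
        # Recompute beta_1 ... beta_t analytically via (B.6).
        beta = compute_beta(np.array(Y), X, alpha, kernel)
    return np.array(Y), beta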

In kernel PCA, more than one vector needs to be approximated. To do this, the above method can be generalized (Schölkopf et al., 1998a). Instead of (B.4), the sum of the squared projection distances is minimized,

$\displaystyle \min_{\mathbf{y}_t} \sum_{l=0}^{q} \left\Vert \frac{{\mathbf{w}^l}^T \boldsymbol{\varphi}(\mathbf{y}_t)}{\boldsymbol{\varphi}(\mathbf{y}_t)^T \boldsymbol{\varphi}(\mathbf{y}_t)} \, \boldsymbol{\varphi}(\mathbf{y}_t) - \mathbf{w}^l \right\Vert^2$ .   (B.7)

Here, $\mathbf{w}^0$ is the center vector $\frac{1}{n} \sum_{i=1}^{n} \boldsymbol{\varphi}(\mathbf{x}_i)$, and $\mathbf{w}^l$, with $l = 1, \ldots, q$, are the $q$ eigenvectors $\sum_{i=1}^{n} \alpha^l_i \boldsymbol{\varphi}(\mathbf{x}_i)$. The iteration is the same as above, $\mathbf{w}^l_{t+1} = \mathbf{w}^l_t - \beta^l_t \boldsymbol{\varphi}(\mathbf{y}_t)$. The $\beta^l_i$ are computed for each vector separately, $\boldsymbol{\beta}^l = (\mathbf{K}^{y})^{-1} \mathbf{K}^{yx} \boldsymbol{\alpha}^l$. However, all approximated vectors are based on the same set $\{\mathbf{y}_i\}$.
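
As a sketch of the multi-vector case (hypothetical names; coeffs holds one row of expansion coefficients per vector $\mathbf{w}^l$, e.g. the coefficients of the residuals at step $t$), the summed objective corresponding to (B.7) and the per-vector coefficients of (B.6) could be computed as follows.

import numpy as np

def multi_objective(y, X, coeffs, kernel):
    # Summed analogue of (B.5) for several vectors w^l = sum_i coeffs[l, i] phi(x_i);
    # maximizing this is equivalent to minimizing the summed distances in (B.7).
    k_xy = np.array([kernel(x, y) for x in X])            # k(x_i, y)
    return np.sum((coeffs @ k_xy) ** 2) / kernel(y, y)

def compute_all_beta(Y, X, coeffs, kernel):
    # One beta^l per approximated vector via (B.6); the same set Y serves all of them.
    Ky  = np.array([[kernel(yi, yj) for yj in Y] for yi in Y])
    Kyx = np.array([[kernel(yi, xj) for xj in X] for yi in Y])
    return np.linalg.solve(Ky, Kyx @ coeffs.T).T          # row l holds beta^l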

The result of this reduced-set method is that all vectors that are expressed as a sum over $n$ kernel functions can be obtained as a sum over only $m$ kernel functions. Thus, the speed gain is $n/m$.

