next up previous contents
Next: 2.4.1 Feature extraction Up: 2. Modeling of data Previous: 2.3.2 Mixture of probabilistic

2.4 Kernel PCA

Different from the mixture models, kernel PCA (Schölkopf et al., 1998b) just works with a single PCA. It is an extension of PCA to non-linear distributions. Instead of directly doing a PCA, the n data points $ \bf x_{i}^{}$ are mapped into a higher-dimensional (possibly infinite-dimensional) feature space,

$\displaystyle \bf x_{i}^{}$ $\displaystyle \rightarrow$ $\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$) . (2.22)

As it turns out later, the computation of this mapping can be omitted. In the feature space, principal components are extracted. That is, the following equation needs to be solved (here, we first assume that {$ \varphi$($ \bf x_{i}^{}$)} has zero mean, see section 2.4.2):

$\displaystyle \lambda$$\displaystyle \bf w$ = $\displaystyle \bf C$$\displaystyle \bf w$ , (2.23)

with the covariance matrix $ \bf C$ = $ {\frac{{1}}{{n}}}$$ \sum_{{j=1}}^{n}$$ \varphi$($ \bf x_{j}^{}$)$ \varphi$($ \bf x_{j}^{}$)T. From the definition of $ \bf C$ follows that $ \bf C$$ \bf w$ is a linear combination of the vectors $ \varphi$($ \bf x_{i}^{}$). Thus, $ \bf w$ must lie in the span of $ \varphi$($ \bf x_{1}^{}$),...,$ \varphi$($ \bf x_{n}^{}$). Hence, we can write

$\displaystyle \bf w$ = $\displaystyle \sum_{{i=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$) . (2.24)

Combining (2.23) and (2.24) gives

$\displaystyle \lambda$$\displaystyle \sum_{{i=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$) = $\displaystyle {\frac{{1}}{{n}}}$$\displaystyle \sum_{{i,j=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \varphi$($\displaystyle \bf x_{j}^{}$)$\displaystyle \left(\vphantom{ \boldsymbol{\varphi}({\bf x}_j)^T \boldsymbol{\varphi}({\bf x}_i)}\right.$$\displaystyle \varphi$($\displaystyle \bf x_{j}^{}$)T$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$)$\displaystyle \left.\vphantom{ \boldsymbol{\varphi}({\bf x}_j)^T \boldsymbol{\varphi}({\bf x}_i)}\right)$ , (2.25)

which is equivalent to the set of n equations

$\displaystyle \lambda$$\displaystyle \sum_{{i=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \left(\vphantom{\boldsymbol{\varphi}({\bf x}_i)^T\boldsymbol{\varphi}({\bf x}_l)}\right.$$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$)T$\displaystyle \varphi$($\displaystyle \bf x_{l}^{}$)$\displaystyle \left.\vphantom{\boldsymbol{\varphi}({\bf x}_i)^T\boldsymbol{\varphi}({\bf x}_l)}\right)$ = $\displaystyle {\frac{{1}}{{n}}}$$\displaystyle \sum_{{i,j=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \left(\vphantom{\boldsymbol{\varphi}({\bf x}_j)^T\boldsymbol{\varphi}({\bf x}_l)}\right.$$\displaystyle \varphi$($\displaystyle \bf x_{j}^{}$)T$\displaystyle \varphi$($\displaystyle \bf x_{l}^{}$)$\displaystyle \left.\vphantom{\boldsymbol{\varphi}({\bf x}_j)^T\boldsymbol{\varphi}({\bf x}_l)}\right)$$\displaystyle \left(\vphantom{ \boldsymbol{\varphi}({\bf x}_j)^T \boldsymbol{\varphi}({\bf x}_i)}\right.$$\displaystyle \varphi$($\displaystyle \bf x_{j}^{}$)T$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$)$\displaystyle \left.\vphantom{ \boldsymbol{\varphi}({\bf x}_j)^T \boldsymbol{\varphi}({\bf x}_i)}\right)$    $\displaystyle \forall$ l . (2.26)

The direction from (2.26) to (2.25) is fulfilled because the left side of (2.25) is in the span of $ \bf\boldsymbol{\varphi}$($ \bf x_{1}^{}$),...,$ \varphi$($ \bf x_{n}^{}$), and (2.26) defines all n projections on $ \varphi$($ \bf x_{i}^{}$). Equation (2.26) has the favorable property that it is written entirely with scalar products in the feature space. Hence, we do not need to carry out the transformation $ \bf\boldsymbol{\varphi}$, which would be computationally impossible for an infinite-dimensional feature space. It is enough to work in the original space. Thus, instead of working with the scalar product $ \varphi$($ \bf x$)T$ \varphi$($ \bf y$), we are only working with a kernel function k($ \bf x$,$ \bf y$) = $ \varphi$($ \bf x$)T$ \varphi$($ \bf y$). For the given data points, this function can be written as a matrix $ \bf K$, with Kij = k($ \bf x_{i}^{}$,$ \bf x_{j}^{}$). Using the kernel matrix, (2.26) can be written as

n$\displaystyle \lambda$$\displaystyle \bf K$$\displaystyle \alpha$ = $\displaystyle \bf K^{2}_{}$$\displaystyle \alpha$ , (2.27)

with $ \alpha$ = ($ \alpha_{1}^{}$,...,$ \alpha_{n}^{}$)T. As shown in appendix C.2, (2.27) is equivalent to

n$\displaystyle \lambda$ $\displaystyle \alpha$ = $\displaystyle \bf K$$\displaystyle \alpha$ . (2.28)

Thus, the vector $ \alpha$ for each principal component can be obtained by extracting the eigenvectors of $ \bf K$. For further processing, the principal component $ \bf w$ needs to be normalized to have unit length. This can be also established by working solely with the kernel,

||$\displaystyle \bf w$||2 = $\displaystyle \left(\vphantom{\sum_{i=1}^n \alpha_i \boldsymbol{\varphi}({\bf x}_i)}\right.$$\displaystyle \sum_{{i=1}}^{n}$$\displaystyle \alpha_{i}^{}$$\displaystyle \varphi$($\displaystyle \bf x_{i}^{}$)$\displaystyle \left.\vphantom{\sum_{i=1}^n \alpha_i \boldsymbol{\varphi}({\bf x}_i)}\right)^{T}_{}$$\displaystyle \left(\vphantom{\sum_{j=1}^n \alpha_j \boldsymbol{\varphi}({\bf x}_j)}\right.$$\displaystyle \sum_{{j=1}}^{n}$$\displaystyle \alpha_{j}^{}$$\displaystyle \varphi$($\displaystyle \bf x_{j}^{}$)$\displaystyle \left.\vphantom{\sum_{j=1}^n \alpha_j \boldsymbol{\varphi}({\bf x}_j)}\right)$ = $\displaystyle \alpha$T$\displaystyle \bf K$$\displaystyle \alpha$ = n$\displaystyle \lambda$ $\displaystyle \alpha$T$\displaystyle \alpha$ , (2.29)

which results in a normalization rule for $ \alpha$.

To apply kernel PCA, a data point's features (the projections on the principal components) need to be extracted, and the formalism needs to be adjusted to distributions that do not have zero mean in feature space. These two points are addressed in the following sections. Furthermore, a short list of common kernel functions is given.

next up previous contents
Next: 2.4.1 Feature extraction Up: 2. Modeling of data Previous: 2.3.2 Mixture of probabilistic
Heiko Hoffmann