2.1.2 Probabilistic PCA


Probabilistic PCA links PCA to the probability density of the patterns $\mathbf{x}_i$ (Tipping and Bishop, 1997). The given set $\{\mathbf{x}_i\}$ is assumed to originate from a probability density $p(\mathbf{x})$. Further, $\mathbf{x}$ is assumed to be a linear combination of a vector $\mathbf{y} \in \mathbb{R}^q$ with density $p(\mathbf{y})$ and a noise vector $\mathbf{e} \in \mathbb{R}^d$ with density $p(\mathbf{e})$,

$$\mathbf{x} = \mathbf{U}\mathbf{y} + \mathbf{e} \,.$$

(2.7)
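As an illustration of (2.7), the following Python sketch draws patterns from the model and compares their empirical covariance with $\sigma^2\mathbf{I} + \mathbf{U}\mathbf{U}^T$, the covariance implied by the model. It assumes zero-mean Gaussian $\mathbf{y}$ and $\mathbf{e}$ with variance one and $\sigma^2$ per dimension. The sizes, the seed, the value of $\sigma$, and the random $\mathbf{U}$ are arbitrary choices made for this example, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, n = 5, 2, 200_000        # hypothetical dimensions and sample count
sigma = 0.3                    # hypothetical noise standard deviation
U = rng.normal(size=(d, q))    # arbitrary d x q matrix, for illustration only

Y = rng.normal(size=(n, q))          # latent vectors y with unit variance
E = sigma * rng.normal(size=(n, d))  # noise vectors e with variance sigma^2
X = Y @ U.T + E                      # each row is one pattern x = U y + e

# Covariance implied by the model versus the empirical one.
B_model = sigma**2 * np.eye(d) + U @ U.T
B_empirical = X.T @ X / n            # the model has zero mean, so no centering
max_abs_diff = np.abs(B_empirical - B_model).max()
```

With this many samples, `max_abs_diff` should be small compared with the entries of `B_model`; the remaining gap is sampling error.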

The goal is to find $\mathbf{U}$, which is a $d \times q$ matrix. Both densities $p(\mathbf{y})$ and $p(\mathbf{e})$ are assumed to be zero-mean isotropic Gaussians (the same variance in every direction), with variance one and $\sigma^2$, respectively. Thus, the density $p(\mathbf{x})$ is fully determined by the parameters $\mathbf{U}$ and $\sigma$,

$$p(\mathbf{x}) = (2\pi)^{-d/2} (\det \mathbf{B})^{-1/2} \exp\left(-\frac{1}{2}\mathbf{x}^T \mathbf{B}^{-1} \mathbf{x}\right)$$

(2.8)

with $\mathbf{B} = \sigma^2 \mathbf{I} + \mathbf{U}\mathbf{U}^T$ (Tipping and Bishop, 1997). Probabilistic PCA determines $\mathbf{U}$ and $\sigma$ such that the patterns $\mathbf{x}_i$, regarded as draws from $p(\mathbf{x})$, are most likely (Tipping and Bishop, 1997). That is, the likelihood,

$$L = \prod_{i=1}^{n} p(\mathbf{x}_i) \,,$$

(2.9)

is maximized (see appendix A.2 for an example of the maximum likelihood principle). This optimization yields the matrix $\mathbf{U}$ (Tipping and Bishop, 1997),

$$\mathbf{U} = \mathbf{W} (\Lambda - \sigma^2 \mathbf{I})^{1/2} \mathbf{R} \,.$$

(2.10)

The columns of the $d \times q$ matrix $\mathbf{W}$ are the $q$ principal eigenvectors of the covariance matrix of $\{\mathbf{x}_i\}$; the diagonal matrix $\Lambda$ contains the corresponding eigenvalues $\lambda_1, \dots, \lambda_q$, and $\mathbf{R}$ is an arbitrary $q \times q$ rotation matrix (because $\mathbf{y}$ has an isotropic Gaussian distribution, rotating it leaves $p(\mathbf{x})$ unchanged). The noise variance $\sigma^2$ turns out to be the residual variance per dimension,

$$\sigma^2 = \frac{1}{d-q} \sum_{l=q+1}^{d} \lambda_l \,.$$

(2.11)
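Equations (2.10) and (2.11) can be turned into a short closed-form fit. The sketch below is one possible reading, not a reference implementation: it assumes the covariance is estimated with a $1/n$ normalization after centering the data, and it fixes the arbitrary rotation to $\mathbf{R} = \mathbf{I}$. The function name `fit_ppca` and the synthetic test data are hypothetical.

```python
import numpy as np

def fit_ppca(X, q):
    """Closed-form PPCA fit following (2.10) and (2.11), with R = I (assumed)."""
    Xc = X - X.mean(axis=0)                  # center: the model assumes zero mean
    C = Xc.T @ Xc / len(Xc)                  # d x d sample covariance (1/n, assumed)
    lam, vecs = np.linalg.eigh(C)            # eigenvalues in ascending order
    lam, vecs = lam[::-1], vecs[:, ::-1]     # reorder to descending
    d = C.shape[0]
    sigma2 = lam[q:].sum() / (d - q)         # (2.11): mean of the d - q minor eigenvalues
    W = vecs[:, :q]                          # q principal eigenvectors as columns
    # (2.10) with R = I: scale column l of W by sqrt(lambda_l - sigma^2).
    # The clamp only guards against tiny negative values from rounding.
    U = W * np.sqrt(np.maximum(lam[:q] - sigma2, 0.0))
    return U, sigma2, W, lam

# Synthetic data generated from the model itself (hypothetical parameters).
rng = np.random.default_rng(1)
d, q, n = 6, 2, 50_000
U_true = rng.normal(size=(d, q))
sigma_true = 0.5
X = rng.normal(size=(n, q)) @ U_true.T + sigma_true * rng.normal(size=(n, d))

U, sigma2, W, lam = fit_ppca(X, q)
B = sigma2 * np.eye(d) + U @ U.T             # fitted model covariance
```

By construction, the fitted $\mathbf{B}$ reproduces the $q$ principal eigenvalues exactly along the columns of $\mathbf{W}$ and has variance $\sigma^2$ in every orthogonal direction. On the synthetic data, `sigma2` should come out close to `sigma_true**2`.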

To evaluate (2.11), only the $q$ principal eigenvalues and the total variance (the sum of the variances over all dimensions, which equals the trace of the covariance matrix) need to be known, because the sum of the $d - q$ minor eigenvalues equals the total variance minus the sum of the $q$ principal eigenvalues. It is therefore not necessary to compute the $d - q$ minor principal components. Thus, the introduction of the noise allows the density $p(\mathbf{x})$ to be defined over the whole of $\mathbb{R}^d$ while using a reduced parameter set (obtained by PCA). The noise variance in (2.11) determines how fast $p(\mathbf{x})$ decreases in directions orthogonal to the subspace spanned by the principal components.
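The shortcut of using the trace instead of the minor eigenvalues can be sketched as follows. To avoid a full eigendecomposition, the example estimates the $q$ principal eigenvalues by orthogonal (subspace) iteration, which is a technique chosen here for illustration and is not taken from the text. The function names, the iteration count, and the test spectrum are assumptions.

```python
import numpy as np

def leading_eigenvalues(C, q, iters=200, seed=0):
    """Approximate the q largest eigenvalues of a symmetric PSD matrix C
    by orthogonal iteration (illustrative choice of method)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(C.shape[0], q)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(C @ Q)           # re-orthonormalize the iterated subspace
    # Rayleigh-Ritz step: eigenvalues of C restricted to the q-dimensional subspace.
    return np.sort(np.linalg.eigvalsh(Q.T @ C @ Q))[::-1]

def noise_variance_from_trace(C, q):
    """Evaluate (2.11) from the trace and the q principal eigenvalues only."""
    d = C.shape[0]
    return (np.trace(C) - leading_eigenvalues(C, q).sum()) / (d - q)

# Covariance matrix with a known spectrum in a random orthonormal basis.
rng = np.random.default_rng(2)
spectrum = np.array([5.0, 3.0, 0.4, 0.3, 0.2, 0.1])
basis, _ = np.linalg.qr(rng.normal(size=(6, 6)))
C = basis @ np.diag(spectrum) @ basis.T

sigma2 = noise_variance_from_trace(C, q=2)

# Reference value from the full spectrum: mean of the four minor eigenvalues.
sigma2_full = np.sort(np.linalg.eigvalsh(C))[:4].mean()
```

Here the four minor eigenvalues 0.4, 0.3, 0.2, and 0.1 average to 0.25, so both `sigma2` and `sigma2_full` should agree with that value up to rounding. For large $d$ and small $q$, a dedicated sparse or iterative eigensolver would likely be preferable to this hand-written loop.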


Heiko Hoffmann
2005-03-22