Next: D. Database of hand-written Up: C. Proofs Previous: C.3 Estimate of error

C.4 Contraction of input vectors

This section shows that a multi-layer perceptron maps data points outside its training domain closer to that domain, provided the perceptron was trained to map data distributed on a circle onto the same circle (see section 7.4). Let $\beta \mathbf{s}$ be the input to the trained network, where $\mathbf{s}$ has unit length and $\beta$ is a scalar.

We study the effect of $\beta$ on the network output $\mathbf{o}$. Let $\mathbf{U}$ be an $h \times 2$ matrix containing the weights between the input and the hidden layer, and $\mathbf{V}$ a $2 \times h$ matrix containing the weights between the hidden and the output layer. Further, let $\mathbf{u}_k$ be a column vector of $\mathbf{U}$, and $\mathbf{v}_k$ a row vector of $\mathbf{V}$. We assume that all threshold values equal zero and that the weights fulfill $\mathbf{u}_k^T \mathbf{u}_l = \delta_{kl}$ and $\mathbf{v}_k^T \mathbf{v}_l = \delta_{kl}$.
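These assumptions are easy to realize numerically. The following Python/NumPy sketch (the value $h = 50$ and the particular choice $\mathbf{V} = \mathbf{U}^T$ are illustrative assumptions, not part of the proof) builds weight matrices with orthonormal $\{\mathbf{u}_k\}$ and $\{\mathbf{v}_k\}$ via a QR decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 50  # number of hidden units (illustrative choice)
# U: h x 2 matrix whose columns u_k are orthonormal (reduced QR of a random matrix)
U, _ = np.linalg.qr(rng.standard_normal((h, 2)))
# V: 2 x h matrix whose rows v_k are orthonormal and lie in span{u_k};
# the simplest such choice is V = U^T (an assumption for this sketch)
V = U.T
assert np.allclose(U.T @ U, np.eye(2))  # u_k^T u_l = delta_kl
assert np.allclose(V @ V.T, np.eye(2))  # v_k^T v_l = delta_kl
```

Any $2 \times 2$ rotation of $\mathbf{U}^T$ would satisfy the same conditions; $\mathbf{V} = \mathbf{U}^T$ is merely the simplest instance.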

We first look at the case $\beta = 1$. The network output is

$$o_i(1) = \sum_{j=1}^{h} v_{ij} \tanh\left( \sum_{k=1}^{2} u_{jk} s_k \right) . \tag{C.15}$$

As a result of the network training, $\mathbf{o}(1)$ has unit length. Let $\mathbf{y} = \mathbf{U}\mathbf{s}$ be the argument of the tanh-function. From the assumptions, it follows that $\mathbf{y}$ has unit length:

$$\|\mathbf{y}\|^2 = \left\| \sum_{k=1}^{2} s_k \mathbf{u}_k \right\|^2 = \sum_{k=1}^{2} s_k^2 = 1 . \tag{C.16}$$

Thus, the states $\mathbf{y}$ lie on a circle of radius one spanned by $\{\mathbf{u}_k\}$ in an $h$-dimensional space (figure C.2).

Figure C.2: Image of the training patterns (gray ellipse) in the space of the hidden neurons (here, $h = 3$). The circle lies in a plane spanned by $\{\mathbf{u}_k\}$. The vectors $\{\mathbf{v}_k\}$ lie in the same plane.
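Equation (C.16) can be checked numerically. A minimal sketch (assuming a random $\mathbf{U}$ with orthonormal columns built via QR, as above; $h = 50$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
h = 50                                         # illustrative number of hidden units
U, _ = np.linalg.qr(rng.standard_normal((h, 2)))
for phi in np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False):
    s = np.array([np.cos(phi), np.sin(phi)])   # unit-length input on the circle
    y = U @ s                                  # argument of the tanh-function
    assert np.isclose(np.linalg.norm(y), 1.0)  # (C.16): y has unit length
```

As the input $\mathbf{s}$ traverses the unit circle, $\mathbf{y}$ traverses a unit circle in the plane spanned by the two columns of $\mathbf{U}$.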

Let $\tilde{\mathbf{y}}$ be the vector with components $\tanh(y_j)$. A larger number $h$ of hidden units leads to smaller components of $\mathbf{y}$ (section 7.4: $y_j$ equals on average $1/h$). Therefore, we approximate $\tanh(y_j) \approx y_j$. It follows that $\tilde{\mathbf{y}}$ also lies on the circle in the span of $\{\mathbf{u}_k\}$.
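The quality of the linearization $\tanh(y_j) \approx y_j$ can be probed numerically: as $h$ grows, the components of the unit vector $\mathbf{y}$ shrink, and so does the worst-case error of the approximation. A sketch (the values of $h$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
errs = []
for h in (3, 30, 300):
    U, _ = np.linalg.qr(rng.standard_normal((h, 2)))
    y = U @ np.array([1.0, 0.0])                 # a point on the hidden-layer circle
    errs.append(np.max(np.abs(np.tanh(y) - y)))  # worst-case linearization error
```

Since $\|\mathbf{y}\| = 1$ regardless of $h$, larger $h$ spreads the same length over more components, pushing each $y_j$ into the near-linear region of $\tanh$.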

Next, we look at the effect of the weight matrix $\mathbf{V}$. After training, all inputs $\mathbf{s}$ (which have unit length) are mapped by (C.15) onto a circle with radius one. Thus, $\mathbf{V}$ needs to project the circle in the span of $\{\mathbf{u}_k\}$ onto the unit circle in the two-dimensional output space. This is only achieved if both row vectors $\mathbf{v}_1$ and $\mathbf{v}_2$ lie in the span of $\{\mathbf{u}_k\}$ (otherwise, the projection would be an ellipse). It follows that $\tilde{\mathbf{y}}$ also lies in the span of $\{\mathbf{v}_k\}$, and any vector $\tilde{\mathbf{y}}$ in this span can be written as $\tilde{\mathbf{y}} = \sum_k (\tilde{\mathbf{y}}^T \mathbf{v}_k) \mathbf{v}_k$.

Next, we look at the case $\beta > 1$. Let $\tilde{\mathbf{y}}(\beta)$ be the vector with components $\tanh(\beta y_j)$. Here, the above tanh-approximation is generally not valid, and $\tilde{\mathbf{y}}(\beta)$ might protrude out of the plane spanned by $\{\mathbf{v}_k\}$. Thus, we need to write $\tilde{\mathbf{y}}(\beta) = \sum_k \left( \tilde{\mathbf{y}}(\beta)^T \mathbf{v}_k \right) \mathbf{v}_k + \mathbf{b}$, with $\mathbf{b}$ orthogonal to $\{\mathbf{v}_k\}$. Taking the squared norm of this equation gives $\|\tilde{\mathbf{y}}(\beta)\|^2 = \sum_k |\tilde{\mathbf{y}}(\beta)^T \mathbf{v}_k|^2 + \|\mathbf{b}\|^2$, from which follows:

$$\sum_{k} \left| \tilde{\mathbf{y}}(\beta)^T \mathbf{v}_k \right|^2 \le \left\| \tilde{\mathbf{y}}(\beta) \right\|^2 . \tag{C.17}$$
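Inequality (C.17) is a Bessel-type inequality for the orthonormal system $\{\mathbf{v}_k\}$ and can be verified numerically for a sample point. A sketch under the same illustrative assumptions as before ($\mathbf{V} = \mathbf{U}^T$; the values of $h$, $\beta$, and the input angle are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
h = 20                                          # illustrative
U, _ = np.linalg.qr(rng.standard_normal((h, 2)))
V = U.T                                         # rows v_k orthonormal, in span{u_k}
beta = 3.0
y = U @ np.array([np.cos(0.7), np.sin(0.7)])    # point on the hidden-layer circle
yt = np.tanh(beta * y)                          # components tanh(beta * y_j)
proj_sq = float(sum((yt @ v) ** 2 for v in V))  # sum_k |y~(beta)^T v_k|^2
assert proj_sq <= np.dot(yt, yt) + 1e-12        # Bessel-type inequality (C.17)
```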

Therefore, for $\beta > 1$, the squared length of the output vector $\mathbf{o}$ can be bounded:

$$\|\mathbf{o}(\beta)\|^2 = \sum_{k=1}^{2} \left( \sum_{j=1}^{h} \tanh(\beta y_j) \, v_{kj} \right)^2 \le \sum_{j=1}^{h} \tanh^2(\beta y_j) < \beta^2 \sum_{j=1}^{h} \tanh^2(y_j) . \tag{C.18}$$

The first inequality is (C.17); the last inequality follows from $|\tanh(\beta y)| < \beta \, |\tanh(y)|$ for $\beta > 1$, a consequence of $\tanh$ being concave for positive arguments. Under the approximation $\tanh(y_j) \approx y_j$ and with (C.16), the last term in (C.18) equals $\beta^2$. Thus,

$$\|\mathbf{o}(\beta)\| < \beta . \tag{C.19}$$

Points further away from the circle are mapped closer to the circle (the training domain).
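The contraction (C.19) can be illustrated end-to-end by feeding inputs $\beta\mathbf{s}$ with $\beta > 1$ through the network $\mathbf{o} = \mathbf{V}\tanh(\mathbf{U}\beta\mathbf{s})$ and checking that the output length stays below $\beta$. A sketch (again with the illustrative choice $\mathbf{V} = \mathbf{U}^T$ and arbitrary $h$, $\beta$, and input direction):

```python
import numpy as np

rng = np.random.default_rng(4)
h = 40                                       # illustrative network size
U, _ = np.linalg.qr(rng.standard_normal((h, 2)))
V = U.T                                      # assumed choice satisfying the proof's conditions
s = np.array([np.cos(1.2), np.sin(1.2)])     # unit-length direction
for beta in (1.5, 3.0, 10.0):
    o = V @ np.tanh(beta * (U @ s))          # network output for input beta * s
    assert np.linalg.norm(o) < beta          # (C.19): output is pulled toward the circle
```

Because $|\tanh(x)| < |x|$ for $x \neq 0$ and the projections onto $\{\mathbf{v}_k\}$ can only shorten the hidden-layer vector, the output length stays strictly below the input scale $\beta$.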

Heiko Hoffmann