Next: 4.7 Discussion Up: 4. Abstract recurrent neural Previous: 4.5.2 Results

4.6 Dependence on the number of input dimensions

As shown in section 4.5, the performance of the abstract recurrent neural network depends on the number of input dimensions chosen. A higher number of input dimensions results in a higher sum of square errors per pattern. This relationship is investigated theoretically on a simplified abstract RNN.

The first simplification is that the model consists of spheres instead of ellipsoids. Thus, the distribution of training data is approximated by a set of m code-book vectors $\bf c^{j}_{}$ , and for each of them the potential field is given by the Euclidean distance. The second simplification is that the code-book vectors are uniformly randomly distributed. This is justified for distributions that are wrapped inside the pattern space, and are not restricted to embedded hyper-planes, as it is likely the case for the kinematic arm model. Recall works as in section 4.2 (see also figure 4.14).

**Figure 4.14:** The trained abstract network consists of a set of code-book vectors, here illustrated as circles. These vectors are distributed randomly. Recall happens by finding the closest vector to a constraint space (gray line). The offset of this line from the origin is the input.
$\includegraphics[width=9cm]{network.eps}$

We assume that the code-book vectors lie inside a d-dimensional cube of side length two, centered at the origin. Since the code-book vectors are distributed uniformly, the average error is independent of the specific value of the input (offset of the constrained space)^4.3. Thus, we arrange the constrained spaces such that they all go through the origin. If d - 1 input dimensions are given, the constraint is the x_d-axis. For d - 2 input dimensions the axis x_d and x_d-1 span the constraint space, and so on (see figure 4.15 as an example). Instead of choosing different input values to compute the sum of square errors, we draw a new set of $\bf c^{j}_{}$ from a random distribution for each test trial.

**Figure 4.15:** Example using r = 1 input dimensions in a n = 3 dimensional space. The x₂x₃ plane is the constrained space. Code-book vectors are illustrated as circles. Possible code-book locations are enclosed by the cube drawn with dotted lines. The dashed lines show the distances between constraint and code-book vectors.
$\includegraphics[width=9cm]{constraint.eps}$

Given r input dimensions, the squared distance E^j of $\bf c^{j}_{}$ to the constrained space is

E^j = $\displaystyle \sum_{{i=1}}^{r}$ $\displaystyle \left(\vphantom{c^{ j}_i}\right.$ c^j_i $\displaystyle \left.\vphantom{c^{ j}_i}\right)^{2}_{}$ .

(4.8)

We define the square error E as the minimum squared distance to the data approximation (in the kinematic arm model, this matches the computation of the square error in the case of arbitrary directions, see section 4.5.1). Thus,

E = min $\displaystyle \left\{\vphantom{\sum_{i=1}^r (c^1_i)^2, \sum_{i=1}^r (c^2_i)^2, \dots,\sum_{i=1}^r (c^m_i)^2}\right.$ $\displaystyle \sum_{{i=1}}^{r}$ (c¹_i)², $\displaystyle \sum_{{i=1}}^{r}$ (c²_i)²,..., $\displaystyle \sum_{{i=1}}^{r}$ (c^m_i)² $\displaystyle \left.\vphantom{\sum_{i=1}^r (c^1_i)^2, \sum_{i=1}^r (c^2_i)^2, \dots,\sum_{i=1}^r (c^m_i)^2}\right\}$ .

(4.9)

The sum of square errors per pattern is the expectation value of E given a random distribution of $\bf c^{j}_{}$ . To compute the expectation of a minimum, we use the following trick (Wentzell, 2003). The cumulative probability P_c(T) that all E^j are larger than a threshold T is computed. P_c(T) is monotone descending, starting with P_c(0) = 1. The negative derivative of P_c(T) can be interpreted as the probability density function of T. For a given T, this function provides the probability density that T equals the smallest member of the set {E^j}. Thus, the expectation value of T (the first momentum of the probability density function) equals the expectation value of E, the minimum of {E^j}. Therefore,

$\displaystyle \left\langle\vphantom{ E }\right.$ E $\displaystyle \left.\vphantom{ E }\right\rangle$ = $\displaystyle \left\langle\vphantom{ T }\right.$ T $\displaystyle \left.\vphantom{ T }\right\rangle$ = - $\displaystyle \int$ $\displaystyle {\frac{{d P_c(T)}}{{d T}}}$ T dT .

(4.10)

The probability p that a point $\bf c^{j}_{}$ has a squared distance E^j larger or equal to T is the cube volume outside the r-sphere^4.4with radius $\sqrt{{T}}$ centered at the origin divided by the cube volume (figure 4.16). The cumulative probability P_c is p to the power of m.

**Figure 4.16:** The gray area relative to the total area of the square is the probability that a point lies outside the circle. Two examples with different radii are shown.
$\includegraphics[width=13cm]{probarea.eps}$

To make the function p over T analytically integrable, we make an approximation. The r-cube, the space of all possible codebook vectors (in the first r dimensions), is replaced by an r-sphere with radius $\sqrt{{r}}$ centered at the origin. This sphere encloses tightly the r-cube. To compensate this step, we multiply the number of codebook vectors m with the r-sphere volume divided by the r-cube volume. The volume of an r-sphere with unit radius can be written as

V_r = $\displaystyle {\frac{{\pi^{\frac{r}{2}}}}{{\Gamma(\frac{r}{2}+1)}}}$ .

(4.11)

$\Gamma$ is the Gamma-function. It is related to the factorial by $\Gamma$ (n) = (n - 1)! for positive integers n. For r = 2, for example, the above volume V_r equals $\pi$ . Hence, the resulting volume relation v^r (here, for the sphere radius $\sqrt{{r}}$ ) is

v^r = $\displaystyle {\frac{{\pi^{\frac{r}{2}} r^\frac{r}{2}}}{{\Gamma(\frac{r}{2}+1) 2^r}}}$ .

(4.12)

The above step is justified by the uniform distribution of the $\bf c^{j}_{}$ . But, it is not an equivalence transformation (for r > 1). The quality of this approximation will be tested later. The resulting number of vectors, $\mu$ = int(v^rm), is rounded to an integer value. With the approximation, the probability p(T) that a vector has a squared distance E^j $\ge$ T can be expressed as

p(T) = 1 - $\displaystyle {\frac{{\sqrt{T} ^r}}{{\sqrt{r} ^r}}}$ .

(4.13)

The total probability P_c that all vectors fulfill the above condition is

P_c(T) = $\displaystyle \prod_{{j=1}}^{\mu}$ p(T) = (1 - ( $\displaystyle {\frac{{T}}{{r}}}$ )) .

(4.14)

The value of T extends from 0 to r. Using (4.10), we obtain for the expectation value of E

$\displaystyle \left\langle\vphantom{ E }\right.$ E $\displaystyle \left.\vphantom{ E }\right\rangle$ = - $\displaystyle \int_{0}^{r}$ $\displaystyle {\frac{{d P_c(T)}}{{d T}}}$ T dT = $\displaystyle \int_{0}^{r}$ P_c(T)dT - P_c(T)T $\displaystyle \Big\vert _{0}^{r}$ = $\displaystyle \int_{0}^{r}$ P_c(T)dT .

(4.15)

The second equality sign uses integration by parts, the last equality sign uses (4.14). The final integration, using (4.14), gives

$\displaystyle \left\langle\vphantom{ E }\right.$ E $\displaystyle \left.\vphantom{ E }\right\rangle$ = r $\displaystyle {\frac{{\Gamma(\frac{r+2}{r}) \Gamma(\mu + 1)}}{{\Gamma(\frac{2+r+\mu r}{r})}}}$ .

(4.16)

The integral was solved with the help of MATLAB® and its symbolic toolbox. Using the equality $\Gamma$ (x) = (x - 1) $\Gamma$ (x - 1), the expression can be simplified to

$\displaystyle \left\langle\vphantom{ E }\right.$ E $\displaystyle \left.\vphantom{ E }\right\rangle$ = r $\displaystyle \prod_{{j=1}}^{\mu}$ $\displaystyle {\frac{{j}}{{j+2/r}}}$ .

(4.17)

For large $\mu$ , the latter is more feasible for numerical evaluation than (4.16). We further investigate the quality of our approximation. The case m = 1 can be evaluated correctly. For one codebook vector, the expectation of its squared distance can be directly calculated:

$\displaystyle \left\langle\vphantom{ E }\right.$ E $\displaystyle \left.\vphantom{ E }\right\rangle$ = 2^-r $\displaystyle \int_{{-1}}^{1}$ $\displaystyle \int_{{-1}}^{1}$ ^... $\displaystyle \int_{{-1}}^{1}$ $\displaystyle \sum_{{i=1}}^{r}$ c_i² $\displaystyle \prod_{{i=1}}^{r}$ dc_i = 2^-rr $\displaystyle {\frac{{2^r}}{{3}}}$ = $\displaystyle {\frac{{r}}{{3}}}$ .

(4.18)

Figure 4.17 shows the result of this comparison. The mismatch increases with the number of input dimensions r.

**Figure 4.17:** Comparison between the approximation, replacing the r-cube by a r-sphere resulting in (4.12) and (4.17), and the correctly evaluated result (4.18) for one code-book vector.
$\includegraphics[width=11.3cm]{dependinput/checkm1.eps}$

We want to check our approximation using larger m. The result from (4.12) and (4.17) was compared to a simulation, in which $\bf c^{j}_{}$ were drawn randomly from a r-cube with side length 2. Figure 4.18 shows the result, using r = 10 input dimensions and 100 000 trials for each m value in the simulation. The approximation got better the more codebook vectors were used. The number of codebook vectors needed for a good approximation depends on the value of r. The higher r, the more vectors are needed.

**Figure 4.18:** Comparison between the theory, using (4.12) and (4.17), and the result from a simulation, with r = 10. The number of code-book vectors is m.
$\includegraphics[width=11.7cm]{dependinput/checkr10.eps}$

Finally, the dependency on the number of input dimensions is demonstrated for the case m = 200. As above, the theory is compared to a simulation, using 100 000 trials for each r value. In this test, theory and simulation results did overlap (figure 4.19). Between r = 5 and r = 8 the increase is approximately exponential with exponent 0.69.

**Figure 4.19:** Dependency of the mean square error per output dimension on the number of input dimensions r for m = 200 (d = 10). Theory, using (4.12) and (4.17), and simulation results are shown.
$\includegraphics[width=12cm]{dependinput/inputdim.eps}$

Next: 4.7 Discussion Up: 4. Abstract recurrent neural Previous: 4.5.2 Results

Heiko Hoffmann
2005-03-22