The abstract recurrent neural network based on the mixture of local PCA and the pattern association based on kernel PCA could both be applied successfully to learn visually guided reaching and grasping. Here, recall with the mixture of local PCA was 2000 times faster than with kernel PCA.
MPPCA-ext performed better on this task than NGPCA and NGPCA-constV. The distribution of training data is sparse (3371 patterns in 68 dimensions) and thin (locally three-dimensional, with little noise); in such cases, MPPCA-ext proved superior to the NGPCA variants (figures 3.7 and 3.8). NGPCA had problems with dead units (units to which no patterns are assigned; see figure 6.8). As anticipated in section 3.2.1, the modification NGPCA-constV solved this problem (figure 6.8). Both NGPCA variants were sensitive to the choice of training parameters (table 6.1).
Reaching and grasping were achieved by associating final arm postures, not by planning trajectories. This association is consistent with the finding that in the monkey, the stimulation of certain motor cortex neurons drives the hand to a specific location independent of the initial arm posture (Graziano et al., 2002). Moreover, such an association may explain why neurons in the premotor cortex area F5 fire both during the presentation of a tangible object and during the grasping of the object (Rizzolatti and Fadiga, 1998; Murata et al., 1997; Rizzolatti et al., 1988). Murata et al. (1997) wrote, ``the visual features are automatically (regardless of any intention to move) `translated' into a potential motor action'' (p. 2229).
The present study further showed that, on the one hand, the dimensionality of the original images needs to be reduced and, on the other hand, it cannot be reduced too much; redundancy proved to be helpful. Redundancy is also widespread in the brain. It probably has the following two effects:
First, as mentioned in section 5.1, data points in a higher-dimensional space are more likely to be linearly separable (Cover, 1965). Therefore, they can be better described by locally linear models.
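For illustration, this statement can be made quantitative with Cover's function-counting theorem; the formula below is the standard result, restated here, and not part of the original analysis. The number of dichotomies of $n$ points in general position that a hyperplane through the origin in $d$ dimensions can realize is
\[
C(n, d) = 2 \sum_{k=0}^{d-1} \binom{n-1}{k},
\]
so the fraction $C(n,d)/2^n$ of linearly separable dichotomies grows with the dimension $d$.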
Second, the redundant coding reduces the effect of noise (Latham et al., 2003). The small effect that the noise had on the performance of the abstract RNN (table 6.3) can partly be explained by redundancy. The extent of the end-effector positions on the table was 30×40 cm. Thus, without population coding, 10% noise leads to a position error with a variance of 208 mm$^2$ (given a uniform noise distribution). Compared to this value, the increase in the variance of the position error for the look-up table--which cannot use averaging for noise compensation--was smaller ($17.5^2 - 13.0^2 = 137$ mm$^2$).
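The value of 208 mm$^2$ can be reconstructed as follows (a uniform distribution of width $w$ has variance $w^2/12$; here, the noise width is assumed to be 10% of each table dimension):
\[
\sigma^2 = \frac{(0.1 \cdot 300\,\mathrm{mm})^2}{12}
         + \frac{(0.1 \cdot 400\,\mathrm{mm})^2}{12}
         = 75\,\mathrm{mm}^2 + 133\,\mathrm{mm}^2
         \approx 208\,\mathrm{mm}^2 .
\]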
Another explanation was found for the low performance when a population code was used only for the visual information (table 6.4). Here, the input is 20-dimensional, and the output is only 12-dimensional. In addition, for the same input, redundant postures were possible that differed only in the joint angle near the gripper (section 6.2.2). Thus, the corresponding training patterns differed in only two out of 32 dimensions (one for pre-grasping and one for grasping). Since these patterns were relatively close, they were both assigned to a single unit in the mixture model. In recall, the output was then averaged over the redundant postures within one unit, which resulted in an erroneous orientation of the gripper. This explains the higher orientation error, while the position error was almost the same as in the case with population-coded angles (table 6.4).
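This averaging artifact can be illustrated with a minimal sketch (the tuning-curve parameters below are assumptions for illustration, not the values used in the experiments): encoding two redundant wrist angles with Gaussian tuning curves, averaging the two codes as a single mixture unit effectively does, and decoding with a center-of-gravity readout yields an intermediate angle that is wrong for both postures.
\begin{verbatim}
import numpy as np

# Minimal sketch, not the thesis code: Gaussian tuning curves
# encode a scalar angle; averaging the codes of two redundant
# wrist angles and decoding gives an erroneous intermediate angle.

centers = np.linspace(-180.0, 180.0, 10)  # preferred angles (deg)
sigma = 40.0                              # tuning width (assumed)

def encode(angle):
    """Population code: one Gaussian activation per unit."""
    return np.exp(-0.5 * ((angle - centers) / sigma) ** 2)

def decode(code):
    """Center-of-gravity readout of a population code."""
    return np.sum(code * centers) / np.sum(code)

code_a, code_b = encode(-90.0), encode(90.0)  # redundant postures
mixed = 0.5 * (code_a + code_b)               # averaged within one unit

print(decode(code_a), decode(code_b))  # close to -90 and +90
print(decode(mixed))                   # close to 0: wrong for both
\end{verbatim}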
The image processing in this chapter relates to biology in several ways. First, it is parallel and local within the image. Second, a compass filter is a simplified version of a simple cell in the primary visual cortex (V1) (Hubel and Wiesel, 1962). Third, like neighboring V1 cells (Blasdel and Salama, 1986), neighboring Gaussian activation functions (coarse image) respond similarly to a given stimulus. And fourth, the final preprocessed information is given in population codes.
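As an illustration of such an orientation-selective filter, the sketch below applies one Kirsch-style compass kernel to a small image; the kernel and its normalization are generic examples and may differ from the filters actually used in this chapter.
\begin{verbatim}
import numpy as np

# Generic illustration, not the thesis implementation: a compass
# kernel responds to edges of one orientation, analogous to a V1
# simple cell; a full compass filter uses one kernel per direction.

kernel_east = np.array([[-3.0, -3.0, 5.0],
                        [-3.0,  0.0, 5.0],
                        [-3.0, -3.0, 5.0]])  # Kirsch kernel, east

def filter_response(image, kernel):
    """Parallel, local 3x3 correlation over a grayscale image."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(image[y:y + 3, x:x + 3] * kernel)
    return out

# A vertical edge (dark left, bright right) excites this kernel:
img = np.zeros((5, 5))
img[:, 3:] = 1.0
print(filter_response(img, kernel_east))
\end{verbatim}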
Population codes and tuning curves are widespread in the brain. Tuning curves can be observed, for example, in the monkey for the direction of moving stimuli (Treue and Trujillo, 1999) and in the cricket for the direction of airflow (Miller et al., 1991). The abstract RNN can directly associate one population code with another, without decoding to scalar values. For the robot arm, the population-coded joint angles were decoded. For a biological system, however, such a step can be omitted since a population code can act directly on a muscle. A theoretical account of this was given by Baldi and Heiligenberg (1988).
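Decoding the joint angles can be written, for example, as the center-of-gravity readout used in the sketch above (a common choice, given here for illustration; the decoder in this chapter may differ in detail): with unit activations $a_i$ and preferred angles $\theta_i$,
\[
\hat{\theta} = \frac{\sum_i a_i \theta_i}{\sum_i a_i} .
\]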
The presented robot-arm setup and the flexibility of the abstract RNN offer several options to extend the task. First, the retinal object position (coarse image) can be replaced by information on the gaze direction of the cameras. This was achieved in cooperation with Wolfram Schenck (Schenck et al., 2003). In that study, a saccade controller (Schenck and Möller, 2004) controlled a pan-tilt unit, which carried the stereo camera system. The saccade controller learned to fixate the brick on the table. Then, the tilt and pan variables that define the gaze direction were encoded with tuning curves, as in section 6.2.4. The resulting population codes, together with the edge histogram, were sufficient to associate an arm posture for grasping. However, the saccade controller required feedback from the environment; therefore, an understanding of the object's location through covert motor commands is no longer possible.
Second, monocular vision can be extended to stereo vision. The image processing can be applied to both cameras separately, and the resulting population codes can all be fed into the abstract RNN. Stereo vision would allow grasping in three-dimensional space (Kuperstein, 1990). However, it is difficult to collect training samples because an object held by the robot is at least partially occluded by the gripper.
Third, for grasping, the training set can be extended to bricks that do not lie, but stand on the table. A standing brick can be put on the table with the gripper in a horizontal orientation. The abstract RNN would then learn both cases: lying bricks and standing bricks. As a result, the image of a standing brick would associate a different gripper orientation than the image of a lying brick. The robot could therefore perceive (or understand) whether the brick is standing or lying, depending on the associated arm posture.
Fourth, the training set can be extended to include other objects. Different objects can be grasped in different ways. The analysis of associated grasping postures could therefore be used to identify the objects. However, this association would not solve object constancy, since the association of arm postures cannot do better than a classification of object images. A solution to object constancy could be to anticipate the sensory consequences of a sequence of motor commands (section 1.4.3; Möller, 1999). Chapter 7 presents a mobile robot that simulates such a sequence.