
Interaction between data selection and MLP training:
modeling hippocampus/neocortex with finite capacity memory

Janet Wiles
Departments of Computer Science and Psychology
University of Queensland
janetw@cs.uq.edu.au

Target Papers: Simulations: Recency and error rehearsal over quasi-combinatorial data.

Introduction:

The overall task of the model

The model targeted in this case study is based on an analogy made by McClelland, McNaughton and O'Reilly (1994) between the properties of the multi-layer perceptron (MLP) and the memory requirements of the hippocampus/neocortex (HNC) memory system. The hippocampus is viewed as a short-term memory store, and the neocortex as a memory system able to learn the structure underlying the temporarily stored items. Functionally, these are roles played by the training batch (a temporary store) and the MLP (as the learning system).

The task considered in this Case Study is an idealized version of a reading task - mapping letters to phonemes. The task was designed by S. Andrews at UNSW to explore the representational properties of networks over different levels of regularity. Two of Andrews's data sets were used in the simulations reported: the first was regular data, which comprised a one-to-one mapping of letters to phonemes; the second was quasi-regular, in which most of the mappings followed the regular pattern, but some irregular mappings were included. (Most vowels were short, e.g., b-a-t, h-o-t, s-a-x were pronounced as they would be spoken, but some vowels were long, e.g., c-i-m would be pronounced `kime', s-e-p would be pronounced `seep'). The task itself is not intended to represent a specific type of task performed by the HNC memory system: Its use in the study is simply as a conveniently structured data set to investigate generalization performance over combinatorial domains.

The data and input/output task

In the reading task, syllables are presented to the MLP as 3-letter patterns (onset, vowel and coda). There are 6 possible letters for each position, and each letter is represented by a 6-bit local code. The output patterns are similar in that each is a 3-phoneme pattern, with each phoneme represented by a local code. In the quasi-regular data, two of the 6 vowels were mapped to a unique short phonetic form, with the other four varying between long and short forms, depending on the combination of the onset and coda letters. There were 216 (6x6x6) syllables, 12 of them being irregular.
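The local coding described above can be sketched as follows. This is a minimal illustration, not the original simulation code: the source does not list the actual letters, so each position is represented here by a hypothetical index from 0 to 5, and the names `one_hot`, `encode_syllable` and `ALL_SYLLABLES` are introduced only for this example.

```python
from itertools import product

N_LETTERS = 6    # possible letters per position (from the text)
N_POSITIONS = 3  # onset, vowel, coda


def one_hot(index, size=N_LETTERS):
    """Return a local code: `size` bits with a single 1 at `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range for local code")
    return [1 if i == index else 0 for i in range(size)]


def encode_syllable(onset, vowel, coda):
    """Concatenate three 6-bit local codes into one 18-bit input pattern.

    The arguments are hypothetical letter indices (0-5), since the
    actual letter inventory is not given in the text.
    """
    return one_hot(onset) + one_hot(vowel) + one_hot(coda)


# The whole domain: every combination of onset, vowel and coda index.
ALL_SYLLABLES = list(product(range(N_LETTERS), repeat=N_POSITIONS))
```

Under these assumptions the domain has 216 members, and every input pattern has 18 bits of which exactly three are set, one per position.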

The training paradigm involved two stages: An initial set of items was selected (Set A) fixing the capacity of the training batch. Three different capacity limits were used - 8, 50 and 108 items. The MLP was trained on Set A to criterion (all outputs within 0.2 of their targets). This first stage of training is the usual procedure followed in MLPs. At the end of this stage, the performance of the MLP was as expected - Set A was learned well, and the larger its size, the better the generalization to other items in the domain.
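The training criterion mentioned above (all outputs within 0.2 of their targets) can be written as a simple check. This is a sketch under stated assumptions: the text does not say whether the bound is strict or inclusive, so an inclusive bound is assumed here, and the function names are hypothetical.

```python
def item_meets_criterion(outputs, targets, tolerance=0.2):
    """True if every output unit is within `tolerance` of its target.

    Assumption: the bound is inclusive (<=); the text does not specify.
    """
    return all(abs(o - t) <= tolerance for o, t in zip(outputs, targets))


def batch_meets_criterion(batch_outputs, batch_targets, tolerance=0.2):
    """True if every item in the training batch meets the criterion."""
    return all(
        item_meets_criterion(o, t, tolerance)
        for o, t in zip(batch_outputs, batch_targets)
    )
```

A single output unit that misses its target by more than the tolerance is enough for the whole batch to count as not yet learned.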

The second (and more interesting) stage of the training paradigm concerned the gradual addition of remaining items to the finite capacity training batch: As each additional item was added to the training batch, one of the old items was deleted (so as not to exceed the capacity limit) and the MLP retrained on the modified training batch. Gray and Wiles tested several methods of deleting items. The first and simplest method of deleting items was recency rehearsal, in which the oldest item was removed (Gray & Wiles, 1996a); the second was a more complex scheme called error rehearsal, in which the item with the lowest error was removed.
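The two deletion schemes can be sketched as operations on a finite-capacity list. This is an illustrative sketch, not Gray and Wiles's implementation: the function names, the `item_error` callback, and the tie-breaking rule (the first item with the lowest error is removed) are assumptions. In line with the description above, the deleted item is always chosen from the old items, so the new item always enters the batch.

```python
def add_with_recency_rehearsal(batch, new_item, capacity):
    """Add `new_item`, removing the oldest item if the batch is full.

    `batch` is a list ordered from oldest to newest; a new list is returned.
    """
    updated = list(batch)
    if len(updated) >= capacity:
        updated.pop(0)  # the oldest item sits at the front
    updated.append(new_item)
    return updated


def add_with_error_rehearsal(batch, new_item, capacity, item_error):
    """Add `new_item`, removing the old item with the lowest error.

    `item_error` is a hypothetical callback that returns the network's
    current error on an item. Ties go to the earliest item (assumption).
    """
    updated = list(batch)
    if len(updated) >= capacity:
        best_learned = min(updated, key=item_error)
        updated.remove(best_learned)
    updated.append(new_item)
    return updated
```

Recency rehearsal needs no information about the network, whereas error rehearsal needs the current error on every item in the batch before it can choose which one to drop.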

The network

The MLP consists of 3 layers (an 18-18-22 MLP): 18 input units, corresponding to the three letter positions (3x6); 18 hidden units; and 22 output units, corresponding to the three phoneme positions (3x6 + 4 irregular vowels which were only used in the quasi-regular case). Both hidden and output units used sigmoid activation functions. Input/target pairs from the training batch were presented randomly to the network, which was trained using backpropagation. It should be noted that the model of the HNC memory system comprises both the MLP and the batch method of training, as the finite capacity of the batch is a crucial aspect of the model.
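A network of this shape can be sketched in plain Python as below. The layer sizes and sigmoid units come from the description above; everything else is an assumption made for illustration, including the class name `TinyMLP`, the bias weights, the initial weight range, the learning rate, the squared-error objective, and the per-pattern (online) weight update.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """An 18-18-22 sigmoid MLP trained by backpropagation (sketch only)."""

    def __init__(self, n_in=18, n_hidden=18, n_out=22, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; the last entry in each row is a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        """Return (hidden activations, output activations) for input x."""
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        output = [sigmoid(sum(w * v for w, v in zip(row, hb)))
                  for row in self.w_out]
        return hidden, output

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update; returns the pre-update squared error."""
        hidden, output = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Error signal at each sigmoid output unit.
        delta_out = [(t - o) * o * (1.0 - o) for o, t in zip(output, target)]
        # Error signal at each hidden unit, using the not-yet-updated weights.
        delta_hidden = [
            h * (1.0 - h) * sum(d * self.w_out[k][j]
                                for k, d in enumerate(delta_out))
            for j, h in enumerate(hidden)
        ]
        for k, d in enumerate(delta_out):
            for j, v in enumerate(hb):
                self.w_out[k][j] += lr * d * v
        for j, d in enumerate(delta_hidden):
            for i, v in enumerate(xb):
                self.w_hidden[j][i] += lr * d * v
        return sum((t - o) ** 2 for o, t in zip(output, target))
```

In this sketch the four extra output units for the irregular vowels are simply the last four of the 22 outputs; how they were ordered in the original simulations is not stated in the text.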

Describe how the training scheme addresses the cognitive task

The I/O task involves a simple mapping, and the interest lies not in the task itself, but in the interaction between the quasi-regular data set and the generalization performance of the MLP: The role of the hippocampus as a temporary memory store is predicated in part on the computational necessity of storing items that are presented only once to the HNC memory system. It is well known that MLPs do not develop well structured representations of unrelated items that are presented sequentially, in the worst case exhibiting a phenomenon known as catastrophic interference.

McClelland et al. (1994; hereafter MMO) proposed that the hippocampus, acting as a temporary memory store, interleaves presentations of its items to neocortex in a manner analogous to MLP training. The training batch then corresponds to initial storage of items in the hippocampus. The simulations showed that the structure underlying regular items can be learned reasonably well using even a simple rehearsal scheme such as recency rehearsal, but that interference remains a problem for irregular items and some of their nearest regular neighbours, as soon as they are omitted from the training batch. This result partially supports MMO's analogy, but does not solve the problem of irregular items. With the error rehearsal scheme, both irregular and regular items showed much less interference, even after leaving the training batch.

Structure:

How structure in the world is represented as structure in the model

In these simulations, the structure ``in the world'' consists of the regularities in the letter-phoneme mapping task, which are revealed through the statistics of the training items. The irregular items follow the regular pattern in the mapping of onset and coda letters, but are exceptions in the mapping of vowels. The structure is represented in hidden-unit space (although it was not investigated in the simulations described here).

In recency rehearsal the MLP learned the regular structure of items sufficiently to generalize to items not in the training batch, but only learned the irregular items for the time that they remained in the training batch. In error rehearsal, for the large batch size, the majority of both regular and irregular items were shielded from interference after leaving the training set. It seems plausible that as irregular items and their regular neighbours spent longer in the training batch, they would be likely to have better separation in hidden-unit space, and after such items were well learned, new items would be less likely to interfere with the hidden-unit space separations.

Memory:

The information to be stored

There are two levels at which information is stored in the memory: the first is the presentation of each item from the domain to the training batch; the second concerns how the structure of the domain is stored in the hidden-unit space.

The mapping task comprises (1) regular letter-to-phoneme correspondences in which each letter in each position can be treated independently and (2) irregular correspondences, in which certain vowels, in the context of specific consonants, map to long vowels. The composition of an input pattern as the three variables of onset, vowel, and coda is extracted from the statistics of the patterns in the domain. No such division is initially encoded into the network in any way. The input/output task is a simple encoder in that information in the domain to be stored in the model is contained in the individual items, not in their sequences or temporal relationships.

The mechanisms of storage

The MLP model of the HNC memory system uses the training batch as a fast "one-shot" learning system. A plausible implementation mechanism for such a system is not addressed in this simulation, but its requirement is recognized. That is, the memory provided by the training batch is functionally necessary but there is no commitment that a more accurate model of hippocampus share any properties except that of fast storage of items. The focus for the simulations is the effect of a finite capacity for this information. Thus the training batch is limited to a finite size, and each additional item causes a current item in the batch to be removed. In both selection methods, the information entering the training batch is the same, but the items leaving the batch differ. In recency rehearsal, the selection of items is passive, and could be implemented by a simple decay mechanism. In error rehearsal, selection of items requires feedback on the performance of each item.

The mechanisms of memory in the MLP involve a slow process of learning of the weights using backpropagation. Each training batch is presented as many times as required for the MLP to learn the items it contains.
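The idea of presenting the batch as many times as required can be sketched as a loop that stops once every item meets the criterion. This is a minimal sketch under stated assumptions: `train_step` and `predict` are hypothetical callbacks standing in for the MLP's weight update and forward pass, the `max_epochs` cap is added so the loop always terminates, and a fresh random order is assumed on each pass through the batch.

```python
import random


def train_to_criterion(batch, train_step, predict,
                       tolerance=0.2, max_epochs=10000, seed=0):
    """Present `batch` repeatedly until all outputs are within `tolerance`.

    `batch` is a list of (input, target) pairs. Returns the number of
    passes through the batch that were needed, or None if `max_epochs`
    (a safety cap assumed here, not in the text) is reached first.
    """
    rng = random.Random(seed)
    for epoch in range(1, max_epochs + 1):
        order = list(batch)
        rng.shuffle(order)  # random presentation order on each pass
        for x, target in order:
            train_step(x, target)
        learned = all(
            abs(o - t) <= tolerance
            for x, target in batch
            for o, t in zip(predict(x), target)
        )
        if learned:
            return epoch
    return None
```

In the two-stage paradigm this loop would be run once on Set A and then again after each change to the training batch.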

Describe how the mechanisms achieve the memory

Performance on the regular items is in part due to generalization from other items, as the MLP generalizes well to novel combinations of familiar items (though it cannot generalize to novel positions of items).

The ability of the MLP to represent any arbitrary combination of regular and irregular items is one of its attractions as a learning mechanism. However, it is also a cause of interference, as the structure learned by the MLP reflects the structure of items in the training batch. Old items deleted from the training batch can be interfered with by new items that have a conflicting structure (for an old regular item, a new neighbouring irregular item could cause interference, and vice versa).

Time:

There are two time scales in the model: the first is the presentation of items to the whole system, viewed as the presentation of items to the training batch; the second is conceptually a much faster time scale, concerning the interleaved training of the MLP using the existing training batch. The former sense of time is illustrated in graphs demonstrating the percentage generalization to the whole domain after training on each additional item. As in many neural network models, time is time as sequence, and information about the absolute or relative durations of events is discarded. In recency rehearsal, time is implicitly represented in the training batch as the recency of items. In error rehearsal, recent items are more likely than older ones to be in the training batch, but are not constrained to be.

Time in the processing and in the learning are similar to that described in case study #1 for the SRN.

Change:

The key aspects of change in the MLP model of HNC are due to the changes in the training batch, and are reflected in the generalization during training from the partially trained network to the whole domain. The performance of the initial set as training progressed conflicts with the conventional view of interference: Immediately after training on the initial set, all items were perfectly learned. Further training resulted first in a drop in performance, then later in recovery of the regular items (but not the irregular ones) from the initial set. The recovery is due to generalization from later items, as the MLP learns the regular structure underlying the letter-phoneme mapping domain.

Discussion and Conclusions:

The power of the HNC model to discover structure in quasi-regular data lies in the use of the training batch as a temporary store to interleave items for training of the MLP. In original studies of catastrophic interference, MLPs were trained on unstructured data which was presented directly to the MLP itself. MMO recognized that the hippocampus may play a role that allowed interleaving of items, and the current simulations by Gray and Wiles studied the effect of quasi-regular data.

The conclusion from these simulations on the HNC analogy is that the interleaving process needs to take account of the rehearsal scheme: When constrained by a limited-size training batch, error rehearsal (but not recency rehearsal) is sufficient to cope with catastrophic interference.

The objection may reasonably be raised that neocortex does not in any way resemble an MLP. As in all modeling work, the use of a simple design to capture principles of interest of a cognitive system requires much detail to be left out. In this case, the principle is a very specific one concerning the relationship between two complex systems: In MMO's model, the function of hippocampus is as a source of examples for interleaved training of the neocortex, whose function is that of a structured learning system. The simulations discussed here serve as an alert to a problem with MMO's analogy - that irregular items and their regular neighbours can suffer interference after leaving the training set, even when they have been trained by an interleaving system. The error rehearsal scheme serves to demonstrate that the problem is not insurmountable - it may not be the whole (or even part) of the solution to the dilemma of interference in the structured training of the neocortex, but if not, then the dilemma remains as a question to be answered.

The simulations also have bearing on the catastrophic interference debate, purely from a computational point of view: regular combinatorial data has been noted for its massive generalization properties (Brousse & Smolensky, 1989). These simulations show that with error rehearsal, irregular items and their nearest neighbours can also be learned to a high level.

References

McClelland, J.L., McNaughton, B.L., and O'Reilly, R.C. (1994). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Technical Report PDP.CNS.94.1. ftp://hydra.psy.cmu.edu:/pub/pdp.cns

Brousse, O. and Smolensky, P. (1989). Virtual memories and massive generalization in connectionist combinatorial learning. In Proceedings of the 11th Annual Conference of the Cognitive Science Society, 380-387.