[Table of Contents]

Learning Performance of Networks like Elman's Simple Recurrent Networks but having Multiple State Vectors

William H. Wilson
School of Computer Science and Engineering
University of New South Wales
billw@cse.unsw.edu.au


Target Papers:

Introduction:

The cognitive task of the model

The networks described in these papers were designed for a sequence prediction task not unlike that of Elman (1990), but rather than predicting the next word in a sentence, they were trained to predict the next letter in the word (or to predict an end-of-word symbol, when appropriate.) The rationale was that humans are able to recognize a word as "foreign-looking" or grossly misspelled even if they are not familiar with the word. They know, for example, that the letter sequence "fff" does not appear in (English) words, so any "word" containing it must be non-English or mis-spelled. When a letter-predictor trained on English words encounters a sequence of letters which do not occur in English word, this fact will/should be reflected in the predictor's inability to correctly predict the next letter.

The data and input/output task

In the prediction task, words from a lexicon are presented to the network as sequences of letters. Each letter is locally represented (i.e. the input pattern has one bit set and the rest zero for each possible letter + end-of-word symbol). The input to the network is a stream of vectors representing letters + end-of-word concatenated together (e.g., "A C T end-of-word H O M E end-of-word O V E R end-of-word"). For each input vector, the target is to predict the next word in the stream.

The data contained 26 letters + end-of-word, and 373 words totalling 1880 characters including end-of-word characters.

The network

The author's early experiments with using vanilla SRNs found unsatifactorily high total-sum-of-squares error across the training set, so alternative architectures were investigated. The architectures investigated, termed tower networks in Wilson (1995), are best described by comparison with SRNs.

SRNs have input and output layers, a hidden layer, and a context layer or state vector. The context layer has the same number of units as the hidden layer, and the activations of the context layer units are copies of the corresponding hidden layer units from the previous time step. The context layer units are completely interconnected to the hidden layer units by trainable weights.

(Elman-type) tower networks have input and output layers, a hidden layer, and more than one state vector. Each state vector has the same number of units as the hidden layer. The activations of the first state vector are copies of the corresponding hidden layer units from the previous time step. The activations of the k+1st state vector are copies of the corresponding units of the k-th state vector from the previous time step. To put it another way, the activations of the k-th state vector are copies of the corresponding hidden layer units from the k-th previous time step. All state vector units are completely interconnected to the hidden layer units by trainable weights. The weights from the i-th state vector are trained completely independently from the weights from the j-th state vector (for i distinct from j). This distinguishes the scheme from limited-depth backpropagation through time. Figure 1 illustrates an Elman tower.

Wilson (1995) also considered two other classes of network, termed Jordan tower networks (cf. Jordan, 1986), where the the activations of the k-th state vector are copies of the corresponding output layer units from the k-th previous time step, and input-window networks, where the the activations of the k-th state vector are copies of the corresponding input layer units from the k-th previous time step. Input-window networks are architecturally similar to time-delay neural networks (TDNNs) (Waibel, 1989; Lang, Waibel & Hinton, 1990) but are trained in a different manner.

This discussion will focus mostly on the Elman tower networks.

Input/target pairs from the training corpus were presented in sequence to the network, which was trained using backpropagation for 1000 epochs, using a learning rate of 0.05.

Describe how the input/output task addresses the cognitive task

Analogously to Elman's SRN, the output vector that minimizes the error reflects a probability distribution of possible next letters based on the statistics of the training set.

In this sense, the network has learned something about the graphotactic patterns of the training set. See also the section on 'generalization and its causes', below.

However, probably a more significant point of the papers is the fact that using these architectures with extra state vectors allows significant reduction in the error, and a significant increase in the speed of learning (slope of the learning curve). These points are illustrated in Figure 2.


Figure 2: Learning performance of Elman tower nets with different numbers
of state vectors. TSS signifies (Total Sum-of-Squares) Error.

Memory:

Much as in Elman's original simple recurrent nets, information about the underlying graphotactic patterns of the training data is contained solely in the statistics of letter order. The comments in Wiles and McAuley (1996) on the information to be stored, mechanisms of storage, how the mechanisms achieve the memory, and much else, apply to these experiments with tower networks, (after replacing "word" by "letter" and "sentence" by "word"). Their comments need not be repeated here.

Mechanisms of storage

What does need to be discussed is the reasons for the difference in learning performance between straight SRNs and tower networks. What the tower network with n state vectors has, that the SRN does not, is representations of hidden unit activations patterns from up to n time steps ago. The current HU pattern is formed from a weighted sum of the input activations (only one input unit will be active) and the weighted sum of the activations of each unit in each state vector.

Since the current letter is the most significant piece of information in determining what the next letter will be, it is bound to figure large, if in digested form, in the output of the hidden units, and hence in what is copied into the top state vector (or into the (only) state vector in the case of a regular SRN).

After enough time steps, the k-th state vector in an Elman tower will contain information dating from k time steps ago, in which information about the k-th last letter seen will figure large.

How the mechanisms achieve the memory

Inevitably, at least some of the information in the state vector(s) from many time-steps ago must 'decay' away - an SRN or Elman tower might be trained to keep track of particularly cogent pieces of information from long ago, but simple capacity limitations mean that not everything can be stored for ever.

Consequently, an Elman tower, because it has more capacity, and because it has an exact copy of the HU activations from several time steps ago, and thus has access to relatively undecayed information from even longer ago, is able to be trained more rapidly.

The mechanisms of memory in the SRN include a fast memory process based on dynamic changes in the activation values of the HUs and a slow process of learning of the weights.

Describe how the mechanisms achieve the memory

In this respect, there is little to add to the commentary of Wiles and McAuley (1996) other than the obvious comment that everything is made rather more complex by the presence of the extra sets of context units, which affect the trajectory through HU-space, though they have no (direct) effect on the computation of the output vector.

The memory resource in the state vectors themselves is more extensive than in an SRN with the same number h of hidden units. Specifically a k-state-vector tower network has kh state vector units, compared with h for the SRN. The memory resource in the weights from the state vectors to the hidden units is also more extensive: the Elman tower network has khh while the SRN has hh.

Wiles and McAuley regard the state vector in an SRN as a short term memory: this description is even more appropriate for a tower network. The size of this short term memory is larger than for an SRN, but still limited. It more or less guarantees retention of information about the input for as many time steps as there are states. With Elman's (1990) sentence prediction task, two past time steps recall was fine, as the training data consisted of at most three-word sentences, but longer sequences were common with the letter-prediction task of Wilson (1993, 1995): for example, given the sequence S T R ..., the net would need to be able to predict a vowel letter, and given S T R A ..., it would need to be able to predict (based on a modest-sized lexicon, i.e. excluding uncommon words like strabismus and strake principally used in technical contexts and crossword puzzles) the class {D, G, I, N, P, T, W, Y}.

Time:

The major component of time in the study is time as sequence, as with the SRN, and as described by Wiles and McAuley (1996).

Time in the data:

Sequential information in the training data includes (1) the sequential presentation of letters as words; (2) the concatenation of words to form the training corpus; and (3) the repeated presentation (1000 times) through the training set. The representation of time as sequence uses an ordinal time scale in which only the order of the letters is preserved. With respect to point (2) above, note that in some versions of the simulations, the sequences representing words were strictly concatenated, maintaining state vector activations, while in later simulations, the state vectors were zeroed between words. This had a marked effect on the learning curve - see Figure 3. As can be seen, the zeroed nets learned more stably, but could not reduce the error as far as the non-zeroed nets. This is not in itself a cause for concern, as the inter-word information maintained by not zeroing the state vectors between word is not relevant to the task intended to be learned.


Figure 3: Effect of zeroing on learning performance
of otherwise comparable networks. TSS means Error.

Time in the processing:

As with an SRN, processing is in time as a sequence of events: each input item is processed through two layers of weights (input- and context-to-HU and HU-to-output) in one time step, with the HUs being updated using information from the input units and from the state vectors. The target output for each input is the next input in the sequence. The HU vectors are pushed onto a limited-depth stack of state vectors in preparation for subsequent steps in the sequence.

Wiles and McAuley (1996) refer to "the HUs being updated using difference equations." (Earlier in their paper, they note that the distinction between difference and differential equations is what allows significant jumps around in HU space.) Here they are presumably referring to the fact that the HU activation function in an SRN can be expressed (vectorially) as:

hidden(t) = input(t) WIH + hidden(t-1) WCH

where WIH refers to the weight matrix from input to hidden layer, WCH refers to the weight matrix from context or state vector to the hidden layer, input(t) signifies the input vector at time t, and hidden(t) and hidden(t-1) signify the hidden unit vector at times t and t-1. This constitutes a first order difference equation. The analogous equation for a k-state Elman tower network is:

hidden(t) = input(t) WIH + hidden(t-1) WC[1]H + ... + hidden(t-k) WC[k]H

where WC[i]H refers to the weight matrix from the i-th context or state vector to the hidden layer, and hidden(t-i) signifies the hidden unit vector at time t-i (for i = 0, ..., k.) This constitutes a k-th order difference equation.

Time in the learning:

The learning process takes time for incremental changes in the weights. Approximately 100-300 passes through the data set is necessary to accomplish the bulk of the learning, after which learning continues at a slower rate.

Change:

As with the SRN, change in tower networks is due to activations and weights; parameters and training set are constant throughout the simulation.

Change in activations: See the memory section above.

Change in weights: Wiles and McAuley's (1996) remarks apply here: the HU space starts out roughly linear, with small weights and hence in the linear region of the sigmoid units, but as the weights diverge and increase in magnitude, the sigmoids move into their non-linear regions, causing the HU space to grow in complexity.

Structure:

How structure in the world is represented as structure in the model

The structure in the 'world' consists of the graphotactic patterns of English, to the extent that they are exemplified in the training data.

The HU activation values are updated using higher-order difference equations. As with SRNs, these allow the HU trajectory for a sequence of words to visit widely separated regions in HU space, only more so, if anything. Although the papers under consideration do not explicitly address this: the learning algorithm can thus separate the HU vectors corresponding to different graphotactic classes (as suggested by the 'S T R is followed by a vowel' discussion at the section above on how the mechanisms achieve the memory). The relative positions and hierarchical nesting of the corresponding regions in HU-space represent the structure in the world.

What structure is learned

As with Elman's task and representation, the local input and output codings both allow focus on the training of particular weights, as remarked by Wiles and McAuley, and also mean that the letters are presented to the network without any tacit information about relationships that might exist between them. Without wanting to make too much of it, contrast this with the visual similarity of the glyphs p, b, d, all of which represent stop consonants and are used in similar graphotactic situations in words. Letter order is the sole source of information to the network.

Generalization and its causes

Once the program has been trained on examples that include words beginning with, e.g. spl, such as split and splutter, and on words that end in, say, ash, such as crash, lash and thrash, sum squared error is reduced on letters late in word constructed using known beginnings and endings, like splash, even though splash has not occurred in the training set.

Discussion and Conclusions:

In addition to the remarks of Wiles and McAuley (1996), it is clear that the strength of Elman tower networks (along with Jordan tower and input-window networks) lies in two areas:


Figure 4: Comparison of learning between Elman and Jordan tower networks
and input-window networks. TSS means Error.

References

J. L. Elman, Finding Structure in Time, Cognitive Science, 14:179-211, 1990.

Jordan, M.I., Attractor dynamics and parallelism in a connectionist sequential machine, Proceedings of the Eighth Annual Meeting of the Cognitive Science Society, Hillsdale, NJ: Erlbaum, 1986.

Lang, K.J., Waibel, A.H., and Hinton, G.E., A time-delay neural network architecture for isolated word recognition, Neural Networks 3:23-43, 1990.

Waibel, Alex, Modular construction of time-delay neural networks for speech recognition, Neural Computation 1:39-46, 1989.

Wiles, Janet and McAuley, J. Devin, Functional components of the Simple Recurrent Network used in discovery of lexical classes, Case Study 1, Virtual Workshop on Memory, time, change and structure in ANNs: Distilling cognitive models into their functional components: http://psy.uq.oz.au/CogPsych/acnn96/case1.html: 12 March 1996.