[Table of Contents]

Functional components of the Simple Recurrent Network
used in discovery of lexical classes

Janet Wiles and J. Devin McAuley
Departments of Computer Science and Psychology
University of Queensland
janetw@cs.uq.edu.au, devin@psy.uq.edu.au

Target Paper: J. L. Elman, Finding Structure in Time, Cognitive Science, 14:179-211, 1990. Simulation: ``Discovering lexical classes from word order'' pp 194-203.

Introduction:

The cognitive task of the model

Elman's Simple Recurrent Network (SRN) has been used in simulations of many cognitive phenomena. The task considered in this Case Study is the discovery of lexical classes (the grammatical category of a word, such as noun or verb) via a simple sentence prediction task: the network learns to abstract the lexical classes from a training set of simple sentences, with the network architecture and prediction task providing implicit constraints. The classes emerge as a by-product of learning a much simpler task - predicting the next word in a sentence.

The data and input/output task

In the prediction task, sentences from a grammar are presented to the SRN as sequences of words. Each word is locally represented (i.e., the input pattern has one bit set and the rest zero for each possible word). The input to the SRN is a stream of vectors representing sentences concatenated together (e.g., "woman smash plate [.] cat move [.] man break car", Elman, 1990, p. 197). For each input vector, the target is to predict the next word in the stream.

The data contained 29 different lexical items, 10000 two- and three-word sentences with some semantic restrictions on the order of words.

The network

The SRN consists of an input layer, a recurrent hidden unit (HU) layer and an output layer. Both hidden and output units used sigmoid activation functions. The HU layer was unfolded for one time step, creating a copy of the HU vector at the previous time step (called the context layer). Input/target pairs from the training corpus were presented in sequence to the network, which was trained using backpropagation.

Describe how the input/output task addresses the cognitive task

The prediction task provides a specific target (the next word) for each input, however over the whole training set, each input can have many different targets (e.g., dragon smash glass, dragon eat cookie). The output vector that minimizes the error reflects a probability distribution of possible next words based on the statistics of the training set. Elman showed that the similarity structure of the HU vectors reflected a hierarchical clustering of words into syntactic categories such as nouns and verbs, and semantic sub-categories, such as animate and inanimate objects.

Lexical classes in the SRN are a construct of the observer - they are not intrinsic to the network and they can be viewed in different ways: for a particular word in a given position in an input sentence, its lexical class can be approximated by the HU vector that it generates; alternatively, the set of all HU vectors in a given cluster can be interpreted as a lexical class.

Memory:

The information to be stored

Information about the underlying grammar and lexical classes of the training data is contained solely in the statistics of word order, as the static input/output vectors do not provide this information explicitly. The sequence of words provides two sources of information for discovering lexical classes - 1) the set of words that follow the current word, and 2) the set of sequences of words that precede the current word. As described below, both are used in the learning process.

The mechanisms of storage

The mechanisms of memory in the SRN include a fast memory process based on dynamic changes in the activation values of the HUs and a slow process of learning of the weights.

The units in the HU layer form a high-dimensional state space (called HU space). A vector of HU values corresponds to a point in this space, and groups of vectors loosely define regions (corresponding to lexical classes). Changes in HU values constitute state transitions, and sequences of vectors are often referred to as trajectories through HU space.

During the processing of a sequence, the HU activation values are updated as a function of both input- and context-unit vectors. Because of the copy-back of the HU layer to the context layer, the context vector is a consequence of all the previous input and context vectors. This process incorporates new information into a compressed record of the past, allowing every sub-sequence to have a unique representation in HU space. Backpropagation enhances those differences that enable the SRN to minimize error in the prediction task.

Describe how the mechanisms achieve the memory

At every step in processing a sequence, the context vector corresponding to a position in HU space is a form of short term memory that depends on the previous sequence of inputs, and hence contains information about its content. In performing the prediction task for a given sequence, a trajectory is traversed in HU space.

The underlying landscape of HU space is created by the weights from the input and context layers and they correspond to a more permanent form of memory. The landscape is constructed using backpropagation of errors from the two sources mentioned above - successor and predecessors of the current word: In the first source, for any input vector, the resulting output vector is similar to a frequency-weighted average of all output vectors that could follow the recent sequence of inputs; the second source arises because all regions in HU space that share a common target will have a pressure via the backpropagation of error to group into regions. The use of difference (rather than differential) equations in updating the HU activation values is a contributing factor to this process, as it allows large jumps in HU space and hence separation of the regions corresponding to lexical classes. (Incremental change would constrain successive HU vectors to lie in close proximity to one another, rather than allowing grouping by class structure.)

The short term memory properties of the SRN are very flexible, as the HUs combine information from the past sequence (stored in the context units) with new information about the current input item. The SRN does not have the capacity to retain all information from every sequence, however, it is able to learn which information to retain and which to compress: it can retain one particular feature of the input sequence over many time steps if trained to do so, but usually it retains information for a small number time steps, with older information dissipated due to the effects of the non-linearity in the activation function at the HU layer and the weighted sum of the context and input vectors. For this simulation, a retention for three time steps is sufficient as the grammar contains at most three-sentence words.

Even though the transitions through HU space occur in discrete steps like a Finite State Machine (FSM), there are differences in the representation of the underlying states: An FSM has symbolic states, without internal structure or implicit similarity between states (e.g., it is not possible to talk about being halfway between two states in an FSM). By contrast, states in an SRN have an underlying similarity structure by virtue of their representations as distributed vectors in HU space.

Time:

The major component of time in the study is time as sequence, which discards information about the absolute or relative durations of events.

Time in the data:

Sequential information in the training data includes (1) the sequential presentation of words as sentences; (2) the concatenation of sentences to form the training corpus; and (3) the repeated presentation (six times) through the training set. The representation of time as sequence uses an ordinal time scale in which only the relative order of the words is preserved: Each word is considered an event of equal duration with no separation between sentences.

Time in the processing:

Processing is in time as a sequence of events: Each input item is processed through two layers of weights (input- and context-to-HU and HU-to-output) in one time step, with the HUs being updated using difference equations. The target output for each input is the next input in the sequence. The HU vectors are stored as the context units in preparation for the next step in the sequence.

Time in the learning:

The learning process takes time for incremental changes in the weights. However, there is no commitment to the details of this process as maturational change (see Elman, 1993 for exploration of such effects).

One interesting aspect of time in this simulation is that although data representation and network processing are in time, the representation of lexical classes (the overall goal of the study) has no intrinsic temporal component.

Change:

The key aspects of change in the SRN are due to the activations and the weights, with parameters and training set constant throughout the simulation. Elman did not specifically discuss aspects of change in the target paper and simulation, and for this section we provide only a brief discussion, drawing on later understanding of hidden-unit dynamics.

Change in activations: See the memory section above.

Change in weights: The HU space is initially approximately a linear space, as the weights from the input layer start small, and hence the HUs (which are sigmoid) start in their linear range. In addition, many of the weights are redundant. As the weights diverge and increase in magnitude, the sigmoids move into their non-linear regions causing the HU space to grow in complexity. Change in weights is responsible for positioning of attractors, possible bifurcation of the space, and limit cycles.

Structure:

How structure in the world is represented as structure in the model

In this simulation, the structure ``in the world'' consists of the grammar underlying the training data, and it is revealed through the surface forms of the sentences in the training set. The lexical classes to be induced are a component of that grammar. They are represented as clusters of distributed vectors in HU space.

The HU activation values are updated using difference equations. They differ from the continuous differential equations that are typical in modelling physical systems in that the HU vectors can radically change from one time step to the next, which allows the HU trajectory for a sequence of words to visit widely separated regions in HU space. The learning algorithm is thus able to widely separate the HU vectors corresponding to different lexical items, enabling lexical classes to form separate regions within HU space. The relative positions and hierarchical nesting of these regions represent the structure in the world.

What structure is learned

The presentation of the data as a sequence of words allows the SRN to focus on the discovery of structure as classes of words that precede or follow a particular word. Word order is the sole source of information in the data, as the input and output representations encode complete independence of all the lexical items. However, its very poverty is an important characteristic of the data, as local vectors allow a single set of weights to be trained for each input vector (the others all being zero), avoiding other (irrelevant) factors that would interfere with lexical class discovery.

Generalization and its causes

The only generalization tested involved sentences containing a novel (untrained) lexical item ``zog'' substituted for all occurrences of the word ``man''. The hierarchical cluster analysis of HU vectors showed that ``zog'' was placed in a similar position in the cluster diagram to that of ``man'' in the previous one. The novel item was implemented with completely untrained weights from the input-to-HU layer, and hence these could play no role (other than chance) in the similarity observed. The result must have been due to the structure learned in the weights from the context to HU layers.

The sentence corpus in the target simulation was likely to exhaust all possible sentences of the sample grammar used, and hence precluded generalization to novel sentences of existing lexical items. In addition, the simplicity of the grammar precluded testing for other types of generalization such as embedded clauses and agreement of nouns and verbs. In later simulations, more complex aspects of a grammar were shown to be learnable (Elman, 1991), including noun and verb agreement, and the recursive embedding of clauses.

Discussion and Conclusions:

Our task in this Case Study analysis has been to distill the functional components of the SRN used in the discovery of lexical classes by addressing the four target areas of memory, time, change, and structure.

The power of the SRN to discover lexical classes lies in three primary components of the model. First, the sequential presentation of words encoded as local binary vectors provides an impoverished data set that only contains implicit information about lexical classes based on the co-occurrence of words in the training set. Second, architectural constraints of the SRN such as the recurrent combination of each input vector with the hidden-unit activations from the previous time step provides a compressed representation of the sequence of inputs. Third, in order to solve the non-deterministic prediction task, the model is forced to form a distributed memory that is based on the similarity structure of the compressed sequence representations. It is this similarity structure in HU space that reflects the lexical classes of the words.

References

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-224. Compressed postscript source.

Elman, J.L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99. Compressed postscript source.