
Distributed and Localist Representation in the Brain and in Connectionist Models

Zoltan Schreter
Department of Psychology
University of Tasmania

Abstract

After introducing the concepts of distributed and localist representation, this paper looks at where we can find examples of these types of representation in the brain and in connectionist networks. After describing some advantages and disadvantages of distributed representation, it concludes that a pure distributed representation, as found in the hidden layers of backpropagation type connectionist models, can make these networks problematic as Psychological models.

1. Distributed and localist representations

Distributed representation is one of the most often discussed features of connectionist modelling. But what is distributed representation? The easiest way to define it is by contrasting it with non-distributed (or localist or unique) representation. Imagine a one dimensional vector variable, containing 26 elements: [v1, v2, ..., v26]. Also imagine that we want to use this vector to represent letters of the alphabet. One obvious way to do this would be to assign each element to represent one particular letter. E.g., v1 would represent a, v2 would represent b, v3 would represent c, etc. But what would it mean that a particular element represents a particular letter? It would mean that the element that represents the letter would have a value that is characteristic of, and discriminative for, that letter. For example, v1 would have the value 1 if the task is to represent a, but it would have a different value if the task is to represent any other letter. If we allow only values of 1 and 0, then this would mean that, while representing a, the vector would have a 1 as the value of the first element, and 0's as the values of all other elements: [1,0,...,0]. While representing b, it would have a 1 as the value of the second element, and 0's as the values of all other elements: [0,1,...,0], and so on. This type of representation can be called localist because the representations of the letters are localised in separate, specific elements of the vector.
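The localist scheme just described can be sketched in a few lines of Python. This is only an illustration: the function names localist_encode and localist_decode are hypothetical and do not come from any of the models discussed in this paper.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 letters 'a'..'z'

def localist_encode(letter):
    """Return a 26-element vector with a single 1 at the letter's own position."""
    vector = [0] * len(ALPHABET)
    vector[ALPHABET.index(letter)] = 1
    return vector

def localist_decode(vector):
    """Recover the letter from the position of the single active element."""
    return ALPHABET[vector.index(1)]

a_vec = localist_encode("a")  # 1 in the first position, 0 elsewhere
b_vec = localist_encode("b")  # 1 in the second position, 0 elsewhere
```

Because each letter owns exactly one element, decoding needs nothing more than finding the position of the 1.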

Now imagine another vector, this time with only 8 elements: [v1, v2, v3, v4, v5, v6, v7, v8]. Again, let's assume that vn can only have values of either 1 or 0. This time we cannot use each element of the vector to represent a different letter. What we can do is to use different combinations of 1 values for the eight elements to represent the different letters. For example, we could assign the vector [0,0,0,0,1,1,1,1] to represent the letter a, the vector [0,0,0,1,1,1,1,1] to represent the letter b, and the vector [0,0,1,1,1,1,1,1] to represent the letter c. It is possible to represent all letters of the alphabet in this way - in fact, the eight elements of the vector are able to represent 2^8 = 256 distinguishable patterns.
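The capacity of the 8-element vector can be checked by enumeration. The sketch below also assigns each letter a pattern, but note that this particular assignment (the binary form of the letter's position in the alphabet) is an assumption made for illustration. It is not the assignment used in the example above, and any other one-to-one assignment would serve equally well.

```python
from itertools import product
import string

N_ELEMENTS = 8

# Every binary pattern over 8 elements: 2 ** 8 of them.
all_patterns = list(product([0, 1], repeat=N_ELEMENTS))

def distributed_encode(letter):
    """Hypothetical assignment: the letter's 1-based alphabet position in binary."""
    index = string.ascii_lowercase.index(letter) + 1
    return [int(bit) for bit in format(index, "08b")]

codes = {letter: distributed_encode(letter) for letter in string.ascii_lowercase}
```

All 26 codes are distinct, and elements are shared: for instance, the last element is 1 in the codes for both a and c.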

The defining characteristic of distributed representations is obvious from the above example: the representations of the letters are not localised in separate, specific elements of the vector: each element can have the value of 1 in the representation of several letters (while it has the value 0 in the representation of the other letters). That is, there are necessarily overlaps in the patterns of 1's representing different letters.

The "distributedness" of representation is not an all-or-none affair: there are different degrees of it. Localist representation - in which each element assumes a value of 1 only for representing one particular thing - is the one extreme: it is the least distributed one. The other extreme is if we want to represent all possible binary patterns using the same vector variable, e.g. 256 patterns in the 8-element vector. If we decrease the number of patterns that we want to represent using the same number of elements, then, usually (that is, if the patterns are a more or less random selection from all possible patterns), each element will assume a value of 1 in fewer patterns - that is, the representation is "less distributed".
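One rough way to make the degree of distributedness concrete is to count, for each element, the number of patterns in which it has the value 1. The following sketch does this for the two extremes and for the three letter patterns given earlier. The function name participation_counts is a hypothetical label, and this count is offered as an illustrative measure only, not as an established metric.

```python
from itertools import product

def participation_counts(patterns):
    """For each element position, count the patterns in which it has the value 1."""
    return [sum(column) for column in zip(*patterns)]

# Localist extreme: 8 patterns, each with its own single active element.
localist = [[1 if i == j else 0 for j in range(8)] for i in range(8)]

# Fully distributed extreme: all 256 binary patterns over 8 elements.
full = list(product([0, 1], repeat=8))

# The patterns assigned to a, b and c in the 8-element example.
abc = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
]

localist_counts = participation_counts(localist)  # every element used once
full_counts = participation_counts(full)          # every element used 128 times
abc_counts = participation_counts(abc)
```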

A further type of representation is this: say we want to represent a by the vector [1,1,1,0,0,0,0,0,0,0,...,0], b by the vector [0,0,0,1,1,1,0,0,0,0,...,0], c by the vector [0,0,0,0,0,0,1,1,1,0,...,0], etc. Here, the defining feature of distributed representation is missing: each element participates in the representation of only one letter. Although a letter's representation is "distributed" over several elements, the locations (positions) of the elements are distinct for the letter. Because of this, I shall call this type of representation "quasi-localist".
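A minimal sketch of the quasi-localist scheme follows, assuming groups of three elements per letter as in the example vectors (so 78 elements for 26 letters). The helper names are hypothetical. The overlap function counts the elements that are active in both of two patterns, which is zero for any two different letters here.

```python
import string

ALPHABET = string.ascii_lowercase
GROUP_SIZE = 3  # elements reserved for each letter, as in the example above

def quasi_localist_encode(letter):
    """Activate the letter's own block of GROUP_SIZE adjacent elements."""
    vector = [0] * (GROUP_SIZE * len(ALPHABET))
    start = GROUP_SIZE * ALPHABET.index(letter)
    for position in range(start, start + GROUP_SIZE):
        vector[position] = 1
    return vector

def overlap(u, v):
    """Number of elements that are active in both patterns."""
    return sum(x * y for x, y in zip(u, v))

a_vec = quasi_localist_encode("a")
b_vec = quasi_localist_encode("b")
```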

The human brain can be imagined as an enormous vector of up to 100 billion elements, where each element is a neuron. It is a well accepted fact that many neurons "represent" things that are outside of the brain. For example, the visual cortex contains neurons that are more active if a particular pattern - e.g. a bar of a specific orientation - is present in the visual field than if other patterns are present. Neurons in the somato-sensory cortex represent touch sensation from different parts of the body surface. If we think of neurons in a simplistic way and describe their activity using only the terms "active" (being in the state of maximal frequency of action potentials) and "non-active" (being in the state of minimal frequency of action potentials), then the connection with the discussion above is clear: each thing is represented in the brain as a vector of binary elements. However, it is immediately clear that this is an oversimplification: neurons are not only active or non-active, but they can have many intermediate states. This corresponds, in the discussion above, to the elements of the vectors having not only binary values, but essentially any value between a maximal value (e.g. 1) and a minimal value (e.g. 0). Obviously, this extends the meaning of distributed representation enormously, as it makes it possible to represent an almost unlimited number of things in any vector of elements.

The elements of the neuronal vector can easily assume new values and the vector is assumed to correspond to transitory psychological states: consciousness, short term memory, current percept, current goals, etc.

The neurons of the brain are heavily interconnected, by synaptic connections, through which activity from one neuron is transmitted to another. It is well accepted that synaptic connections can differ in their efficiency of transmitting activity. Because of this, the brain is not only a vector of neuronal elements. It can also be seen as a vector of synaptic elements, with values corresponding to the efficiency of the individual synapses. The values of the elements of the synaptic vector change much more slowly than the values of the neuronal vector. It is widely assumed that long term memory is stored in this vector and that learning means changing the values of some elements of this vector. If the neuronal vector represents things, then what does the synaptic vector represent? Most researchers assume that it represents associations between the representations of things.

Although the relationship of connectionist models to the brain's neural networks is far from clear, and although many connectionist researchers explicitly deny being involved in a serious attempt at modelling the brain, there are some undeniable similarities between natural and artificial neural networks. Just like natural neural nets, connectionist networks consist of a number of relatively simple elements connected to each other. The main thing the elements do is compute an activation value out of input arriving from some elements and transmit this activation value to other elements. Also like in natural neural networks, the connections have transmission efficiency values (often called "weights") associated with them. These weight values are often changed during learning in a connectionist network.
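What a single connectionist element does can be sketched as follows. The weighted sum followed by a logistic (sigmoid) squashing function is an assumption here: it is one common choice in connectionist models, but the description above does not commit to any particular activation function, and the bias parameter is likewise an illustrative addition.

```python
import math

def unit_activation(inputs, weights, bias=0.0):
    """Activation of one unit: logistic function of the weighted input sum."""
    net_input = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net_input))

# Activities arriving from two sending units, and the weights on those connections.
resting = unit_activation([0, 0], [1.0, 1.0])   # no input: activation 0.5
driven = unit_activation([1, 1], [2.0, 2.0])    # strong excitatory input
```

Learning, in this picture, amounts to changing the numbers passed in as weights.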

Several questions can be asked at this point:

2. Distributed vs. localist representations in the brain

There is evidence both for localist and distributed representation in the brain. The sensory surface - the "input layer" of the brain - can be seen to contain "representations" of stimulation from the environment. These representations are almost always distributed. For example, one particular "rod" in the eye's retina can be activated in endlessly many light patterns falling on the retina. In the lateral geniculate nucleus (LGN) there is still a rather high level of "localism" in representation: the representation there is a "topological map" of the retina, that is, neighbouring regions of the retina make connections to neighbouring regions in the LGN. However, there is quite some distributedness in that representation, too: the receptive fields of adjacent LGN neurons overlap. This can be compared to, say, wanting to represent a by the vector [1,1,1,0,0,0,0,0], b by the vector [0,1,1,1,0,0,0,0], c by the vector [0,0,1,1,1,0,0,0], etc. At even "higher" levels, in the visual cortex, there is considerable evidence for a type of localist representation. There are cells there that apparently respond preferentially to particular types of stimuli, e.g. to dark bars in a particular orientation, and they do not seem to respond to anything else with that same intensity. This, naturally, does not mean pure localist representation: there is certainly more than one location (neuron) in the cortex that responds preferentially to dark bars oriented at 45 degrees, corresponding to the quasi-localist type representation as defined above. As Churchland and Sejnowski (1992) observe, it is even possible that such orientation sensitive cells take part in other, as yet undiscovered representations. To disprove this possibility, we would have to present the cell with ALL possible light patterns.
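The overlapping vectors used above as an analogy for LGN receptive fields can be generated and compared with a short sketch. The function names and the window width of three elements are taken from, or chosen to match, the example vectors. This is an analogy in code, not a model of the LGN.

```python
def windowed_pattern(start, width=3, size=8):
    """Pattern with `width` adjacent active elements beginning at index `start`."""
    return [1 if start <= i < start + width else 0 for i in range(size)]

def overlap(u, v):
    """Number of elements that are active in both patterns."""
    return sum(x * y for x, y in zip(u, v))

a = windowed_pattern(0)  # [1,1,1,0,0,0,0,0]
b = windowed_pattern(1)  # [0,1,1,1,0,0,0,0]
c = windowed_pattern(2)  # [0,0,1,1,1,0,0,0]
d = windowed_pattern(3)  # [0,0,0,1,1,1,0,0]
```

Neighbouring patterns share two elements, patterns two steps apart share one, and patterns three steps apart share none, so the overlap falls off with distance, as it would in a topological map.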

3. Distributed vs. localist representation in connectionist networks

One often hears that a characteristic feature of connectionist networks is distributed representation. This is certainly true for the representations in the INPUT layer of most connectionist models (a model in which this is NOT true is the reinforcement learning model for "pole balancing" by Barto, Sutton, and Anderson, 1983, in which each of the 162 input layer units is active only in one particular pattern - that is, each pattern contains only one active input layer unit). One of many examples of distributed representation in the input layer of networks is the Rumelhart-Siple font, used in the Interactive Activation Model, a connectionist model of context effects in reading (McClelland and Rumelhart, 1981). This font contains 14 simple two dimensional features. All 26 letters can be described by indicating the presence or absence of each of these features. The input layer of the model contains - for each of four positions that a letter can have in this model - a unit for each feature, and the presence of a feature is indicated by the corresponding unit being active (activity=1), while the absence is indicated by the unit being inactive (activity=0).

How about the OUTPUT layer? There, we often find distributed representations, too. For example, in the NETtalk model of Sejnowski and Rosenberg (1987) - a network for mapping text input into phonemic output - the phonemes in the output layer have a distributed representation: each phonemic feature is represented by one output layer unit, and each phoneme is represented as a binary activity pattern (activities of 1's and 0's) in the output layer, corresponding to the features that are present (activity of 1) and absent (activity of 0) in a particular phoneme.

However, there are many successful connectionist models which do NOT have a distributed representation in the output layer. One of the best known models with localist output layer representations is the already mentioned Interactive Activation Model by McClelland and Rumelhart (1981). In that model, each output layer unit corresponds to a (4 letter) word. Two other, more recent models - successfully applied to the modelling of different aspects of human category learning - with localist output layers stem from Gluck and Bower (1988) and from Kruschke (1990). A particular category is represented in these models by one particular output layer unit. Connectionist networks in Artificial Intelligence (AI) also often have localist representations in the output layer. For example, the "expert system" network by Bounds (1989), trained to diagnose "lower back pain" (LBP), has an output layer unit for each of four categories of LBP. "Topological map" type networks (e.g. Kohonen, 1988) have groups of output layer units as representations for categories of input patterns. Although there can be an overlap in the representations, generally a group of units representing a particular category does not take part in the representation of other categories: because of this, output layer representations in topological map networks are more localist than distributed. In "competitive learning" type networks - of which the topological map networks are a subset - there is lateral inhibition among the output layer units that allows, after a settling down (relaxation) process, only one or a few of the output layer units to be active; the activity in the rest of those units is suppressed by the inhibition coming from the active units.
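The settling process under lateral inhibition can be illustrated with a simple iterative scheme in which every unit is repeatedly inhibited in proportion to the summed activity of all the other units, with activities clipped at zero. This is a generic sketch under assumed values (the inhibition strength of 0.2, the ten update steps and the starting activities are all arbitrary), and it is not the update rule of any of the specific models cited above.

```python
def lateral_inhibition(activities, inhibition=0.2, steps=10):
    """Let units suppress one another until the activity pattern settles.

    On each step every unit loses `inhibition` times the summed activity
    of all other units; activities cannot fall below zero.
    """
    current = list(activities)
    for _ in range(steps):
        total = sum(current)
        current = [max(0.0, a - inhibition * (total - a)) for a in current]
    return current

settled = lateral_inhibition([0.5, 0.9, 0.3])
winner = settled.index(max(settled))  # index of the unit that remains active
```

With these starting values the initially most active unit is the only one still active after settling, which turns a graded pattern into a localist one.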

In general, in contrast to representations in the input layer, those in the output layers of connectionist models are probably not more often distributed than localist.

Many connectionist models contain - in addition to an input and an output layer - a HIDDEN layer, as well. In fact, the ability to train the connections between the input layer and the hidden layer - acquired since the mid-80's through the discovery of "backpropagation" type learning rules - played an important part in the greatly increased interest in connectionist modelling. In backpropagation, we do not want to determine representations in the hidden layer. We want the network to do this - and the network typically creates highly distributed representations. There are many models, both in AI and in Psychology, that have a hidden layer with distributed representations (e.g. Sejnowski and Rosenberg, 1987; McCloskey and Cohen, 1989; Cohen, Dunbar, and McClelland, 1990, etc.). In fact, for many people, distributed representation is probably almost equivalent to hidden layer representation. There are, however, quite a number of connectionist models without hidden layers or with hidden layers but without distributed representations in them.

One well known example of models without a hidden layer is the above mentioned topological map network of Kohonen. In fact, most models with an "unsupervised" learning algorithm - a type of learning in which the network is left to itself in determining the representations of pattern categories in the output layer - like the ART model of Grossberg (Carpenter and Grossberg, 1988), one of the competitive learning models described in Rumelhart and Zipser (1986), and also Kohonen (1988), etc., have only two layers of units. The prototypical "supervised" learning model with only two layers is the perceptron. Not many two layer perceptron models have been used within AI since the discovery of backpropagation, but some were applied to psychological modelling. One example of this is the above mentioned model by Gluck and Bower (1988), another one is a network that models the learning of the past tense of English verbs (Rumelhart and McClelland, 1986).

There are some models with a hidden layer, but with localist representations in that layer. One example is the Interactive Activation Model: between the input layer - containing units representing letter features - and the output layer - containing units that represent words - there is a layer of units each of which represents a particular letter (in one of four positions). In another model - the "counterpropagation" network (Hecht-Nielsen, 1987) - the hidden layer is trained by a Kohonen/competitive learning type algorithm, with the result that each unit represents a different category. Another example is the three layer competitive learning network presented by Rumelhart and Zipser (1986). In this model both the hidden layer and the output layer are trained by the same unsupervised learning algorithm. Again, each unit in the hidden layer represents a different input pattern category. A further example is the ALCOVE model by Kruschke (1990). In ALCOVE, the hidden layer often contains a large number of units. These units differ from the hidden layer units in many other networks in that they have a "radial basis" activation function, in contrast to the more standard "sigmoid" activation function. In this network, any particular input pattern category activates only a small group of the hidden layer units, and those hidden layer units respond only to patterns belonging to that particular category - that is, we have, as in the Topological Map network, a sort of quasi-localist representation.
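The difference between the two kinds of hidden unit can be sketched as follows. The radial basis unit below uses a Gaussian of the distance between the input and the unit's centre, which is an assumed, generic form and not ALCOVE's exact activation function. The sigmoid unit is the usual logistic function of a weighted sum. All names and parameter values are illustrative.

```python
import math

def sigmoid_unit(inputs, weights, bias):
    """Logistic function of the weighted input sum: active over a whole half-space."""
    net_input = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net_input))

def radial_basis_unit(inputs, center, width=1.0):
    """Gaussian of the distance to the unit's centre: active only near that centre."""
    squared_distance = sum((x - c) ** 2 for x, c in zip(inputs, center))
    return math.exp(-squared_distance / (2.0 * width ** 2))

near, far = [1.0, 1.0], [5.0, 5.0]
rbf_near = radial_basis_unit(near, center=[1.0, 1.0])          # maximal response
rbf_far = radial_basis_unit(far, center=[1.0, 1.0])            # almost no response
sig_far = sigmoid_unit(far, weights=[1.0, 1.0], bias=-2.0)     # still strongly active
```

A radial basis unit falls silent for inputs far from its centre, whereas a sigmoid unit keeps responding however far the input moves in its preferred direction. This local tuning is what lets each hidden unit take part in only a few representations.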

Summarising the discussion of representations in the hidden layer, one can say that - although quite a number of connectionist models have hidden layers with distributed representations in them - there are also quite a few recent models either with no hidden layer at all or with a hidden layer in which representation is localistic to a large degree.

4. Advantages and disadvantages of distributed and localist representations

Lashley (1950) selectively destroyed different parts of the brains of rats and observed the effect of this on the behaviour of the animals. He abstracted two principles from these observations: "mass action" and "equipotentiality". "Mass action" means that seriousness of behavioural disturbance is positively related to the amount of brain damage. That is, there is a "graceful degradation" instead of a catastrophic breakdown of behavioural performance, as disturbance increases. "Equipotentiality" means that memories in the cortex cannot be localised in a particular place but are distributed over extended regions of it.

"Mass action", "graceful degradation", and "equipotentiality" are often taken to be signs of distributed representation in the brain. However, we would expect these characteristics to be also present in the sort of quasi-localist representation mentioned above, if the neurons, taking part in the representation of one particular memory, are randomly distributed over the cortex (Feldman, 1981).

"Mass action" and "equipotentiality" could also be demonstrated in several connectionist models (e.g. Wood, 1978; Sejnowski and Rosenberg, 1987). Sejnowski and Rosenberg (1987) "lesioned" NETtalk - a network with a hidden layer and distributed input and output layer representations - by changing its learned connection weights. "Mass action" and "graceful degradation" in the lesion experiments with NETtalk might have had to do with distributed representations in that network. But a hidden layer and distributed representations in it are not necessary for "mass action" and "graceful degradation" to occur: the network model of Wood (1978), that had no hidden layer and had distributed input and output layer representations, showed these features after "lesioning" (removing units from representations).

Another, in this case real, advantage of distributed representations is that - as illustrated at the beginning of this paper - they are economical: they need far fewer elements, or units, than localist representations do.

The degree of distributedness of representation in a connectionist network can affect its ability to model Psychological phenomena. One good example of this is the controversy about "catastrophic interference" in connectionist networks of the backpropagation type with a hidden layer (McCloskey and Cohen, 1989; Ratcliff, 1990). "Catastrophic interference" means that the network can quickly forget much of the knowledge that it has built up during previous training, if it is trained to acquire new knowledge while not "rehearsing" the old knowledge at the same time. This sounds like forgetting by interference - one of the main ideas about how forgetting occurs in humans - but actually the networks forget much faster and more profoundly than can be expected in humans. Optimally, there should be some forgetting by the interference of new learning with old knowledge, but it should be much less than observed by McCloskey and Cohen (1989) and by Ratcliff (1990).

This problem has to do with distributed representation, and it happens even in very small networks without a hidden layer, as demonstrated by McCloskey and Cohen, if there is distributed representation in the input layer. The problem is, naturally, exacerbated in three-layer networks with distributed representations in the hidden layer.
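Why overlap between input patterns leads to interference can be seen in a toy network with no hidden layer. The sketch below is an illustration under stated assumptions and not a replication of the simulations by McCloskey and Cohen: it uses two linear output units trained with the delta rule, a single pattern per task, and arbitrary values for the learning rate and number of epochs. The network first learns pattern A, then pattern B without rehearsing A, and the error on A is measured afterwards.

```python
def train(weights, pattern, target, rate=0.1, epochs=200):
    """Delta-rule training of a layer of linear output units on one pattern."""
    for _ in range(epochs):
        for unit_weights, unit_target in zip(weights, target):
            output = sum(w * x for w, x in zip(unit_weights, pattern))
            error = unit_target - output
            for i, x in enumerate(pattern):
                unit_weights[i] += rate * error * x
    return weights

def respond(weights, pattern):
    """Output of each linear unit for the given input pattern."""
    return [sum(w * x for w, x in zip(unit_weights, pattern))
            for unit_weights in weights]

def error_on_first_task(pattern_a, pattern_b):
    """Squared error on A after learning A and then B with no rehearsal of A."""
    weights = [[0.0] * len(pattern_a) for _ in range(2)]
    train(weights, pattern_a, [1.0, 0.0])  # task 1: A belongs to category 1
    train(weights, pattern_b, [0.0, 1.0])  # task 2: B belongs to category 2
    output = respond(weights, pattern_a)
    return sum((t - o) ** 2 for t, o in zip([1.0, 0.0], output))

# Overlapping (distributed) inputs versus non-overlapping (localist) inputs.
distributed_error = error_on_first_task([1, 1, 1, 0], [0, 1, 1, 1])
localist_error = error_on_first_task([1, 0, 0, 0], [0, 1, 0, 0])
```

With the overlapping inputs, learning B changes the weights on the two elements that A shares with B, so the response to A is badly disturbed. With the non-overlapping inputs, learning B touches only weights that A does not use, and the response to A is left intact.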

The trick for solving this problem, correspondingly, is to reduce the distributedness of representation in the network. What is really necessary, and in fact sufficient, is to let the network somehow build up separate representations in the hidden layer for different things the network learns. This is the essence of several of the solutions to the problem that I know of. The ALCOVE model (Kruschke, 1990) overcomes "catastrophic interference" because of the quasi-localist representation in its hidden layer. Sloman and Rumelhart (1992) let the network select - for each "episode" - a different subset of the hidden layer units, resulting, again, in a localisation of all memories that belonged to the episode. And I found that a network with no hidden layer was much less disturbed by "catastrophic interference" than networks with hidden layers (Schreter, 1993).

5. Summary

Distributed representations are an interesting and important means of storing information. They have recently gained much prominence among researchers interested in connectionist models. However, I think that we should be aware of the fact that distributed representations - in particular as they appear in the hidden layers of backpropagation type networks - seem to have disadvantages for modelling some Psychological phenomena, in particular sequential learning. In many other cases they do not seem to be necessary - at least in parts of the network other than the input layer - for building a successful model. In the long term, the best solution is probably a network with a hidden layer containing representations with decreased but not fully eliminated distributedness.

6. References:

Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-13, No. 5, September/October 1983, 834-846.

Bounds, D.G. (1989). Expert systems and connectionist networks. In Pfeifer, R., Schreter, Z., Fogelman-Soulie, F., & Steels, L. (Eds.). Connectionism in perspective. Amsterdam: Elsevier, 277-282.

Carpenter, G. & Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21 (3), 77-88.

Churchland, P.S. & Sejnowski, T.J. (1992). The computational brain. Cambridge, MA: The MIT Press.

Cohen, J.D., Dunbar, K., & McClelland, J.L. (1990). On the control of automatic processes: A parallel distributed model of the Stroop effect. Psychological Review, 97, 3, 332-361.

Feldman, J.A. (1981). Memory and change in connection networks. University of Rochester, Department of Computer Science, TR96.

Gluck, M.A. & Bower, G.H. (1988). Evaluating an adaptive network model of human learning. Journal of Memory and Language, 27, 2, 166-195.

Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979-4984.

Kohonen, T. (1988b). The "neural" phonetic typewriter. Computer, 21 (3), 11-24.

Kruschke, J.K. (1990). ALCOVE: A connectionist model of category learning. Research Report 19, Cognitive Science Program, Indiana University.

Lashley, K. (1950). In search of the engram. Society of Experimental Biology, 1950, Symposium 4, 454-482.

McClelland, J.L., & Rumelhart, D.E. (1981). An interactive activation model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375-407.

McCloskey, M. & Cohen, N.J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In Bower, G.H. (Ed.). The psychology of learning and motivation: Volume 23. New York: Academic Press.

Ratcliff, R. (1990). Connectionist models and recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97, No. 2, 285-308.

Rumelhart, D.E. & McClelland, J.L. (1986). On learning the past tenses of English verbs. In McClelland, J.L., Rumelhart, D.E., and the PDP Research Group (1986). Parallel Distributed Processing. Vol. 2: Psychological and Biological Models. Cambridge: MIT-Press, 216-271.

Rumelhart, D.E. & Zipser, D. (1986). Feature discovery by competitive learning. In Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986). Parallel Distributed Processing. Vol. 1: Foundations. Cambridge: MIT-Press, 151-193.

Schreter, Z. (1993). Modelling proactive and retroactive interference with connectionist networks. Proceedings of the Second Australian Cognitive Science Conference, Melbourne, 55-57.

Sejnowski, T.J., & Rosenberg, C.R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.

Sloman, S.A., & Rumelhart, D.E. (1992). Reducing interference in distributed memories through episodic gating. In Healy, A.F., Kosslyn, S.M., & Shiffrin, R.M. (Eds.). From learning theory to connectionist theory: Essays in honor of William K. Estes, Vol. 1. Hillsdale: Lawrence Erlbaum, 227-248.

Wood, C.C. (1978). Variations on a theme by Lashley: Lesion experiments on the neural model of Anderson, Silverstein, Ritz, and Jones. Psychological Review, 85, 6, 582-591.