Now imagine another vector, this time with only 8 elements: [v1, v2, v3, v4, v5, v6, v7, v8]. Again, let's assume that each vn can only have a value of either 1 or 0. This time we cannot use each element of the vector to represent a different letter. What we can do is use different combinations of 1 values across the eight elements to represent the different letters. For example, we could assign the vector [0,0,0,0,1,1,1,1] to represent the letter a, the vector [0,0,0,1,1,1,1,1] to represent the letter b, and the vector [0,0,1,1,1,1,1,1] to represent the letter c. It is possible to represent all letters of the alphabet in this way - in fact, the eight elements of the vector are able to represent 2^8 = 256 distinguishable patterns.
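The counting argument can be checked with a short Python sketch. The function names, and the choice of which 8-element pattern goes with which letter, are illustrative assumptions; the text does not prescribe any particular assignment.

```python
from itertools import product
import string

def localist_codes(symbols):
    """One element per symbol: each vector contains exactly one 1."""
    n = len(symbols)
    return {s: [1 if i == k else 0 for i in range(n)]
            for k, s in enumerate(symbols)}

def distributed_codes(symbols, n_elements=8):
    """Give each symbol a distinct binary pattern over n_elements elements.

    The all-zero pattern is skipped (an arbitrary choice) so that every
    symbol has at least one element with the value 1.
    """
    patterns = list(product([0, 1], repeat=n_elements))[1:]
    if len(symbols) > len(patterns):
        raise ValueError("not enough distinct patterns for all symbols")
    return {s: list(p) for s, p in zip(symbols, patterns)}

letters = list(string.ascii_lowercase)
local = localist_codes(letters)
dist = distributed_codes(letters)

print(len(local["a"]))  # 26: one element per letter
print(len(dist["a"]))   # 8: the same letters fit into eight elements
print(2 ** 8)           # 256: number of distinct 8-element binary patterns
```

The two dictionaries make the contrast concrete: the localist code needs as many elements as there are letters, while the distributed code reuses the same eight elements for every letter.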
The defining characteristic of distributed representations is obvious from the above example: the representations of the letters are not localised in separate, specific elements of the vector: each element can have the value of 1 in the representation of several letters (while it has the value 0 in the representation of the other letters). That is, there are necessarily overlaps in the patterns of 1's representing different letters.
The "distributedness" of representation is not an all-or-none affair: there are different degrees of it. Localist representation - in which each element assumes a value of 1 only for representing one particular thing - is one extreme: it is the least distributed one. The other extreme is reached when we want to represent all possible binary patterns using the same vector variable, e.g. 256 patterns in the 8-element vector. If we decrease the number of patterns that we want to represent using the same number of elements, then, usually (that is, if the patterns are a more or less random selection from all possible patterns), each element will assume a value of 1 in fewer patterns - that is, the representation is "less distributed".
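One simple way to make the degree of distributedness tangible is to count, for each element, the number of patterns in which it has the value 1. The measure and the function name below are assumptions made for illustration, not a definition taken from the literature.

```python
from itertools import product

def patterns_per_element(patterns):
    """For each element position, count the patterns in which it is 1."""
    return [sum(column) for column in zip(*patterns)]

# Localist extreme: eight things, each with its own element.
localist = [[1 if i == k else 0 for i in range(8)] for k in range(8)]

# Fully distributed extreme: all 256 binary patterns over eight elements.
full = [list(p) for p in product([0, 1], repeat=8)]

# The three example letters a, b, c from the start of this discussion.
abc = [[0, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 1, 1, 1, 1, 1, 1]]

print(patterns_per_element(localist))  # [1, 1, 1, 1, 1, 1, 1, 1]
print(patterns_per_element(full))      # [128, 128, 128, 128, 128, 128, 128, 128]
print(patterns_per_element(abc))       # [0, 0, 1, 2, 3, 3, 3, 3]
```

In the localist case every element takes part in exactly one pattern; in the fully distributed case every element takes part in half of all patterns.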
A further type of representation is this: say we want to represent a by the vector [1,1,1,0,0,0,0,0,0,0,...,0], b by the vector [0,0,0,1,1,1,0,0,0,0,...,0], c by the vector [0,0,0,0,0,0,1,1,1,0,...,0], etc. Here, the defining feature of distributed representation is missing: each element participates in the representation of only one letter. Although a letter's representation is "distributed" over several elements, the locations (positions) of the elements are distinct for the letter. Because of this, I shall call this type of representation "quasi-localist".
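The difference between quasi-localist and distributed codes can be sketched by measuring the overlap between two patterns, that is, the number of positions at which both have the value 1. The block size of three follows the example above; the helper names are hypothetical.

```python
def quasi_localist_codes(symbols, block=3):
    """Give each symbol its own block of `block` adjacent elements."""
    n = len(symbols) * block
    codes = {}
    for k, s in enumerate(symbols):
        v = [0] * n
        for i in range(k * block, (k + 1) * block):
            v[i] = 1
        codes[s] = v
    return codes

def overlap(u, v):
    """Number of positions at which both vectors have the value 1."""
    return sum(x * y for x, y in zip(u, v))

codes = quasi_localist_codes(["a", "b", "c"])
print(codes["a"])                       # [1, 1, 1, 0, 0, 0, 0, 0, 0]
print(overlap(codes["a"], codes["b"]))  # 0: no element is shared

# The distributed codes for a and b used earlier share four elements.
a_dist = [0, 0, 0, 0, 1, 1, 1, 1]
b_dist = [0, 0, 0, 1, 1, 1, 1, 1]
print(overlap(a_dist, b_dist))          # 4
```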
The human brain can be imagined as an enormous vector of up to 100 billion elements, where each element is a neuron. It is a well accepted fact that many neurons "represent" things that are outside of the brain. For example, the visual cortex contains neurons that are more active if a particular pattern - e.g. a bar of a specific orientation - is present in the visual field than if other patterns are present. Neurons in the somato-sensory cortex represent touch sensation from different parts of the body surface. If we think of neurons in a simplistic way and describe their activity using only the terms "active" (being in the state of maximal frequency of action potentials) and "non-active" (being in the state of minimal frequency of action potentials), then the connection with the discussion above is clear: each thing is represented in the brain as a vector of binary elements. However, it is immediately clear that this is an oversimplification: neurons are not only active or non-active, but can have many intermediate states. This corresponds, in the discussion above, to the elements of the vectors having not only binary values, but essentially any value between a maximal value (e.g. 1) and a minimal value (e.g. 0). Obviously, this extends the meaning of distributed representation enormously, as it makes it possible to represent an almost unlimited number of things in any vector of elements.
The elements of the neuronal vector can easily assume new values and the vector is assumed to correspond to transitory psychological states: consciousness, short term memory, current percept, current goals, etc.
The neurons of the brain are heavily interconnected, by synaptic connections, through which activity from one neuron is transmitted to another. It is well accepted that synaptic connections can differ in their efficiency of transmitting activity. Because of this, the brain is not only a vector of neuronal elements. It can also be seen as a vector of synaptic elements, with values corresponding to the efficiency of the individual synapses. The values of the elements of the synaptic vector change much more slowly than the values of the neuronal vector. It is widely assumed that long term memory is stored in this vector and that learning means changing the values of some elements of this vector. If the neuronal vector represents things, then what does the synaptic vector represent? Most researchers assume that it represents associations between the representations of things.
Although the relationship of connectionist models to the brain's neural networks is far from clear, and although many connectionist researchers explicitly deny being involved in a serious attempt at modelling the brain, there are some undeniable similarities between natural and artificial neural networks. Just like natural neural nets, connectionist networks consist of a number of relatively simple elements connected to each other. The main thing the elements do is compute an activation value out of input arriving from some elements and transmit this activation value to other elements. Also like in natural neural networks, the connections have transmission efficiency values (often called "weights") associated with them. These weight values are often changed during learning in a connectionist network.
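What a single element does can be sketched in a few lines of Python. The weighted sum followed by a logistic squashing function is one common choice, assumed here for illustration; individual connectionist models differ in the activation function they use, and the names below are hypothetical.

```python
import math

def unit_activation(inputs, weights, bias=0.0):
    """Activation of one unit: weighted sum of its inputs, squashed into (0, 1)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# Activities arriving from three sending units and the weights on the
# three connections (all values are made up for the example).
incoming = [1.0, 0.0, 1.0]
weights = [0.8, -0.5, 0.3]

print(unit_activation(incoming, weights))
```

Learning, in this picture, amounts to changing the numbers in `weights` while the rule for computing the activation stays the same.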
Several questions can be asked at this point:
How about the OUTPUT layer? There, we often find distributed representations, too. For example, in the NETtalk model of Sejnowski and Rosenberg (1987) - a network for mapping text input into phonemic output - the phonemes in the output layer have a distributed representation: each phonemic feature is represented by one output layer unit, and each phoneme is represented as a binary activity pattern (activities of 1's and 0's) in the output layer, corresponding to the features that are present (activity of 1) and absent (activity of 0) in a particular phoneme.
However, there are many successful connectionist models which do NOT have a distributed representation in the output layer. One of the best known models with localist output layer representations is the already mentioned Interactive Activation Model by McClelland and Rumelhart (1981). In that model, each output layer unit corresponds to a (4 letter) word. Two other, more recent models with localist output layers - successfully applied to the modelling of different aspects of human category learning - stem from Gluck and Bower (1988) and from Kruschke (1990). A particular category is represented in these models by one particular output layer unit. Connectionist networks in Artificial Intelligence (AI) also often have localist representations in the output layer. For example, the "expert system" network by Bounds (1989), trained to diagnose "lower back pain" (LBP), has an output layer unit for each of four categories of LBP. "Topological map" type networks (e.g. Kohonen, 1988) have groups of output layer units as representations for categories of input patterns. Although there can be an overlap in the representations, generally a group of units representing a particular category does not take part in the representation of other categories: because of this, output layer representations in topological map networks are more localist than distributed. In "competitive learning" type networks - of which the topological map networks are a subset - there is lateral inhibition among the output layer units that allows, after a settling down (relaxation) process, only one or a few of the output layer units to be active; the activity in the rest of those units is suppressed by the inhibition coming from the active units.
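The settling process under lateral inhibition can be sketched as follows. The update rule (each unit loses a fixed fraction of the summed activity of all other units, with activity clipped at zero) is a generic winner-take-all scheme assumed for illustration; it is not the specific rule of any of the models cited above, and the inhibition strength is an arbitrary value.

```python
def settle(activities, inhibition=0.2, max_steps=100):
    """Let the units inhibit one another until the activities stop changing.

    The inhibition strength should stay below 1 / (number of units - 1);
    otherwise all units may be driven to zero.
    """
    a = list(activities)
    for _ in range(max_steps):
        total = sum(a)
        new = [max(0.0, ai - inhibition * (total - ai)) for ai in a]
        if new == a:  # no further change: the network has settled
            break
        a = new
    return a

initial = [0.5, 0.9, 0.7]
final = settle(initial)
active = [i for i, ai in enumerate(final) if ai > 0]
print(active)  # [1]: only the initially most active unit remains active
```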
In general, in contrast to representations in the input layer, those in the output layers of connectionist models are probably not more often distributed than localist.
Many connectionist models contain - in addition to an input and an output layer - a HIDDEN layer, as well. In fact, the ability to train the connections between the input layer and the hidden layer - acquired since the mid-1980s through the discovery of "backpropagation" type learning rules - played an important part in the greatly increased interest in connectionist modelling. In backpropagation, we do not want to determine representations in the hidden layer. We want the network to do this - and the network typically creates highly distributed representations. There are many models, both in AI and in Psychology, that have a hidden layer with distributed representations (e.g. Sejnowski and Rosenberg, 1987; McCloskey and Cohen, 1989; Cohen, Dunbar, and McClelland, 1990, etc.). In fact, for many people, distributed representation is probably almost equivalent to hidden layer representation. There are, however, quite a number of connectionist models without hidden layers or with hidden layers but without distributed representations in them.
One well known example of models without a hidden layer is the above mentioned topological map network of Kohonen. In fact, most models with an "unsupervised" learning algorithm - a type of learning in which the network is left to itself in determining the representations of pattern categories in the output layer - like the ART model of Grossberg (Carpenter and Grossberg, 1988), one of the competitive learning models described in Rumelhart and Zipser (1986), and also Kohonen (1988), etc., have only two layers of units. The prototypical "supervised" learning model with only two layers is the perceptron. Not many two layer perceptron models have been used within AI since the discovery of backpropagation, but some have been applied to psychological modelling. One example of this is the above mentioned model by Gluck and Bower (1988); another is a network that models the learning of the past tense of English verbs (Rumelhart and McClelland, 1986).
There are some models with a hidden layer, but with localist representations in that layer. One example is the Interactive Activation Model: between the input layer - containing units representing letter features - and the output layer - containing units that represent words - there is a layer of units each of which represents a particular letter (in one of four positions). In another model - the "counterpropagation" network (Hecht-Nielsen, 1987) - the hidden layer is trained by a Kohonen/competitive learning type algorithm, with the result that each unit represents a different category. Another example is the three layer competitive learning network presented by Rumelhart and Zipser (1986). In this model both the hidden layer and the output layer are trained by the same unsupervised learning algorithm. Again, each unit in the hidden layer represents a different input pattern category. A further example is the ALCOVE model by Kruschke (1990). In ALCOVE, the hidden layer contains an often large number of units. These units differ from the hidden layer units in many other networks in that they have a "radial basis" activation function, in contrast to the more standard "sigmoid" activation function. In this network, any particular input pattern category activates only a small group of the hidden layer units, and those hidden layer units respond only to patterns belonging to that particular category - that is, we have, much as in the topological map network, a sort of quasi-localist representation.
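The contrast between the two kinds of activation function can be sketched as follows. The Gaussian form used for the radial basis unit is a generic textbook choice assumed here for illustration; the exact function used in ALCOVE may differ in detail, and all names and parameter values are hypothetical.

```python
import math

def sigmoid_unit(x, weights, bias=0.0):
    """Responds to every input lying far enough on one side of a boundary."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-net))

def radial_basis_unit(x, centre, width=1.0):
    """Responds only to inputs close to the unit's centre (Gaussian assumed)."""
    dist_sq = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-dist_sq / (2.0 * width ** 2))

near = [0.0, 0.0]
far = [5.0, 5.0]

print(radial_basis_unit(near, centre=[0.0, 0.0]))  # 1.0 at the centre
print(radial_basis_unit(far, centre=[0.0, 0.0]))   # close to 0
print(sigmoid_unit(far, weights=[1.0, 1.0]))       # close to 1
```

A radial basis unit is silent for most of the input space, so a given input activates only the few units whose centres lie nearby. A sigmoid unit stays active over an unbounded region of the input space, so many different inputs share it.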
Summarising the discussion of representations in the hidden layer, one can say that - although quite a number of connectionist models have hidden layers with distributed representations in them - there are also quite a few recent models either with no hidden layer at all or with a hidden layer in which representation is localistic to a large degree.
"Mass action", "graceful degradation", and "equipotentiality" are often taken to be signs of distributed representation in the brain. However, we would expect these characteristics also to be present in the sort of quasi-localist representation mentioned above, if the neurons taking part in the representation of one particular memory are randomly distributed over the cortex (Feldman, 1981).
"Mass action" and "equipotentiality" could also be demonstrated in several connectionist models (e.g. Wood, 1978; Sejnowski and Rosenberg, 1987). Sejnowski and Rosenberg (1987) "lesioned" NETtalk - a network with a hidden layer and distributed input and output layer representations - by changing its learned connection weights. "Mass action" and "graceful degradation" in the lesion experiments with NETtalk might have had to do with distributed representations in that network. But a hidden layer and distributed representations in it are not necessary for "mass action" and "graceful degradation" to occur: the network model of Wood (1978), that had no hidden layer and had distributed input and output layer representations, showed these features after "lesioning" (removing units from representations).
Another, in this case real, advantage of distributed representations is that - as illustrated at the beginning of this paper - they are economical: they need far fewer elements (units) than localist representations do.
The degree of distributedness of representation in a connectionist network can affect its ability to model Psychological phenomena. One good example of this is the controversy about "catastrophic interference" in connectionist networks of the backpropagation type with a hidden layer (McCloskey and Cohen, 1989; Ratcliff, 1990). "Catastrophic interference" means that the network can quickly forget much of the knowledge that it has built up during previous training, if it is trained to acquire new knowledge while not "rehearsing" the old knowledge at the same time. This sounds like forgetting by interference - one of the main ideas about how forgetting occurs in humans - but actually the networks forget much faster and more profoundly than can be expected in humans. Optimally, there should be some forgetting by the interference of new learning with old knowledge, but it should be much less than observed by McCloskey and Cohen (1989) and by Ratcliff (1990).
This problem has to do with distributed representation, and it happens even in very small networks without a hidden layer, as demonstrated by McCloskey and Cohen, if there is distributed representation in the input layer. The problem is, naturally, exacerbated in three-layer networks with distributed representations in the hidden layer.
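The dependence of interference on overlapping input patterns can be illustrated with a very small network: one linear output unit, no hidden layer, trained with the delta rule. Everything in the sketch below (the patterns, the targets of +1 and -1, the learning rate and the number of epochs) is an assumption made for illustration and is not taken from McCloskey and Cohen; it shows only the direction of the effect, not its size in the cited studies.

```python
def train(weights, patterns, targets, lr=0.1, epochs=500):
    """Delta rule for a single linear output unit; returns the new weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, t in zip(patterns, targets):
            out = sum(wi * xi for wi, xi in zip(w, x))
            err = t - out
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def sse(weights, patterns, targets):
    """Summed squared error of the unit on a set of patterns."""
    return sum((t - sum(wi * xi for wi, xi in zip(weights, x))) ** 2
               for x, t in zip(patterns, targets))

def forgetting(set_a, set_b, targets_a, targets_b):
    """Train on set A, then on set B alone; report the error on A both times."""
    w = train([0.0] * 4, set_a, targets_a)
    before = sse(w, set_a, targets_a)
    w = train(w, set_b, targets_b)  # set A is not rehearsed here
    after = sse(w, set_a, targets_a)
    return before, after

targets_a, targets_b = [1.0, 1.0], [-1.0, -1.0]

# Localist input: every item has its own input element.
local_a = [[1, 0, 0, 0], [0, 1, 0, 0]]
local_b = [[0, 0, 1, 0], [0, 0, 0, 1]]

# Distributed input: the items share input elements.
dist_a = [[1, 1, 0, 0], [0, 0, 1, 1]]
dist_b = [[1, 0, 1, 0], [1, 0, 0, 1]]

before_local, after_local = forgetting(local_a, local_b, targets_a, targets_b)
before_dist, after_dist = forgetting(dist_a, dist_b, targets_a, targets_b)

print(before_local, after_local)  # error on A stays near zero
print(before_dist, after_dist)    # error on A rises sharply after learning B
```

With localist input, learning set B changes only weights that set A never uses, so the knowledge about A is untouched. With distributed input the same weights serve both sets, and training on B alone overwrites what was learned about A, even though the four patterns could all be learned together.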
The trick for solving this problem, correspondingly, is to reduce the distributedness of representation in the network. What is really necessary, and in fact sufficient, is to let the network somehow build up separate representations in the hidden layer for different things the network learns. This is the essence of several of the solutions to the problem that I know of. The ALCOVE model (Kruschke, 1990) overcomes "catastrophic interference" because of the quasi-localist representation in its hidden layer. Sloman and Rumelhart (1992) let the network select - for each "episode" - a different subset of the hidden layer units, resulting, again, in a localisation of all memories that belonged to the episode. And I found that a network with no hidden layer was much less disturbed by "catastrophic interference" than networks with hidden layers (Schreter, 1993).
Bounds, D.G. (1989). Expert systems and connectionist networks. In Pfeifer, R., Schreter, Z., Fogelman-Soulie, F., & Steels, L. (Eds.). Connectionism in perspective. Amsterdam: Elsevier, 277-282.
Carpenter, G. & Grossberg, S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer, 21 (3), 77-88.
Churchland, P.S. & Sejnowski, T.J. (1992). The computational brain. Cambridge, MA: The MIT Press.
Cohen, J.D., Dunbar, K., & McClelland, J.L. (1990). On the control of automatic processes: A parallel distributed model of the Stroop effect. Psychological Review, 97, 3, 332-361.
Feldman, J.A. (1981). Memory and change in connection networks. University of Rochester, Department of Computer Science, TR96.
Gluck, M.A. & Bower, G.H. (1988). Evaluating an adaptive network model of human learning. Journal of Memory and Language, 27, 2, 166-195.
Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979-4984.
Kohonen, T. (1988b). The "neural" phonetic typewriter. Computer, 21 (3), 11-24.
Kruschke, J.K. (1990). ALCOVE: A connectionist model of category learning. Research Report 19, Cognitive Science Program, Indiana University.
Lashley, K. (1950). In search of the engram. Society of Experimental Biology, 1950, Symposium 4, 454-482.
McClelland, J.L., & Rumelhart, D.E. (1981). An interactive model of context effects in letter perception: Part 1. An account of basic findings. Psychological Review, 88, 375-407.
McCloskey, M. & Cohen, N.J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In Bower, G.H. (Ed.). The psychology of learning and motivation: Volume 23. New York: Academic Press.
Ratcliff, R. (1990). Connectionist models and recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97, No. 2, 285-308.
Rumelhart, D.E. & McClelland, J.L. (1986). On learning the past tenses of English verbs. In McClelland, J.L., Rumelhart, D.E., and the PDP Research Group (1986). Parallel Distributed Processing. vol. 2: Psychological and Biological Models. Cambridge: MIT-Press, 216-271.
Rumelhart, D.E. & Zipser, D. (1986). Feature discovery by competitive learning. In Rumelhart, D.E., McClelland, J.L. and the PDP Research Group (1986). Parallel Distributed Processing. vol.1, Foundations. Cambridge: MIT-Press, 151-193.
Schreter, Z. (1993). Modelling proactive and retroactive interference with connectionist networks. Proceedings of the Second Australian Cognitive Science Conference, Melbourne, 55-57.
Sejnowski, T.J., & Rosenberg, C.R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.
Sloman, S.A., & Rumelhart, D.E. (1992). Reducing interference in distributed memories through episodic gating. In Healy, A.F., Kosslyn, S.M., & Shiffrin, R.M. (Eds.) From learning theory to connectionist theory: Essays in honor of William K. Estes, Vol. 1, Hillsdale: Lawrence Erlbaum, p. 227-248.
Wood, C.C. (1978). Variations on a theme by Lashley: Lesion experiments on the neural model of Anderson, Silverstein, Ritz, and Jones. Psychological Review, 85, 6, 582-591.