
Auditory Scene Analysis via Emergent Synchrony

J. Devin McAuley
Department of Psychology
University of Queensland
devin@psy.uq.edu.au

Target Paper: DeLiang Wang, Primitive Auditory Segregation Based On Oscillatory Correlation, Cognitive Science (in press). Simulation: "Stream segregation with alternating high/low tones".

Introduction:

The cognitive task of the model

Auditory segregation (or auditory scene analysis) refers to the process by which listeners are able to separate an acoustic signal into its different sources. For example, listeners apply auditory scene analysis (ASA) to separate the voices of simultaneous speakers.

In studying the separation of sound sources, primitive ASA has been used to describe innate biases, and schema-based ASA has been used to emphasize the role of pattern learning. In the target paper, Wang addresses one aspect of primitive ASA, focusing on modelling the effects of pitch and time differences on the segmentation of tone sequences. The specific phenomenon addressed in the simulation is whether an alternating sequence of high (H) and low (L) tones, indicated here as HLHLHL, will be perceived as an integrated sequence or as two separate sequences, one consisting of high tones only and the other of low tones only. Large frequency velocities (the pitch difference divided by the time difference between the H and L tones) tend to induce streaming into separate sequences, although this effect is mediated by a number of additional factors, including whether the listener is instructed to integrate the sequence or to selectively attend to the high or low tones (van Noorden, 1975; Jones and Yee, 1993).
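As a rough illustration of the frequency-velocity measure, the sketch below divides a pitch difference by a time difference. The function name, the units (semitones and milliseconds), and the example values are assumptions chosen for illustration; they do not come from the target paper.

```python
def frequency_velocity(pitch_diff_semitones, time_diff_ms):
    """Pitch difference divided by time difference between adjacent H and L tones.

    Units (semitones per millisecond) are an assumption for this sketch.
    """
    if time_diff_ms <= 0:
        raise ValueError("time difference must be positive")
    return pitch_diff_semitones / time_diff_ms


# Hypothetical conditions: a wide, fast alternation versus a narrow, slow one.
wide_fast = frequency_velocity(pitch_diff_semitones=12, time_diff_ms=100)
narrow_slow = frequency_velocity(pitch_diff_semitones=2, time_diff_ms=400)
```

Under these assumed values, the wide and fast sequence has the larger frequency velocity and would be the one more likely to split into separate streams.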

The network

The architecture of the model is a two-dimensional network of relaxation oscillators with lateral excitatory connections and a global inhibitor. The two dimensions of the network correspond to frequency and time separation, with the strength of local excitation between oscillators being a function of the frequency and time "distance." The time dimension is implemented as a series of delay lines.
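One way to picture distance-dependent excitation is a weight that falls off smoothly as two oscillators move apart in frequency or time. The sketch below assumes a Gaussian fall-off; the function name, the Gaussian form, and every parameter value are illustrative assumptions rather than the definitions used in the target paper.

```python
import math


def coupling_weight(df, dt, sigma_f=2.0, sigma_t=4.0, w_max=1.0):
    """Hypothetical excitatory weight between two oscillators.

    df: separation in frequency channels (assumed unit)
    dt: separation in time slices (assumed unit)
    sigma_f, sigma_t, w_max: assumed fall-off widths and peak weight
    """
    return w_max * math.exp(-(df / sigma_f) ** 2 - (dt / sigma_t) ** 2)


# Oscillators close in frequency couple strongly; distant ones barely at all.
near = coupling_weight(df=1, dt=1)
far = coupling_weight(df=8, dt=1)
```

With a fall-off of this kind, nearby time-frequency events excite one another enough to synchronize, while widely separated events are left to the desynchronizing influence of the global inhibitor.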

The data and input/output task

The input to the model consists of NxM binary matrices with dimensions of frequency and time. Each input matrix represents an auditory sequence, with binary elements corresponding to time/frequency events that are either on or off. In simulating auditory streaming, sequences of alternating high and low tones are unrolled in time so that they can be represented as a time-frequency input matrix, similar to a primitive spectrogram. The network task is simply to respond to each input matrix, with active time-frequency events triggering the corresponding time-frequency oscillators.

How the input/output task addresses the cognitive task

Stream segregation is achieved via emergent synchrony between oscillators that are simultaneously active. A tone sequence forms a stream if and only if the time-frequency elements that represent that sequence trigger oscillators that become synchronized. Separation of the sequence into H- and L-tone streams occurs when the synchronized H-tone oscillators become out of phase with the synchronized L-tone oscillators. Oscillator synchrony is achieved through hard-wired excitatory connections between oscillators whose strength is a function of time and frequency distance.
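The synchrony criterion can be illustrated with a small readout that groups active oscillators by phase. This is only a sketch of how streams might be read off from the network state: the phase representation (a fraction of a cycle in [0, 1)), the tolerance, the greedy grouping, and the example phase values are all assumptions, not part of the target paper.

```python
def circular_distance(a, b):
    """Distance between two phases measured around the cycle (0 to 0.5)."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def group_by_phase(phases, tol=0.05):
    """Greedily group oscillators whose phases lie within tol of a group's first member.

    phases: dict mapping (frequency_channel, time_slice) to a phase in [0, 1).
    Returns a list of streams, each a list of oscillator keys.
    """
    streams = []  # each entry is (reference_phase, members)
    for key, phase in sorted(phases.items()):
        for ref, members in streams:
            if circular_distance(ref, phase) <= tol:
                members.append(key)
                break
        else:
            streams.append((phase, [key]))
    return [members for _, members in streams]


# Hypothetical phases: H-tone oscillators near 0.10, L-tone oscillators near 0.60.
segregated = group_by_phase({(12, 0): 0.10, (12, 1): 0.11,
                             (3, 3): 0.60, (3, 4): 0.61})
# Hypothetical phases: all active oscillators share roughly the same phase.
integrated = group_by_phase({(8, 0): 0.30, (8, 1): 0.31,
                             (7, 3): 0.29, (7, 4): 0.30})
```

In the first case the readout returns two groups, corresponding to separate H and L streams; in the second it returns a single group, corresponding to an integrated sequence.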

Time:

Time in the data

Time is represented in the data as an absolute measurement. The input sequences are unrolled for a pre-determined number of time steps, translating time into a spatial dimension; each time step corresponds to a 40-ms slice. For the H/L sequences, the tone duration is a fixed number of 40-ms slices, with a single 40-ms silent gap between tones. Changing the rate (tempo) of the input sequences is accomplished by modifying the tone duration; the silent gap between tones remains fixed.
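The unrolled input can be sketched as a binary matrix in which rows are frequency channels and columns are 40-ms slices. The function name, the matrix size, and the channel indices chosen for the H and L tones below are assumptions for illustration; only the 40-ms slice, the fixed number of slices per tone, and the single-slice gap come from the description above.

```python
SLICE_MS = 40  # duration of one time step, as described in the text


def alternating_sequence(n_freq=15, n_steps=30, high=12, low=3,
                         tone_slices=2, gap_slices=1):
    """Return an n_freq x n_steps binary matrix for an HLHL... tone sequence.

    high, low: assumed row indices of the H and L tones.
    tone_slices: tone duration in 40-ms slices (sets the tempo).
    gap_slices: silent slices between tones (fixed at 1 in the text).
    """
    matrix = [[0] * n_steps for _ in range(n_freq)]
    t = 0
    channel = high
    while t < n_steps:
        # Switch on the current tone for its duration, clipped at the matrix edge.
        for s in range(t, min(t + tone_slices, n_steps)):
            matrix[channel][s] = 1
        t += tone_slices + gap_slices
        channel = low if channel == high else high
    return matrix


fast = alternating_sequence(tone_slices=2)  # 80-ms tones
slow = alternating_sequence(tone_slices=5)  # 200-ms tones
```

Changing `tone_slices` while leaving `gap_slices` at 1 mirrors the way tempo is varied in the simulation: the tones lengthen or shorten, but the silent gap stays at one slice.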

Time in the processing

The input sequences are processed by the oscillator network via a series of implicit 40-ms delay lines, each corresponding to a specific 40-ms input slice along the frequency dimension. As a result, the input sequences are processed by the network all-at-once, in the same way that an ANN would process a static image; in fact, Terman and Wang (1995) used essentially the same model to segment a visual scene. The maximum number of time steps is hard-wired into the network architecture, limiting the length of the input sequences that can be processed. For example, in a 15x30 network, there are 15 frequency-specific oscillators for each of the 30 40-ms time delays, resulting in a maximum sequence duration of 1200 ms.
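The capacity limit follows directly from the arithmetic: the number of delay steps multiplied by 40 ms. The short sketch below reproduces the 1200-ms figure and, as an added illustration, counts how many whole tones fit when each tone is followed by a one-slice gap. The function names and the example tone durations are assumptions.

```python
SLICE_MS = 40  # duration of one delay step, as described in the text


def max_duration_ms(n_steps, slice_ms=SLICE_MS):
    """Longest sequence the network can hold, in milliseconds."""
    return n_steps * slice_ms


def max_complete_tones(n_steps, tone_slices, gap_slices=1):
    """Number of whole tones that fit; the final tone needs no trailing gap."""
    return (n_steps + gap_slices) // (tone_slices + gap_slices)


capacity = max_duration_ms(30)                        # 30 steps in a 15x30 network
tones_fast = max_complete_tones(30, tone_slices=2)    # 80-ms tones
tones_slow = max_complete_tones(30, tone_slices=4)    # 160-ms tones
```

For the 15x30 network this gives 1200 ms, and under the assumed tone durations only ten 80-ms tones or six 160-ms tones fit, which shows how quickly the fixed architecture limits the sequences that can be tested.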

Memory:

In processing auditory sequences, the Wang model maintains a memory of auditory-sequence events via the series of delay lines, so that sequences are processed by the oscillator network all-at-once. Stream segregation via oscillator synchrony is maintained only for as long as the input sequence persists. Moreover, auditory streams do not persist in memory in the absence of input, and stream formation does not rely on memory storage and retrieval. Because the model does not attempt to account for schema-based ASA, which depends on sequence learning, and instead applies only to primitive ASA, memory is not a key issue in the model.

Change:

There are two levels of change in the Wang model: change in the activations of the oscillators, and change in the weights that modulate the interaction between oscillators. These two types of change specify the dynamics of oscillator synchronization, and thus determine the formation of auditory streams. There is no sequence learning in the model. As a result, sequence processing is not influenced by the previous sequences that the model has been exposed to.

Structure:

In order to address the target area of structure, it is important to clarify what structure means for the cognitive task of the model. For ASA, structure refers to the segmentation of an auditory scene into its parts (or objects). The model represents the structure of an auditory scene by the input sequence of time-frequency events. In modelling listeners' segregation of auditory sequences into high- and low-tone streams, the primary question for the cognitive modeller is how well the oscillator-synchronization behavior of the model captures the auditory scene structure that listeners perceive for alternating sequences of high and low tones.

For the target simulation, there are four main results: (1) sequences presented at a fast rate with large frequency separation between the high and low tones result in the formation of high- and low-tone streams; (2) small frequency separation at the same rate results in an integrated sequence percept of alternating high and low tones; (3) small frequency separation at slow rates also results in an integrated sequence percept; and (4) large frequency separation at slow presentation rates results in each tone in the sequence forming a separate stream. The first three of these results are consistent with psychological data on streaming. The fourth result is not consistent with psychological data, and thus poses a problem for the model.

Although this simulation exhibits streaming phenomena that are qualitatively similar to those of listeners in some respects (results 1-3), the model paints an oversimplified picture of streaming by assuming that time and frequency variables are essentially independent. In contrast, stream segregation by listeners seems to be a joint function of frequency and time differences, described by Jones and Yee (1993) as a dependence on frequency motion. In addition, the length of sequences that the Wang model can process is constrained by the network architecture, and shortening the duration of the sequence tones results in silent gaps between tones that may prevent groups of oscillators from attaining synchrony, even in cases where listeners would perceive a coherent stream.

Discussion and Conclusions:

The task in this Case Study analysis has been to distill the functional components of Wang's oscillator network used for primitive auditory scene analysis by addressing the target areas of memory, time, change, and structure.

The power of Wang's oscillator network to model streaming lies in three primary components of the model. First, auditory sequences are translated into a spatial array via a series of delay lines, so that sequences of time-frequency events are processed all-at-once. Second, lateral excitatory connections between oscillators based on time and frequency proximity enable oscillators corresponding to time-frequency events to achieve synchrony if and only if they are sufficiently near in time and frequency. Third, the global inhibitor counteracts lateral excitation by desynchronizing oscillators. It is the competition between the synchronizing lateral connections and the desynchronizing global inhibitor that permits the model to form streams based on time and frequency proximity, providing a successful model of the basic phenomena. However, as discussed above, this model fails to capture significant aspects of listeners' streaming of alternating sequences of high and low tones, calling into question the basic assumptions that determine its computational power.

References

Jones, M. and Yee, W. (1993). Attending to Auditory Events: The Role of Temporal Organization. In McAdams, S. and Bigand, E., editors, Thinking in Sound: The Cognitive Psychology of Human Audition, pages 69-112. Oxford University Press.

Terman, D. and Wang, D.L. (1995). Global Competition and Local Cooperation in a Network of Neural Oscillators. Physica D 81, 148-176.

van Noorden, L. (1975). Temporal Coherence in the Perception of Tone Sequences. Ph.D. thesis, Eindhoven University of Technology, The Netherlands.