Recent research has demonstrated that adults successfully segment two interleaved artificial speech streams with incongruent statistics (i.e., streams whose combined statistics are noisier than the encapsulated statistics) only when provided with an indexical cue of speaker voice. In a series of five experiments, our study explores whether learners can utilise visual information to encapsulate statistics for each speech stream. We initially presented learners with incongruent artificial speech streams produced by the same female voice, together with a visual display. Learners successfully segmented both streams when the audio stream was presented with an indexical cue of talking faces (Experiment 1). This learning cannot be attributed to the presence of the talking face display alone, as a single face paired with a single input stream did not improve segmentation (Experiment 2). Additionally, participants failed to segment two streams when they were paired with a synchronised single talking face display (Experiment 3). Likewise, learners failed to segment both streams when the visual indexical cue lacked audio-visual synchrony, such as changes in background screen colour (Experiment 4) or a static face display (Experiment 5). We end by discussing the possible relevance of the speaker's face in speech segmentation and bilingual language acquisition.
All Science Journal Classification (ASJC) codes
- Experimental and Cognitive Psychology
- Language and Linguistics
- Linguistics and Language