14th Speech in Noise Workshop, 12-13 January 2023, Split, Croatia

P17 Session 1 (Thursday 12 January 2023, 15:30-17:30)
AVbook, a high-frame-rate corpus of narrative audiovisual speech for investigating multimodal speech perception

Enrico Varano
Imperial College London, UK

Tobias Reichenbach
Friedrich-Alexander-University Erlangen-Nuremberg, Germany

Background: Seeing a speaker's face can help substantially in understanding them, in particular in challenging listening conditions. Research into the neurobiological mechanisms behind audiovisual integration has recently begun to employ continuous natural speech. However, these efforts are impeded by a lack of high-quality audiovisual recordings of a speaker narrating a longer text. Here we seek to close this gap by developing AVbook, an audiovisual speech corpus designed for cognitive neuroscience studies and audiovisual speech recognition. We present the corpus along with behavioural and electroencephalography (EEG) experiments validating its suitability.

Corpus: The corpus consists of 3.6 hours of audiovisual recordings of two speakers, one male and one female, reading 59 passages from a narrative English text. The recordings were acquired at a high frame rate of 119.88 frames per second. The corpus includes a set of multiple-choice questions to test attention to the different passages, as well as a short written summary of each recording. To enable audiovisual synchronisation when presenting the stimuli, four videos of an electronic clapperboard were recorded alongside the corpus. We verified the efficacy of the multiple-choice questions in a pilot study, and we describe a method for using the electronic clapperboard for synchronisation, together with the corresponding results.
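The synchronisation procedure itself is described with the corpus; purely as an illustration of how a clapperboard recording can be used to measure an audio-video offset, the sketch below (hypothetical function and variable names, numpy only) locates the visual event as the largest jump in mean frame brightness and the audio event as the loudest transient, and reports the time difference between them.

```python
import numpy as np

def estimate_av_offset(frame_brightness, frame_rate, audio, audio_rate):
    """Estimate the audio-video offset (seconds) from a clapperboard recording.

    frame_brightness : 1-D array of mean brightness per video frame
    audio            : 1-D array of audio samples
    (Illustrative only; not the corpus authors' exact procedure.)
    """
    # Frame index of the largest brightness change (clapperboard closing/flash).
    visual_event = np.argmax(np.abs(np.diff(frame_brightness))) + 1
    t_video = visual_event / frame_rate

    # Sample index of the loudest transient (clapperboard click).
    audio_event = np.argmax(np.abs(audio))
    t_audio = audio_event / audio_rate

    # Positive value: the audio event lags the visual event.
    return t_audio - t_video
```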

Behavioural and EEG Study: We demonstrate the utility of the corpus in an EEG paradigm, using the material to study the integration of the temporal and categorical cues carried by the visual component of speech. In particular, we examine the effect of talking faces, and of simplified versions of these visual stimuli, on the comprehension of speech in noise.

Results: Behaviourally, we find that visual signals need to contain information beyond the speech envelope to yield a speech-in-noise benefit, with the largest enhancement provided by the natural signals. Using the EEG data, we further demonstrate that the speech-in-noise benefit is linked to the audiovisual gain in the cortical tracking of the speech envelope in the delta frequency range (word rate), but not in the theta frequency range (syllabic rate).
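The full analysis pipeline is given in the accompanying paper; as a minimal sketch of the kind of computation involved, the Python snippet below (assuming numpy and scipy; band edges are illustrative, roughly 1-4 Hz for the delta/word-rate band and 4-8 Hz for the theta/syllabic-rate band) extracts a broadband speech envelope and isolates the two frequency bands for comparison with the EEG, e.g. via temporal response functions or cross-correlation.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def speech_envelope(audio, fs):
    """Broadband amplitude envelope of a speech signal via the Hilbert transform."""
    return np.abs(hilbert(audio))

def bandpass(signal, fs, low_hz, high_hz, order=3):
    """Zero-phase Butterworth band-pass filter."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

# Example usage (audio: 1-D sample array, fs: sampling rate in Hz):
# envelope  = speech_envelope(audio, fs)
# delta_env = bandpass(envelope, fs, 1.0, 4.0)   # word-rate band
# theta_env = bandpass(envelope, fs, 4.0, 8.0)   # syllabic-rate band
```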

Conclusions: We present a publicly available corpus to support research into the neurobiology of audiovisual speech processing, as well as the development of computer algorithms for audiovisual speech recognition. Employing this corpus in a speech-in-noise EEG paradigm, we provide evidence for a role of the cortical tracking of words in audiovisual speech comprehension.
