P10
Session 2 (Friday 13 January 2023, 09:00-11:00)
The Danish Sentence Test (DAST) corpus of audio and audio-visual recordings of sentences and monologues
A new, larger corpus of Danish sentences has been developed to facilitate the development of a new Danish Sentence Test (DAST). The corpus comprises audio and audio-visual recordings of 1200 linguistically balanced sentences, each spoken by two male and two female talkers who were either professional actors or experienced narrators. The sentences were constructed using a template-based method that allowed control over both word frequency and sentence structure, yielding balanced syntactic variation across sentences. The resulting written sentences were evaluated linguistically in terms of phonetic distributions and naturalness. For the phonetic assessment, each sentence was transcribed into phonemes (Brondsted, Automatic Phonemic Transcriber v1.3); the resulting distributions show that the template-based approach yields an overall phonetic distribution consistent with existing Danish corpora and, moreover, that the distributions across randomly created lists of sentences are reasonably consistent. For the naturalness assessment, the sentences were rated via an online questionnaire that asked participants (N=814) to judge a total of 30 sentences on (1) how natural they perceived each sentence to be, on a 7-point Likert scale, and (2) whether the sentence evoked a feeling of discomfort. Each participant read 20 DAST sentences and 10 filler sentences (i.e., sentences known to be of either good or bad quality), presented pseudo-randomly so that one random filler appeared after approximately every two randomly chosen DAST sentences. Finally, each of the audio and audio-visual recordings of the sentences from all four talkers was assessed qualitatively in terms of the phonetic quality of the recording (e.g., pronunciation and voice quality), the quality of the sound (e.g., presence of background noise), and the quality of the visual component (e.g., facial expression and movement). 
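The filler-interleaving scheme described above (20 DAST sentences and 10 fillers per participant, with one random filler after roughly every two randomly chosen DAST sentences) could be sketched as follows. This is a minimal illustration only; the function name, seeding, and exact placement rule are our assumptions, not details taken from the study.

```python
import random


def build_presentation_order(dast, fillers, seed=None):
    """Shuffle the DAST and filler sentences independently, then insert
    one filler after every two DAST sentences (an assumed reading of
    'approximately every two').  Any fillers left over are appended."""
    rng = random.Random(seed)
    dast = dast[:]          # copy so the caller's lists are untouched
    fillers = fillers[:]
    rng.shuffle(dast)
    rng.shuffle(fillers)

    order = []
    for i, sentence in enumerate(dast, start=1):
        order.append(sentence)
        if i % 2 == 0 and fillers:   # after every second DAST sentence
            order.append(fillers.pop())
    order.extend(fillers)            # leftovers (none for 20 + 10)
    return order


# With 20 DAST sentences and 10 fillers this yields a 30-item list in
# which every third item is a filler.
dast_items = [f"dast{i}" for i in range(20)]
filler_items = [f"filler{i}" for i in range(10)]
presentation = build_presentation_order(dast_items, filler_items, seed=1)
```

With these counts the schedule is strictly "two DAST, one filler"; a looser pseudo-random rule (e.g., jittering the insertion point) would also match the description in the abstract.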
Besides the sentences, the corpus also includes more than 30 minutes of monologue recordings from each of the same four talkers, comprising both spontaneous speech and reading aloud from a book. The resulting corpus will be essential for the development of new audio and audio-visual speech-in-noise tests that enable more extensive, well-controlled experimental designs for speech-intelligibility research than have previously been possible. The corpus and its associated tests will also facilitate novel research directions into audio-visual integration and will be useful for machine-learning applications that target the Danish language.
Acknowledgements: Lise Bruun Hansen, Jens Bo Nielsen, Sofie Bundgaard, Amal Abdulqadir Ali, Pernille Holtegaard, Michael Nielsen, Laura Balling, Tobias Andersen, Tobias May, Filip Rønne, David Harbo Jordell, Jens Hjortkjær, and Torsten Dau