P03
Session 1 (Thursday 12 January 2023, 15:30-17:30)
Humanoid robot-assisted speech audiometry for children: Automatic scoring via Kaldi-NL
To assess speech perception in noise, the digits-in-noise (DIN) test is widely used in speech audiometry for normal-hearing and hearing-impaired adults and children. The DIN test consists of a list of 24 digit-triplets, presented in noise at varying intensities. The test currently has two setups: (1) a clinical setup, in which clinicians run the test and score the participants’ spoken responses during the test, and (2) a setup in which participants manually enter the digits they heard using a keypad. We propose a third, alternative setup in which spoken responses are scored automatically using the Dutch instance of the automatic speech recognition (ASR) toolkit Kaldi, hereinafter referred to as Kaldi-NL. We previously evaluated the ASR with normal-hearing, native Dutch-speaking adults, and it showed high accuracy and a low word error rate (WER) when decoding the spoken responses. As a follow-up pilot study, we now evaluate our proposed system’s performance with children’s speech. The aim is to explore our system’s limitations, taking into account that children’s and adults’ speech differ in both prosody and articulation, and that Kaldi-NL is trained only on adult speech. Because the DIN test is inherently repetitive and children are known to have a shorter attention span than adults, and on the assumption that including a robot in repetitive tasks may make the experience more engaging and enjoyable, we also propose adding a NAO humanoid robot to our DIN-Kaldi-NL setup. Twenty-three normal-hearing, Dutch-speaking children aged 5 to 17 years participated in this study. The accuracy and WER of the ASR showed that our system did not perform as well with children as with adults, as expected given that the ASR model is trained only on adult speech; however, both accuracy and WER approached the adults’ values as the children’s age increased. Results also showed that, of the 552 triplets presented, 186 contained decoding errors, for a total of 289 errors.
From these results, it remains unclear to what extent these decoding errors affect the final DIN test score, the speech reception threshold (SRT). Further work therefore includes comparing the children’s DIN test SRTs obtained with our proposed setup to those acquired with the clinical setup, and exploring whether ASR performance can be improved either by retraining the ASR with children’s speech or by applying prosody modifications (for example, to pitch and speaking rate) to the recorded children’s speech before feeding it to the ASR.
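As an illustration of the two metrics reported above, the following is a minimal Python sketch, not the authors' scoring pipeline. It assumes that WER is the word-level edit distance between the presented and decoded digits divided by the number of presented digits, and that accuracy is the fraction of triplets decoded exactly. The abstract defines neither measure, and the function names and the example Dutch digit transcripts are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def score_triplets(references, hypotheses):
    """Return WER and exact-match triplet accuracy.

    Assumption: each item is a list of digit words, e.g. ["een", "twee", "drie"].
    """
    total_words = sum(len(r) for r in references)
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    exact = sum(r == h for r, h in zip(references, hypotheses))
    return {
        "wer": errors / total_words,
        "triplet_accuracy": exact / len(references),
    }


# Hypothetical example: two presented triplets, one decoded with a missing digit.
references = [["een", "twee", "drie"], ["vier", "vijf", "zes"]]
hypotheses = [["een", "twee", "drie"], ["vier", "vijf"]]
scores = score_triplets(references, hypotheses)
```

In this toy case one of six presented digits is lost and one of two triplets is decoded exactly. Whether such errors change the SRT depends on how the test adapts to each scored response, which is the open question raised above.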