Colin Cherry Award 2020
Handling classroom babble noise in an automatic speech recognition-powered educational application
Background: Within the Lalilo reading assistant for 5-7-year-old children learning to read, designed for classroom usage, we develop an automatic speech recognition (ASR) system that aims to accurately transcribe children’s read speech and detect potential reading mistakes. In addition to the challenges inherent to recognizing young reading learners (extended frequency ranges, high intra-speaker variability, slow reading rate, and presence of disfluencies), the system needs to cope with the noise environment of classrooms.
Methods: Babble noise is among the most complex types of noise because suppressing it requires the ability to distinguish the target speaker’s speech from that of background speakers. In this work, we apply multi-condition training to a state-of-the-art phoneme recognition model to improve its robustness to classroom-typical babble noise. This method consists of mixing the speech training recordings with babble noise at different signal-to-noise ratios (SNR) to teach the system to handle noise. Our model follows a Transformer encoder-decoder architecture. We use transfer learning to cope with the lack of data, training a source model on 150 hours of the Common Voice adult read speech dataset, and fine-tuning it on our small in-house Lalilo child read speech dataset (13 hours). Multi-condition training is applied at the fine-tuning stage, using two noise datasets: (1) the DEMAND noise dataset, from which we selected only babble noise recordings; (2) our in-house Lali-noise dataset, containing real-life classroom recordings. While the first contains adult babble noise with constant volume and distribution, the second is characterized by irregular child babble noise and other classroom noises. The evaluation set comprises recordings with SNRs varying between -10 and 50 dB, with a mean SNR of 23.8 ± 10.9 dB.
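The abstract does not give the exact mixing procedure, but the core of multi-condition data preparation can be sketched as follows: the noise recording is looped or cropped to the speech length, then rescaled so that the speech-to-noise power ratio matches a target SNR before the two signals are summed. This is a minimal illustration assuming NumPy arrays of audio samples; the function name and signature are hypothetical, not from the paper.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix `noise` into `speech` at a target SNR (in dB).

    Sketch of one multi-condition training step: loop/crop the noise to the
    speech length, then scale it so that
    10 * log10(speech_power / noise_power) == snr_db.
    """
    # Repeat the noise if it is shorter than the speech, then crop.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale factor k such that speech_power / (k^2 * noise_power) = 10^(snr_db/10).
    k = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + k * noise
```

In a training pipeline, each clean child-speech recording would be mixed with a randomly chosen noise segment at an SNR drawn from the training range, so the model sees the same utterances under many noise conditions.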
Results: Our results show that multi-condition training with babble noise indeed improves the system’s robustness to noise. We observe a global reduction of the phoneme error rate (PER) that is larger with the Lali-noise dataset than with the DEMAND dataset. In particular, we observe a substantial PER reduction on high-noise (SNR < 10 dB) and medium-noise (10 dB < SNR < 25 dB) recordings when using the Lali-noise dataset. Conversely, using the DEMAND dataset gives better performance on low-noise (SNR > 25 dB) recordings, at the cost of slightly degrading recognition on high-noise recordings. These results lead us to conclude that training the model in conditions that most closely resemble the inference noise environment is the most effective way to improve ASR noise robustness.
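For reference, the PER reported above is the standard edit-distance metric: the minimum number of phoneme substitutions, insertions, and deletions needed to turn the hypothesis into the reference, normalized by the reference length. A minimal dynamic-programming sketch (the function name is illustrative, not from the paper):

```python
def phoneme_error_rate(ref: list, hyp: list) -> float:
    """Levenshtein distance between reference and hypothesis phoneme
    sequences, normalized by the reference length (standard PER)."""
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that substitutes one phoneme in a three-phoneme reference yields a PER of 1/3.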