P34
Session 2 (Friday 13 January 2023, 09:00-11:00)
Biophysically-inspired end-to-end time-domain speech enhancement
Deep neural network (DNN) speech enhancement approaches have recently achieved strong performance. Numerous applications benefit from speech enhancement models, including automatic speech recognition (ASR) and hearing aids. The majority of previous methods were developed in the time-frequency (T-F) domain. However, the T-F domain approach has several limitations, including a high minimum delay when reconstructing the signal from the T-F representation, poor generalizability to unseen noise, and poor performance at negative signal-to-noise ratios (SNRs). To address these problems, we propose a biophysically inspired end-to-end time-domain neural network that adopts bio-inspired features from CoNNear, a neural network that accurately simulates biophysical properties of the human auditory system such as sharp and level-dependent filter tuning.
We generated biophysical speech features using CoNNear and used these as input to a U-Net-based speech enhancement module. The latter module consisted of the generator network from SERGAN (Baby and Verhulst, 2019), without the discriminator. For training we used the INTERSPEECH 2021 DNS Challenge dataset. An objective evaluation was performed using perceptual evaluation of speech quality (PESQ), segmental SNR (segSNR), cepstral distance (CD) and log-likelihood ratio (LLR) on unseen samples from the DNS Challenge, whose noise scenarios differed from those seen during training.
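To illustrate one of the objective metrics above, the following is a minimal sketch of a segmental SNR computation. The frame length and per-frame clipping bounds are common choices in the literature, not values taken from this work:

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, snr_min=-10.0, snr_max=35.0):
    """Frame-wise SNR (dB) between a clean reference and an enhanced signal,
    clipped per frame and averaged -- one common segSNR definition.
    NOTE: frame_len and the clipping bounds are illustrative assumptions."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        residual = s - e  # enhancement error treated as noise
        snr = 10.0 * np.log10((np.sum(s ** 2) + 1e-12) /
                              (np.sum(residual ** 2) + 1e-12))
        snrs.append(float(np.clip(snr, snr_min, snr_max)))
    return float(np.mean(snrs))
```

A perfectly enhanced signal saturates at the upper clipping bound, while heavily distorted output is floored at the lower bound, so the averaged score stays in a bounded, comparable range across utterances.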
Objective evaluations revealed that the bio-inspired features perform comparably to T-F features at positive SNRs, with improved generalizability at negative SNRs and for mismatched noises. Additionally, our time-domain CoNNear features reduced the minimum latency of the whole system to around 4 ms, making it suitable for real-time applications with strict constraints on signal delay. The good generalizability in adverse and unseen noise conditions, together with the low latency of our DNN-based model, shows promise for application in hearing aids.
Funding: Research supported by FWO project G063821N (Machine Hearing 2.0).